discrete_feature

Classes

DiscreteFeature(name, affinity, is_null)

Representation of a discrete data feature.

class DiscreteFeature(name: str, affinity: str, is_null: bool)

Representation of a discrete data feature.

name

The name of the feature.

affinity

The SQLite3 type affinity of the feature.

Type

{“NUMERIC”, “INTEGER”, “REAL”, “TEXT”, “BLOB”}

is_null

True if feature data can be null, False otherwise.

compare_feature_info(sample_data: pandas.core.frame.DataFrame, simulation_data: pandas.core.frame.DataFrame) → float

Uses KL-divergence to compare discrete features.

Parameters
  • sample_data – Loaded sample data.

  • simulation_data – Loaded tumor data.

Returns

Results of the KL-divergence comparison, keyed by category.
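As a rough illustration of what a KL-divergence comparison between two categorical samples involves, here is a minimal sketch. It is not the library's implementation: the function name `kl_divergence`, the use of plain lists instead of DataFrames, and the `eps` smoothing term are all assumptions made for this example.

```python
import math
from collections import Counter


def kl_divergence(sample, simulation, eps=1e-12):
    """Hypothetical helper: KL(sample || simulation) over category frequencies."""
    # Use the union of categories so both distributions share one support.
    categories = sorted(set(sample) | set(simulation))
    p_counts, q_counts = Counter(sample), Counter(simulation)

    # Relative frequencies, with a small eps (an assumption of this sketch)
    # so that categories missing from one side do not produce log(0).
    p = [p_counts[c] / len(sample) + eps for c in categories]
    q = [q_counts[c] / len(simulation) + eps for c in categories]

    # Renormalise after smoothing so each distribution sums to 1.
    p_total, q_total = sum(p), sum(q)
    p = [x / p_total for x in p]
    q = [x / q_total for x in q]

    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

Under these assumptions, two samples with identical category frequencies give a divergence of 0, and the value grows as the frequencies drift apart.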

compare_feature_stat(sample_data: pandas.core.frame.DataFrame, simulation_data: pandas.core.frame.DataFrame) → Union[Dict[str, Any], float]

Uses statistical tests to compare discrete features.

Uses a hypergeometric test to compare a discrete feature between the sample and true distributions. The hypergeometric distribution describes the probability of k successes in N draws, without replacement, from a finite population of size M that contains exactly n objects counted as successes.

Parameters
  • sample_data – Loaded sample data.

  • simulation_data – Loaded tumor data.

Returns

Results of the statistical tests, keyed by category.
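The hypergeometric quantities described above can be sketched with the standard library alone. This is an illustrative example, not the method's actual code: the function names, the use of plain lists, and the choice of an upper-tail probability P(X >= k) as the reported statistic are assumptions. The symbols follow the description: M is the population size, n the number of population members in the category, N the sample size, and k the number of sample members in the category.

```python
from math import comb


def hypergeom_pmf(k, M, n, N):
    """P(X = k): exactly k successes in N draws without replacement."""
    # Outside the support the probability is zero.
    if k < max(0, N - (M - n)) or k > min(n, N):
        return 0.0
    return comb(n, k) * comb(M - n, N - k) / comb(M, N)


def hypergeom_upper_tail(k, M, n, N):
    """P(X >= k): chance of seeing at least k successes (assumed test statistic)."""
    return sum(hypergeom_pmf(i, M, n, N) for i in range(k, min(n, N) + 1))


def compare_by_category(sample, population):
    """Hypothetical helper: upper-tail probability for each category."""
    M, N = len(population), len(sample)
    return {
        category: hypergeom_upper_tail(
            sample.count(category), M, population.count(category), N
        )
        for category in sorted(set(population))
    }
```

For instance, with a population of four "a" and six "b" values and a sample of two "a" and one "b", `compare_by_category` gives 1/3 for "a" and 29/30 for "b", one entry per category.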

feature_type = 'discrete'

Type of the feature.

Type

string

static get_count(data: list, category: str) → int

Returns the number of categories of the feature.

Parameters
  • data – Loaded data.

  • category – Categories of data.

Returns

Number of categories of the feature.

write_feature_data(data_list: list, sample_data: pandas.core.frame.DataFrame, simulation_data: pandas.core.frame.DataFrame) → List[Any]

Uses KL-divergence to compare discrete features.

Parameters
  • data_list – List of data in analysis table.

  • sample_data – Loaded sample data.

  • simulation_data – Loaded tumor data.

Returns

List of data needed for analysis dataframe.