cellmaps_hierarchyeval package

Submodules

cellmaps_hierarchyeval.analysis module

class cellmaps_hierarchyeval.analysis.Assembly(node_id=None, gene_names=None)[source]

Bases: object

Represents assembly in a hierarchy

Constructor

Parameters:

node_id (int) – Id of hierarchy node
gene_names (list) – list of gene names

get_assembly_name()[source]

Gets name of assembly

get_gene_names()[source]

Gets gene names

Returns:

get_node_id()[source]

Gets node id

Returns:

set_assembly_name()[source]

Sets assembly name

class cellmaps_hierarchyeval.analysis.FakeGeneSetAgent(random_seed=None, attribute_name_prefix=None)[source]

Bases: GenesetAgent

Fake geneset agent that generates random numbers for values

Constructor

Parameters: random_seed

annotate_gene_set(gene_names=None)[source]

Parameters: gene_names
Returns:

class cellmaps_hierarchyeval.analysis.GenesetAgent(attribute_name_prefix=None)[source]

Bases: object

Represents a Gene set analysis agent whose job is to consume a list of gene names and return a term name, confidence score, and analysis

Constructor

GENE_SET_TOKEN = 'GENE_SET'

annotate_gene_set(gene_names=None)[source]

Should be implemented by subclasses

Parameters: gene_names – gene symbols
Returns:
Return type: tuple

get_attribute_name_prefix()[source]

Gets suggested attribute name prefix

Returns:
Return type: str
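As a rough illustration of the subclass contract described above, the sketch below defines a stand-in base class and a toy agent. It does not import the package: GenesetAgentSketch and SizeScoringAgent are hypothetical names, raising NotImplementedError in the base is an assumption, and the shape of the returned tuple (a 'name (score)' string plus free text) is modeled on the return value documented for the Ollama agents.

```python
class GenesetAgentSketch:
    """Stand-in mirroring the documented GenesetAgent interface (hypothetical)."""

    GENE_SET_TOKEN = 'GENE_SET'

    def __init__(self, attribute_name_prefix=None):
        self._attribute_name_prefix = attribute_name_prefix

    def get_attribute_name_prefix(self):
        return self._attribute_name_prefix

    def annotate_gene_set(self, gene_names=None):
        # Assumption: the real base class may signal this differently.
        raise NotImplementedError('subclasses should implement')


class SizeScoringAgent(GenesetAgentSketch):
    """Toy agent that scores a gene set by its size (illustration only)."""

    def annotate_gene_set(self, gene_names=None):
        genes = list(gene_names or [])
        score = min(len(genes) / 10.0, 1.0)
        name = 'toy process (' + str(score) + ')'
        analysis = 'saw ' + str(len(genes)) + ' genes'
        return (name, analysis)


agent = SizeScoringAgent(attribute_name_prefix='toy::')
result = agent.annotate_gene_set(gene_names=['TP53', 'MDM2', 'ATM'])
```

A real agent would replace the size-based score with a call to an annotation backend.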

class cellmaps_hierarchyeval.analysis.Hierarchy(hierarchy=None, interactome=None, ndex_username=None, ndex_password=None)[source]

Bases: object

Represents an assembly of proteins in a Hierarchy

Constructor

Parameters:

hierarchy (CX2Network) – Hierarchy
interactome (CX2Network) – Parent interactome
ndex_username – NDEx username to use when connecting to NDEx to obtain interactomes from hierarchy
ndex_password (str) – NDEx password to use when connecting to NDEx to obtain interactomes from hierarchy

get_next_assembly()[source]

Generator that gets next assembly in hierarchy

Returns:
Return type: Assembly

class cellmaps_hierarchyeval.analysis.OllamaCommandLineGeneSetAgent(prompt=None, model='llama2:latest', ollama_binary='/usr/local/bin/ollama', attribute_name_prefix=None)[source]

Bases: GenesetAgent

Runs an LLM by invoking the ollama command-line binary

Constructor

Parameters: prompt (str) – Prompt to pass to the LLM. Put @@GENE_SET@@ in the prompt to mark where the gene set should be inserted. If None, the default internal prompt is used.
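The placeholder substitution described for the prompt parameter can be sketched as a plain string replacement. This is only an illustration: fill_prompt is a hypothetical helper, and joining gene names with a comma is an assumption, since the separator the agent actually uses is not documented here.

```python
GENE_SET_PLACEHOLDER = '@@GENE_SET@@'


def fill_prompt(prompt, gene_names, separator=','):
    """Replace the placeholder with the joined gene names.

    The separator is an assumption made for this sketch.
    """
    return prompt.replace(GENE_SET_PLACEHOLDER, separator.join(gene_names))


filled = fill_prompt('Name the process shared by: @@GENE_SET@@',
                     ['TP53', 'MDM2'])
```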

DEFAULT_PROMPT_FILE = 'default_prompt.txt'

annotate_gene_set(gene_names=None)[source]

Using prompt passed in via constructor, this call invokes the LLM specified by model set in constructor

Parameters: gene_names (list) – Genes to analyze
Raises: CellmapshierarchyevalError – If LLM failed to run
Returns: (‘process name (score)’, full output from LLM)
Return type: tuple

static get_default_prompt()[source]

Gets default prompt stored with this package

Returns:
Return type: str

get_prompt()[source]: Gets prompt used by this agent :return:

class cellmaps_hierarchyeval.analysis.OllamaRestServiceGenesetAgent(prompt=None, model='llama2:latest', username=None, password=None, rest_url=None, temperature=0, max_tokens=1000, seed=42, attribute_name_prefix=None, max_retries=5, timeout=120, retry_wait=10)[source]

Bases: GenesetAgent

Calls LLM via REST service. Derived from ServerModel_LLM in https://github.com/idekerlab/agent_evaluation llm.py

Constructor

Parameters:

prompt (str) – Prompt to send to LLM
model (str) – Name of model
username (str) – Username to send via Basic Auth to service
password (str) – Password to send via Basic Auth to service
rest_url (str) – URL for service, should end with api/generate
temperature
max_tokens
seed
attribute_name_prefix
max_retries (int) – Number of times to retry failed query
timeout (int or float) – Time in seconds to wait for response from service
retry_wait (int or float) – Time in seconds to wait between retries for failed query
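How max_retries and retry_wait might interact can be sketched with a generic retry loop. This is not the package's implementation: call_with_retries is a hypothetical helper, RuntimeError stands in for CellmapshierarchyevalError, and treating max_retries as the number of attempts after the first one is an assumption.

```python
import time


def call_with_retries(query, max_retries=5, retry_wait=10):
    """Call query() until it succeeds or the retries are exhausted."""
    last_error = None
    # Assumption: one initial attempt plus up to max_retries retries.
    for attempt in range(max_retries + 1):
        try:
            return query()
        except Exception as err:
            last_error = err
            if attempt < max_retries:
                # Pause before the next attempt.
                time.sleep(retry_wait)
    raise RuntimeError('query failed after retries') from last_error


calls = {'count': 0}


def flaky_query():
    """Fails twice, then succeeds (simulates a transient outage)."""
    calls['count'] += 1
    if calls['count'] < 3:
        raise ValueError('service unavailable')
    return 'ok'


outcome = call_with_retries(flaky_query, max_retries=5, retry_wait=0)
```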

annotate_gene_set(gene_names=None)[source]

Using prompt passed in via constructor, this call invokes the LLM specified by model set in constructor

Parameters: gene_names (list) – Genes to analyze
Raises: CellmapshierarchyevalError – If LLM failed to run
Returns: (‘process name (score)’, full output from LLM)
Return type: tuple

get_prompt()[source]: Gets prompt used by this agent :return:

cellmaps_hierarchyeval.perturb module

class cellmaps_hierarchyeval.perturb.PerturbSeqAnalysis(hierarchy, hierarchy_parent=None)[source]

Bases: object

Contains utilities to compare Perturbation data against hierarchy passed in via constructor

Constructor

Parameters:

hierarchy (CX2Network)
hierarchy_parent (CX2Network)

static compare_cluster_root_similarities(cluster_functional_data_similarity, root_functional_data_similarity)[source]

Performs a rank-sum test to compare the distribution of functional data similarity scores between a specific cluster and gene pairs in root. This test helps determine if the similarity scores in the cluster are statistically significantly greater than those in the root.

Parameters:

cluster_functional_data_similarity (numpy.array) – An array of similarity scores within a specific cluster.
root_functional_data_similarity (list) – A list of non-NaN similarity scores for gene pairs not directly related in the root.

Returns:

A tuple containing the test statistic and the p-value of the rank-sum test.

Return type:

(float, float)
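To make the one-sided comparison concrete, here is a minimal standard-library sketch of a Wilcoxon rank-sum test using the normal approximation, with the alternative that the cluster scores tend to be greater than the root scores. Whether the package uses this exact test variant (for example scipy.stats.ranksums) is an assumption; rank_sum_greater is a hypothetical name and no tie correction is applied to the variance.

```python
import math


def rank_sum_greater(cluster_scores, root_scores):
    """Return (z statistic, one-sided p-value) for cluster > root."""
    n1, n2 = len(cluster_scores), len(root_scores)
    pooled = sorted(
        (value, idx)
        for idx, value in enumerate(list(cluster_scores) + list(root_scores))
    )
    ranks = [0.0] * (n1 + n2)
    pos = 0
    while pos < len(pooled):
        end = pos
        # Tied values share the average of the ranks they span.
        while end + 1 < len(pooled) and pooled[end + 1][0] == pooled[pos][0]:
            end += 1
        avg_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[pooled[k][1]] = avg_rank
        pos = end + 1
    rank_sum = sum(ranks[:n1])
    expected = n1 * (n1 + n2 + 1) / 2.0
    std = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (rank_sum - expected) / std
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p_value


cluster = [0.8, 0.9, 0.85, 0.95, 0.7]
root = [0.1, 0.2, 0.3, 0.15, 0.25, 0.05]
statistic, p_value = rank_sum_greater(cluster, root)
```

A small p-value here would suggest that gene pairs inside the cluster are more similar than unrelated pairs under the root.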

get_cluster_similarity(functional_data_similarity, hier_system_node_id)[source]

Retrieves the upper triangle similarity scores for genes within a specific cluster of a hierarchy. The scores are extracted from a DataFrame that contains scaled cosine similarity scores for genes that overlap between communities direct to root and Perturb-seq data.

Parameters:

functional_data_similarity (pandas.DataFrame) – A DataFrame of scaled cosine similarity scores for overlapping genes in communities direct to root and Perturb-seq data.
hier_system_node_id (int) – The identifier for a specific node within a hierarchy.

Returns:

An array of similarity scores from the upper triangle portion of the matrix for the specified cluster.

Return type:

numpy.array
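The upper-triangle extraction can be illustrated on a small symmetric matrix. This sketch uses nested lists rather than a DataFrame, upper_triangle is a hypothetical helper, and excluding the diagonal is an assumption.

```python
def upper_triangle(matrix):
    """Return the values above the diagonal, row by row."""
    size = len(matrix)
    return [matrix[i][j] for i in range(size) for j in range(i + 1, size)]


similarity = [
    [1.0, 0.2, 0.3],
    [0.2, 1.0, 0.4],
    [0.3, 0.4, 1.0],
]
pair_scores = upper_triangle(similarity)
```

Taking only one triangle means each gene pair is counted once.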

get_heatmap_for_given_hierarchy_system(hier_system_node_id, perturbseq_df, num_perturb_seq=25)[source]

Given the id of a system in the hierarchy (hier_system_node_id) and Perturb-seq data (perturbseq_df), creates a heatmap of the num_perturb_seq most variable Perturb-seq proteins.

This is done by filtering perturbseq_df for rows that match genes in given system and then keeping num_perturb_seq most variable columns

Parameters:

hier_system_node_id (int) – node id system to analyze
perturbseq_df (pandas.DataFrame)
num_perturb_seq (int)

Returns:

heat map table

Return type:

pandas.DataFrame
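The filter-then-rank step described above can be sketched without pandas, using a dict of rows keyed by gene. The helper name most_variable_columns is hypothetical, and using population variance as the measure of variability is an assumption.

```python
from statistics import pvariance


def most_variable_columns(table, system_genes, num_columns=25):
    """Keep rows for system_genes, then the num_columns most variable columns."""
    rows = {gene: vals for gene, vals in table.items() if gene in system_genes}
    columns = sorted({col for vals in rows.values() for col in vals})
    # Assumption: variability is measured by population variance.
    variances = {
        col: pvariance([rows[gene][col] for gene in rows]) for col in columns
    }
    keep = sorted(columns, key=lambda col: variances[col], reverse=True)
    keep = keep[:num_columns]
    return {gene: {col: rows[gene][col] for col in keep} for gene in rows}


perturb_table = {
    'A': {'p1': 0.0, 'p2': 1.0, 'p3': 5.0},
    'B': {'p1': 0.1, 'p2': 3.0, 'p3': 5.0},
    'C': {'p1': 9.0, 'p2': 9.0, 'p3': 9.0},
}
heatmap = most_variable_columns(perturb_table, {'A', 'B'}, num_columns=2)
```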

static get_root_functional_data_similarity(functional_data_similarity, overlap_root_pairs)[source]

Extracts and returns a list of functional similarity scores for gene pairs that are not in the same community, based on a filtered upper triangle extraction of the similarity matrix (this ensures that only unique, non-redundant gene pair comparisons are considered).

Parameters:

functional_data_similarity (pandas.DataFrame) – A DataFrame of scaled cosine similarity scores for overlapping genes in communities direct to root and Perturb-seq data.
overlap_root_pairs (pandas.DataFrame) – A DataFrame of root-associated similarity scores, filtered to only include overlapping genes. A score of 0 indicates a direct relation (same community) and scores greater than 0 indicate no direct relation

Returns:

A list of non-NaN similarity scores for gene pairs that are not directly related.

Return type:

list
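A minimal sketch of the filtering described above, using nested dicts in place of DataFrames. The helper name unrelated_pair_scores is hypothetical; the rule that a root score greater than 0 marks an unrelated pair follows the parameter description.

```python
import math


def unrelated_pair_scores(similarity, root_pairs, genes):
    """Collect non-NaN scores for unique pairs not in the same community."""
    scores = []
    for i, gene_a in enumerate(genes):
        # Only look at genes after gene_a so each pair is visited once.
        for gene_b in genes[i + 1:]:
            value = similarity[gene_a][gene_b]
            if root_pairs[gene_a][gene_b] > 0 and not math.isnan(value):
                scores.append(value)
    return scores


genes = ['A', 'B', 'C']
similarity = {
    'A': {'A': 1.0, 'B': 0.9, 'C': 0.2},
    'B': {'A': 0.9, 'B': 1.0, 'C': float('nan')},
    'C': {'A': 0.2, 'B': float('nan'), 'C': 1.0},
}
root_pairs = {
    'A': {'A': 0, 'B': 0, 'C': 1},
    'B': {'A': 0, 'B': 0, 'C': 1},
    'C': {'A': 1, 'B': 1, 'C': 0},
}
root_scores = unrelated_pair_scores(similarity, root_pairs, genes)
```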

get_root_gene_pair_similarities()[source]

Calculates similarity scores between gene pairs in the root node of a hierarchy. Genes in the same community linked to the root node are marked with a similarity of 0, indicating they are directly related, while all other pairs are set to 1, suggesting no direct relation.

Returns: A DataFrame with genes as both rows and columns, populated with similarity scores.
Return type: pandas.DataFrame
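The 0/1 encoding can be illustrated with a small sketch that builds the matrix as nested dicts. The input format (a mapping from community name to member genes) and the helper name root_pair_similarities are assumptions for illustration; treating a gene paired with itself as "same community" is also an assumption.

```python
def root_pair_similarities(communities):
    """Return matrix[a][b] = 0 if a and b share a community, else 1."""
    genes = sorted({g for members in communities.values() for g in members})
    membership = {
        gene: {name for name, members in communities.items() if gene in members}
        for gene in genes
    }
    return {
        a: {b: 0 if membership[a] & membership[b] else 1 for b in genes}
        for a in genes
    }


pairs = root_pair_similarities({'c1': ['A', 'B'], 'c2': ['C']})
```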

static get_root_overlapping_pair_similarities(root_pairs, perturbseq_df)[source]

Get similarity scores from perturbseq_df that match genes attached to the root node of the hierarchy

Parameters:

root_pairs (pandas.DataFrame) – A DataFrame representing similarity scores between all genes in the root node, where genes within the same community connected to the root have a score of 0, indicating direct relation, and all other pairs have a score of 1, indicating no direct relation.
perturbseq_df (pandas.DataFrame)

Returns:

A tuple containing:

- A DataFrame of scaled cosine similarity scores for overlapping genes in communities direct to root and Perturb-seq data.
- A DataFrame of root-associated similarity scores, filtered to only include overlapping genes.

Return type:

tuple

cellmaps_hierarchyeval.cellmaps_hierarchyevalcmd module

cellmaps_hierarchyeval.cellmaps_hierarchyevalcmd.get_model_prompt_from_string(o_prompt)[source]

Given the argument from the --ollama_prompts flag, extracts the model and prompt, which can be in the following formats:

Where <MODEL> will always just be a string, but <PROMPT> can be a string or a path to a file

Parameters: o_prompt (str) – argument passed to --ollama_prompts
Returns: model, prompt
Return type: tuple

cellmaps_hierarchyeval.cellmaps_hierarchyevalcmd.get_ollama_geneset_agents(ollama='/usr/local/bin/ollama', ollama_prompts=None, username=None, password=None)[source]

Parses ollama_prompts from argparse and creates geneset agents

Parameters:

ollama (str) – Path to ollama binary or REST service
ollama_prompts (list)

Returns:

cellmaps_hierarchyeval.cellmaps_hierarchyevalcmd.main(args)[source]

Main entry point for program

Parameters: args (list) – arguments passed on the command line, usually sys.argv[1:]
Returns: return value of cellmaps_hierarchyeval.runner.CellmapshierarchyevalRunner.run() or 2 if an exception is raised
Return type: int

cellmaps_hierarchyeval.exceptions module

exception cellmaps_hierarchyeval.exceptions.CellmapshierarchyevalError[source]

Bases: Exception

Base exception for cellmaps_hierarchyeval

cellmaps_hierarchyeval.runner module

class cellmaps_hierarchyeval.runner.BaseNetworkHelper(hierarchy_path)[source]

Bases: object

Base class for network helpers.

Constructor.

Parameters: hierarchy_path (str) – File system path where the hierarchy network data is stored.

get_hierarchy_input_file()[source]

Creates file path prefix for hierarchy

Example path: /tmp/foo/hierarchy

Returns: Prefix path on filesystem where Hierarchy Network resides
Return type: str

class cellmaps_hierarchyeval.runner.CORUM_EnrichmentTerms(terms=None, term_name=None, hierarchy_genes=None, min_comp_size=4)[source]

Bases: EnrichmentTerms

This class extends the EnrichmentTerms class to handle terms specific to CORUM.

Constructor. Sets the parameters and initializes the term genes.

Parameters:

terms (NiceCXNetwork or None) – The terms to be processed.
term_name (str or None) – Name of the term.
hierarchy_genes (list or None) – Genes in the hierarchy.
min_comp_size (int) – Minimum number of genes in a term for it to be considered.

class cellmaps_hierarchyeval.runner.CX2NetworkHelper(hierarchy_path)[source]

Bases: BaseNetworkHelper

Helper class for CX2 network data manipulation that extends the BaseNetworkHelper class with CX2-specific logic.

Constructor.

Parameters: hierarchy_path (str) – File system path where the CX2 hierarchy network data is stored.

static dump_to_file(hierarchy, hierarchy_out_file)[source]

Save the hierarchy to a CX2 formatted JSON file.

Parameters:

hierarchy (CX2Network) – The hierarchy to save.
hierarchy_out_file (str) – The file path where the hierarchy should be written.

static get_format()[source]

Get string format identifier for CX2 network data.

Returns: The format identifier for CX2.
Return type: str

get_hierarchy()[source]

Create and return a CX2 network object from the hierarchy path.

Returns: An instance of the CX2Network class.
Return type: CX2Network

static get_hierarchy_real_ids(hierarchy=None, hierarchy_size=None)[source]

Retrieve the real identifiers of nodes within the hierarchy.

Parameters:

hierarchy (CX2Network) – The hierarchy from which to extract node IDs.
hierarchy_size – Not used, but specified for compatibility.

Returns:

A list of node identifiers.

Return type:

list

static get_node_genes(_, node=None)[source]

Extract the gene identifiers from a given node.

Parameters:

_ – Placeholder, not used.
node (dict) – The node from which to extract gene identifiers.

Returns:

A list of gene identifiers.

Return type:

list
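A rough sketch of pulling gene identifiers out of a CX2-style node dict follows. Two things here are assumptions rather than documented behavior: that node attributes live under the 'v' key, and that members are stored as a space-delimited string in an attribute named 'CD_MemberList' (the CDAPS convention). The helper name node_genes is hypothetical.

```python
def node_genes(node, member_attribute='CD_MemberList'):
    """Split the space-delimited member attribute of a node into a list."""
    # Assumption: attributes are under 'v' and members are space-delimited.
    members = node.get('v', {}).get(member_attribute, '')
    return [gene for gene in members.split(' ') if gene]


example_node = {'id': 7, 'v': {'name': 'C7', 'CD_MemberList': 'TP53 MDM2 ATM'}}
genes_in_node = node_genes(example_node)
```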

static get_nodes(hierarchy)[source]

Retrieve the nodes from the hierarchy.

Parameters: hierarchy (CX2Network) – The hierarchy from which to retrieve nodes.
Returns: A dictionary of nodes.
Return type: dict

static get_suffix()[source]

Get the file suffix associated with CX2 files.

Returns: The suffix for CX2 file types.
Return type: str

static write_as_nodelist(hierarchy, dest_path)[source]

Write the nodes of the hierarchy to a specified file path as a tab-delimited list.

Parameters:

hierarchy (CX2Network) – The hierarchy containing the nodes to write.
dest_path (str) – The destination file path for the nodelist.

class cellmaps_hierarchyeval.runner.CellmapshierarchyevalRunner(outdir=None, hierarchy_dir=None, min_comp_size=4, max_fdr=0.05, min_jaccard_index=0.1, corum='633291aa-6e1d-11ef-a7fd-005056ae23aa', go_cc='6722d74d-6e20-11ef-a7fd-005056ae23aa', hpa='68c2f2c0-6e20-11ef-a7fd-005056ae23aa', ndex_server='http://www.ndexbio.org', geneset_agents=None, name=None, organization_name=None, project_name=None, input_data_dict=None, skip_term_enrichment=False, skip_logging=True, provenance_utils=<cellmaps_utils.provenance.ProvenanceUtil object>, geneset_annotator=<cellmaps_hierarchyeval.runner.GeneSetAgentAnnotator object>, provenance=None)[source]

Bases: object

Class to run Hierarchy evaluation

Constructor

Parameters:

outdir (str) – Output directory where results will be written
hierarchy_dir (str) – Directory containing the hierarchy network (output of cellmaps_generate_hierarchy)
min_comp_size (int) – Minimum number of genes required to evaluate a node or term (default: 4)
max_fdr (float) – Maximum adjusted p-value (FDR) to consider an enrichment result significant (default: 0.05)
min_jaccard_index (float) – Minimum Jaccard index required for an enrichment result to be accepted (default: 0.1)
corum (str) – UUID of the CORUM dataset on NDEx for enrichment comparison
go_cc (str) – UUID of the GO Cellular Component dataset on NDEx
hpa (str) – UUID of the Human Protein Atlas dataset on NDEx
ndex_server (str) – NDEx server URL to fetch enrichment datasets from (default: http://www.ndexbio.org)
geneset_agents (list or None) – Optional list of GeneSetAgent instances for gene set annotation
name (str) – Optional name to assign to this evaluation run
organization_name (str) – Optional name of the organization running the tool
project_name (str) – Optional name of the project to associate with this analysis
input_data_dict (dict) – Dictionary of input arguments, used for provenance tracking and command-line logging
skip_term_enrichment (bool) – If True, disables built-in CORUM, GO_CC, and HPA term enrichment
skip_logging (bool) – If True disables logging, otherwise writes logs to output directory
provenance_utils (cellmaps_utils.provenance.ProvenanceUtil) – ProvenanceUtil object to use for FAIRSCAPE registration
geneset_annotator (GeneSetAgentAnnotator) – Object for applying GeneSetAgent annotations to hierarchy nodes

provenance (dict) – Optional provenance dictionary if RO-Crate metadata is unavailable. Example:

{
    'name': 'Example input dataset',
    'organization-name': 'CM4AI',
    'project-name': 'Example'
}

CORUM = '633291aa-6e1d-11ef-a7fd-005056ae23aa'

GO_CC = '6722d74d-6e20-11ef-a7fd-005056ae23aa'

HPA = '68c2f2c0-6e20-11ef-a7fd-005056ae23aa'

MAX_FDR = 0.05

MIN_COMP_SIZE = 4

MIN_JACCARD_INDEX = 0.1

NDEX_SERVER = 'http://www.ndexbio.org'

generate_readme()[source]

get_annotated_hierarchy_as_nodelist_dest_file()[source]

Creates file path prefix for hierarchy

Example path: /tmp/foo/hierarchy

Returns: Prefix path on filesystem to write Hierarchy Network
Return type: str

get_annotated_hierarchy_dest_file()[source]

Creates file path prefix for hierarchy

Example path: /tmp/foo/hierarchy

Returns: Prefix path on filesystem to write Hierarchy Network
Return type: str

get_hierarchy_parent_network_dest_file()[source]

Creates file path prefix for hierarchy parent network

Example path: /tmp/foo/hierarchy_parent

initialize_hierarchy_helper()[source]

Initializes hierarchy helper which will be used to call custom methods depending on whether the input was in CX or CX2 format.

Returns:
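One way a format check like this could work is to look for a hierarchy file with a known suffix in the input directory. The sketch below is a guess at the idea, not the runner's logic: detect_hierarchy_file is a hypothetical helper, and the 'hierarchy' prefix with '.cx2' and '.cx' suffixes (tried in that order) are assumptions.

```python
import os
import tempfile


def detect_hierarchy_file(hierarchy_dir, prefix='hierarchy',
                          suffixes=('.cx2', '.cx')):
    """Return (suffix, path) for the first hierarchy file that exists."""
    for suffix in suffixes:
        candidate = os.path.join(hierarchy_dir, prefix + suffix)
        if os.path.isfile(candidate):
            return suffix, candidate
    raise FileNotFoundError('no hierarchy file found in ' + hierarchy_dir)


with tempfile.TemporaryDirectory() as tmpdir:
    # Create an empty placeholder file to detect.
    open(os.path.join(tmpdir, 'hierarchy.cx2'), 'w').close()
    found_suffix, found_path = detect_hierarchy_file(tmpdir)
```

The returned suffix could then select between a CX2 helper and a CX helper.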

run()[source]

Evaluates CM4AI Hierarchy

Returns:

class cellmaps_hierarchyeval.runner.EnrichmentResult(term=None, pval=None, jaccard_index=None, overlap_genes=None)[source]

Bases: object

Base class for representing the results of enrichment analysis. It generates a hierarchy that is output in the CX format following the CDAPS style.

Constructor

Parameters:

term (str) – The term name.
pval (float) – P-value of the enrichment result.
jaccard_index (float) – Jaccard index of the enrichment result.
overlap_genes (list) – List of overlapping genes.

set_accepted(min_jaccard_index, max_fdr)[source]

Sets the accepted status of the enrichment result based on Jaccard index and FDR criteria.

Parameters:

min_jaccard_index (float) – Minimum required Jaccard index for the result to be accepted.
max_fdr (float) – Maximum allowed adjusted p-value (FDR) for the result to be accepted.
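The two acceptance criteria can be sketched as follows. The Jaccard index is the standard intersection-over-union of two gene sets; treating both thresholds as inclusive is an assumption, and jaccard_index and is_accepted are hypothetical helper names.

```python
def jaccard_index(genes_a, genes_b):
    """Size of the intersection divided by size of the union."""
    set_a, set_b = set(genes_a), set(genes_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def is_accepted(jaccard, adjusted_pval, min_jaccard_index=0.1, max_fdr=0.05):
    """Accept when overlap is large enough and the FDR is small enough."""
    # Assumption: both comparisons are inclusive.
    return jaccard >= min_jaccard_index and adjusted_pval <= max_fdr


overlap_score = jaccard_index(['A', 'B', 'C', 'D'], ['C', 'D', 'E'])
accepted = is_accepted(overlap_score, adjusted_pval=0.01)
```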

set_adjusted_pval(adjusted_pval)[source]

Sets the adjusted p-value for the enrichment result.

Parameters: adjusted_pval (float) – Adjusted p-value.
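For context on where an adjusted p-value might come from, the sketch below applies the Benjamini-Hochberg procedure to a list of raw p-values. The correction method this package actually uses is not stated here, so Benjamini-Hochberg is an assumption chosen because it is a common way to control FDR; benjamini_hochberg is a hypothetical helper name.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    count = len(pvals)
    order = sorted(range(count), key=lambda i: pvals[i])
    adjusted = [0.0] * count
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is
    # capped by the ones ranked above it.
    for rank in range(count, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * count / rank)
        adjusted[idx] = running_min
    return adjusted


adjusted_pvals = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Each adjusted value could then be passed to set_adjusted_pval on the matching result.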

set_description(description)[source]

Sets the description of the enrichment term results.

Parameters: description (str) – Description for the term results.

class cellmaps_hierarchyeval.runner.EnrichmentTerms(terms=None, term_name=None, hierarchy_genes=None, min_comp_size=4)[source]

Bases: object

Base class for implementations that generate term databases for enrichment (e.g., HPA, CORUM, GO)

Constructor

Parameters:

terms (NiceCXNetwork or None) – The terms to be processed.
term_name (str or None) – Name of the term.
hierarchy_genes (list or None) – Genes in the hierarchy.
min_comp_size (int) – Minimum number of genes in a term for it to be considered.
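The role of hierarchy_genes and min_comp_size can be sketched as a simple filter over a term-to-genes mapping. Restricting each term to genes present in the hierarchy before applying the size threshold is an assumption, and filter_terms is a hypothetical helper, not a method of this class.

```python
def filter_terms(term_genes, hierarchy_genes, min_comp_size=4):
    """Keep terms with at least min_comp_size genes found in the hierarchy."""
    universe = set(hierarchy_genes)
    kept = {}
    for term, genes in term_genes.items():
        # Assumption: genes outside the hierarchy are dropped first.
        overlap = set(genes) & universe
        if len(overlap) >= min_comp_size:
            kept[term] = overlap
    return kept


terms = {'t1': ['A', 'B', 'C', 'D', 'X'], 't2': ['A', 'B']}
usable_terms = filter_terms(terms, ['A', 'B', 'C', 'D', 'E'], min_comp_size=4)
```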

class cellmaps_hierarchyeval.runner.GO_EnrichmentTerms(terms=None, term_name=None, hierarchy_genes=None, min_comp_size=4)[source]

Bases: EnrichmentTerms

This class extends the EnrichmentTerms class to handle terms specific to Gene Ontology (GO).

Constructor. Sets the parameters and initializes the term genes and term description.

Parameters:

terms (NiceCXNetwork or None) – The terms to be processed.
term_name (str or None) – Name of the term.
hierarchy_genes (list or None) – Genes in the hierarchy.
min_comp_size (int) – Minimum number of genes in a term for it to be considered.

class cellmaps_hierarchyeval.runner.GeneSetAgentAnnotator[source]

Bases: object

Annotates hierarchy with results from one or more GeneSetAgent objects

Constructor

annotate_hierarchy(geneset_agent=None, hierarchy=None)[source]

Annotates hierarchy with GeneSetAgent by adding new node attributes

Parameters:

geneset_agent
hierarchy

set_hierarchy_helper(hierarchy_helper)[source]

Sets HierarchyHelper

Parameters: hierarchy_helper
Returns:

set_minimum_comparison_size(val)[source]

Only examine genesets of size val or larger

Parameters: val (int)

class cellmaps_hierarchyeval.runner.HPA_EnrichmentTerms(terms=None, term_name=None, hierarchy_genes=None, min_comp_size=4)[source]

Bases: EnrichmentTerms

This class extends the EnrichmentTerms class to handle terms specific to the Human Protein Atlas (HPA).

Constructor

Parameters:

terms (NiceCXNetwork or None) – The terms to be processed.
term_name (str or None) – Name of the term.
hierarchy_genes (list or None) – Genes in the hierarchy.
min_comp_size (int) – Minimum number of genes in a term for it to be considered.

class cellmaps_hierarchyeval.runner.HiDeF_EnrichmentTerms(terms=None, term_name=None, hierarchy_genes=None, min_comp_size=4)[source]

Bases: EnrichmentTerms

This class extends the EnrichmentTerms class to handle terms specific to HiDeF output.

Constructor. Sets the parameters and initializes the term genes.

Parameters:

terms (NiceCXNetwork or None) – The terms to be processed.
term_name (str or None) – Name of the term.
hierarchy_genes (list or None) – Genes in the hierarchy.
min_comp_size (int) – Minimum number of genes in a term for it to be considered.

class cellmaps_hierarchyeval.runner.NiceCXNetworkHelper(hierarchy_path)[source]

Bases: BaseNetworkHelper

Helper class for NiceCX network data manipulation that extends the BaseNetworkHelper class with CX-specific logic.

Constructor.

Parameters: hierarchy_path (str) – File system path where the NiceCX hierarchy network data is stored.

static dump_to_file(hierarchy, hierarchy_out_file)[source]

Save the hierarchy to a CX formatted JSON file.

Parameters:

hierarchy (ndex2.nice_cx_network.NiceCXNetwork) – The hierarchy to save.
hierarchy_out_file (str) – The file path where the hierarchy should be written.

static get_format()[source]

Get the string format identifier for CX data.

Returns: The format identifier for NiceCX.
Return type: str

get_hierarchy()[source]

Create and return a NiceCXNetwork object from the hierarchy path.

Returns: An instance of the NiceCX network class.
Return type: ndex2.nice_cx_network.NiceCXNetwork

static get_hierarchy_real_ids(hierarchy=None, hierarchy_size=None)[source]

Generate a list of real IDs for a given hierarchy size.

Parameters:

hierarchy – Not used, provided for compatibility.
hierarchy_size (int) – The size of the hierarchy to generate IDs for.

Returns:

A list of sequential integers representing node IDs.

Return type:

list

static get_node_genes(hierarchy=None, node=None)[source]

Extract the set of gene identifiers from a given node in the hierarchy.

Parameters:

hierarchy (ndex2.nice_cx_network.NiceCXNetwork) – The hierarchy containing the node.
node (int) – The node from which to extract gene identifiers.

Returns:

A set of gene identifiers.

Return type:

set

static get_nodes(hierarchy)[source]

Retrieve the nodes from the hierarchy.

Parameters: hierarchy (ndex2.nice_cx_network.NiceCXNetwork) – The hierarchy from which to retrieve nodes.
Returns: A dictionary of nodes.
Return type: dict

static get_suffix()[source]

Get the file suffix associated with CX files.

Returns: The suffix for NiceCX file types.
Return type: str

static write_as_nodelist(hierarchy, dest_path)[source]

Write the nodes of the hierarchy to a specified file path as a tab-delimited list.

Parameters:

hierarchy (ndex2.nice_cx_network.NiceCXNetwork) – The hierarchy containing the nodes to write.
dest_path (str) – The destination file path for the nodelist.

Module contents

Top-level package for cellmaps_hierarchyeval.