cellmaps_hierarchyeval package

Submodules

cellmaps_hierarchyeval.analysis module

class cellmaps_hierarchyeval.analysis.Assembly(node_id=None, gene_names=None)[source]

Bases: object

Represents assembly in a hierarchy

Constructor

Parameters:
  • node_id (int) – Id of hierarchy node

  • gene_names (list) – list of gene names

get_assembly_name()[source]

Gets name of assembly

get_gene_names()[source]

Gets gene names

Returns:

get_node_id()[source]

Gets node id

Returns:

set_assembly_name()[source]

Sets assembly name

class cellmaps_hierarchyeval.analysis.FakeGeneSetAgent(random_seed=None, attribute_name_prefix=None)[source]

Bases: GenesetAgent

Fake geneset agent that generates random numbers for values

Constructor

Parameters:

random_seed

annotate_gene_set(gene_names=None)[source]

Parameters:

gene_names

Returns:

class cellmaps_hierarchyeval.analysis.GenesetAgent(attribute_name_prefix=None)[source]

Bases: object

Represents a gene set analysis agent that consumes a list of gene names and returns a term name, confidence score, and analysis

Constructor

GENE_SET_TOKEN = 'GENE_SET'

annotate_gene_set(gene_names=None)[source]

Should be implemented by subclasses

Parameters:

gene_names – gene symbols

Returns:

Return type:

tuple

get_attribute_name_prefix()[source]

Gets suggested attribute name prefix

Returns:

Return type:

str
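A subclass only needs to override annotate_gene_set. The sketch below uses a local stand-in for GenesetAgent so that it runs without the package installed. The stand-in class, the SizeAgent class, and the layout of the returned tuple are illustrative assumptions, not the package's implementation.

```python
class GenesetAgentStandIn:
    """Local stand-in mirroring the documented GenesetAgent interface."""

    GENE_SET_TOKEN = 'GENE_SET'

    def __init__(self, attribute_name_prefix=None):
        self._attribute_name_prefix = attribute_name_prefix

    def get_attribute_name_prefix(self):
        return self._attribute_name_prefix

    def annotate_gene_set(self, gene_names=None):
        raise NotImplementedError('Should be implemented by subclasses')


class SizeAgent(GenesetAgentStandIn):
    """Hypothetical agent that labels a gene set by its size."""

    def annotate_gene_set(self, gene_names=None):
        genes = sorted(set(gene_names or []))
        term = f'{len(genes)}-gene assembly'
        # Tuple layout is an assumption; the docs only state that a tuple is returned
        return term, ', '.join(genes)


agent = SizeAgent(attribute_name_prefix='size_agent::')
result = agent.annotate_gene_set(gene_names=['TP53', 'MDM2', 'TP53'])
```

With the real package, the same override would presumably be written against cellmaps_hierarchyeval.analysis.GenesetAgent instead of the stand-in.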

class cellmaps_hierarchyeval.analysis.Hierarchy(hierarchy=None, interactome=None, ndex_username=None, ndex_password=None)[source]

Bases: object

Represents an assembly of proteins in a Hierarchy

Constructor

Parameters:
  • hierarchy (CX2Network) – Hierarchy

  • interactome (CX2Network) – Parent interactome

  • ndex_username – NDEx username to use when connecting to NDEx to obtain interactomes from hierarchy

  • ndex_password (str) – NDEx password to use when connecting to NDEx to obtain interactomes from hierarchy

get_next_assembly()[source]

Generator that gets next assembly in hierarchy

Returns:

Return type:

Assembly

class cellmaps_hierarchyeval.analysis.OllamaCommandLineGeneSetAgent(prompt=None, model='llama2:latest', ollama_binary='/usr/local/bin/ollama', attribute_name_prefix=None)[source]

Bases: GenesetAgent

Runs

Constructor

Parameters:

prompt (str) – Prompt to pass to the LLM. Put @@GENE_SET@@ in the prompt to mark where the gene set should be inserted. If None, the default internal prompt is used
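The placeholder substitution described for the prompt parameter can be sketched as a plain string replacement. The fill_prompt helper, the comma separator, and the choice to raise ValueError when the marker is missing are assumptions made for illustration; the package may join genes or handle a missing marker differently.

```python
GENE_SET_MARKER = '@@GENE_SET@@'  # marker named in the parameter description


def fill_prompt(prompt, gene_names, separator=','):
    """Return prompt with the marker replaced by the joined gene names."""
    if GENE_SET_MARKER not in prompt:
        raise ValueError(f'prompt lacks {GENE_SET_MARKER}')
    return prompt.replace(GENE_SET_MARKER, separator.join(gene_names))


filled = fill_prompt('Name the process shared by: @@GENE_SET@@',
                     ['BRCA1', 'BRCA2', 'RAD51'])
```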

DEFAULT_PROMPT_FILE = 'default_prompt.txt'

annotate_gene_set(gene_names=None)[source]

Invokes the LLM specified by the model set in the constructor, using the prompt passed to the constructor

Parameters:

gene_names (list) – Genes to analyze

Raises:

CellmapshierarchyevalError – If LLM failed to run

Returns:

(‘process name (score)’, full output from LLM)

Return type:

tuple

static get_default_prompt()[source]

Gets default prompt stored with this package

Returns:

Return type:

str

get_prompt()[source]

Gets prompt used by this agent

class cellmaps_hierarchyeval.analysis.OllamaRestServiceGenesetAgent(prompt=None, model='llama2:latest', username=None, password=None, rest_url=None, temperature=0, max_tokens=1000, seed=42, attribute_name_prefix=None, max_retries=5, timeout=120, retry_wait=10)[source]

Bases: GenesetAgent

Calls LLM via REST service. Derived from ServerModel_LLM in https://github.com/idekerlab/agent_evaluation llm.py

Constructor

Parameters:
  • prompt (str) – Prompt to send to LLM

  • model (str) – Name of model

  • username (str) – Username to send via Basic Auth to service

  • password (str) – Password to send via Basic Auth to service

  • rest_url (str) – URL for service, should end with api/generate

  • temperature

  • max_tokens

  • seed

  • attribute_name_prefix

  • max_retries (int) – Number of times to retry failed query

  • timeout (int or float) – Time in seconds to wait for response from service

  • retry_wait (int or float) – Time in seconds to wait between retries for failed query
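How max_retries and retry_wait interact can be sketched with a small retry loop, alongside a helper that builds a Basic Auth header from username and password. Both functions are hypothetical. The sketch assumes that max_retries counts attempts made after the first failure, and it raises RuntimeError where the package documents CellmapshierarchyevalError. A stand-in callable replaces the network request so the example runs offline.

```python
import base64
import time


def basic_auth_header(username, password):
    """Build an HTTP Basic Auth header value from a username and password."""
    token = base64.b64encode(f'{username}:{password}'.encode('utf-8'))
    return 'Basic ' + token.decode('ascii')


def query_with_retries(send, max_retries=5, retry_wait=10):
    """Call send() until it succeeds or the retries are exhausted.

    Assumes max_retries counts attempts made after the first failure.
    """
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            return send()
        except Exception as e:
            last_error = e
            if attempt < max_retries:
                time.sleep(retry_wait)
    raise RuntimeError(f'query failed after {max_retries} retries') from last_error


# Demonstration with a stand-in that fails twice before succeeding
calls = {'count': 0}


def flaky_send():
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError('service unavailable')
    return 'ok'


response = query_with_retries(flaky_send, max_retries=5, retry_wait=0)
header = basic_auth_header('user', 'pass')
```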

annotate_gene_set(gene_names=None)[source]

Invokes the LLM specified by the model set in the constructor, using the prompt passed to the constructor

Parameters:

gene_names (list) – Genes to analyze

Raises:

CellmapshierarchyevalError – If LLM failed to run

Returns:

(‘process name (score)’, full output from LLM)

Return type:

tuple

get_prompt()[source]

Gets prompt used by this agent

cellmaps_hierarchyeval.perturb module

class cellmaps_hierarchyeval.perturb.PerturbSeqAnalysis(hierarchy, hierarchy_parent=None)[source]

Bases: object

Contains utilities to compare Perturbation data against hierarchy passed in via constructor

Constructor

Parameters:

static compare_cluster_root_similarities(cluster_functional_data_similarity, root_functional_data_similarity)[source]

Performs a rank-sum test to compare the distribution of functional data similarity scores between a specific cluster and gene pairs in root. This test helps determine if the similarity scores in the cluster are statistically significantly greater than those in the root.

Parameters:
  • cluster_functional_data_similarity (numpy.array) – An array of similarity scores within a specific cluster.

  • root_functional_data_similarity (list) – A list of non-NaN similarity scores for gene pairs not directly related in the root.

Returns:

A tuple containing the test statistic and the p-value of the rank-sum test.

Return type:

(float, float)
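The one-sided comparison described above can be sketched in pure Python using the normal approximation to the rank-sum statistic. This is an illustration of the test only, not the package's code: the function names are hypothetical, and no tie correction is applied to the variance. SciPy's scipy.stats.ranksums with alternative='greater' computes a comparable statistic, though whether the package relies on it is not stated here.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def ranksum_greater(cluster_scores, root_scores):
    """One-sided rank-sum test (normal approximation, no tie correction)."""
    n1, n2 = len(cluster_scores), len(root_scores)
    ranks = average_ranks(list(cluster_scores) + list(root_scores))
    rank_sum = sum(ranks[:n1])
    expected = n1 * (n1 + n2 + 1) / 2
    std = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum - expected) / std
    # P(Z >= z) for a standard normal variable
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


statistic, p_value = ranksum_greater([0.8, 0.9, 0.7], [0.1, 0.2, 0.3, 0.4])
```

A small p-value here indicates that the cluster scores tend to rank above the root scores.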

get_cluster_similarity(functional_data_similarity, hier_system_node_id)[source]

Retrieves the upper triangle similarity scores for genes within a specific cluster of a hierarchy. The scores are extracted from a DataFrame that contains scaled cosine similarity scores for genes that overlap between communities direct to root and Perturb-seq data.

Parameters:
  • functional_data_similarity (pandas.DataFrame) – A DataFrame of scaled cosine similarity scores for overlapping genes in communities direct to root and Perturb-seq data.

  • hier_system_node_id (int) – The identifier for a specific node within a hierarchy.

Returns:

An array of similarity scores from the upper triangle portion of the matrix for the specified cluster.

Return type:

numpy.array()
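Taking only the upper triangle means each unordered gene pair contributes one score and the diagonal is skipped. A minimal sketch of that idea follows, using a nested dict in place of the DataFrame so it runs with the standard library alone. The function name and data layout are assumptions.

```python
def upper_triangle_scores(similarity, genes):
    """Collect scores for unique gene pairs (i < j) among the given genes.

    similarity is a nested dict: similarity[row_gene][col_gene] -> score.
    Genes absent from the matrix are ignored.
    """
    present = [g for g in genes if g in similarity]
    scores = []
    for i, row_gene in enumerate(present):
        for col_gene in present[i + 1:]:
            scores.append(similarity[row_gene][col_gene])
    return scores


similarity = {
    'A': {'A': 1.0, 'B': 0.2, 'C': 0.5},
    'B': {'A': 0.2, 'B': 1.0, 'C': 0.7},
    'C': {'A': 0.5, 'B': 0.7, 'C': 1.0},
}
# 'D' belongs to the cluster but has no row in the similarity matrix
cluster_scores = upper_triangle_scores(similarity, ['A', 'B', 'C', 'D'])
```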

get_heatmap_for_given_hierarchy_system(hier_system_node_id, perturbseq_df, num_perturb_seq=25)[source]

Given an id for a system in hierarchy hier_system_node_id and Perturb-seq data perturbseq_df, create a heatmap of the num_perturb_seq most variable Perturb-seq proteins.

This is done by filtering perturbseq_df for rows that match genes in given system and then keeping num_perturb_seq most variable columns

Parameters:
  • hier_system_node_id (int) – node id system to analyze

  • perturbseq_df (pandas.DataFrame)

  • num_perturb_seq (int)

Returns:

heat map table

Return type:

pandas.DataFrame
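The filter-then-rank step can be sketched as follows: keep the rows for genes in the system, compute each column's variance over those rows, and retain the top columns. The nested-dict table, the function name, and the use of population variance are assumptions made so the example runs without pandas.

```python
from statistics import pvariance


def most_variable_columns(table, system_genes, num_columns=25):
    """Filter rows to system genes, then keep the highest-variance columns.

    table maps row gene -> {column name -> value}.
    """
    rows = {g: table[g] for g in system_genes if g in table}
    if not rows:
        return {}
    columns = list(next(iter(rows.values())))
    variances = {c: pvariance([row[c] for row in rows.values()]) for c in columns}
    keep = sorted(columns, key=lambda c: variances[c], reverse=True)[:num_columns]
    return {g: {c: row[c] for c in keep} for g, row in rows.items()}


table = {
    'G1': {'p1': 0.0, 'p2': 1.0, 'p3': 5.0},
    'G2': {'p1': 0.1, 'p2': 3.0, 'p3': 5.0},
    'G3': {'p1': 9.0, 'p2': 9.0, 'p3': 9.0},
}
# G3 is outside the system, and p3 is constant across G1 and G2
heatmap = most_variable_columns(table, ['G1', 'G2'], num_columns=2)
```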

static get_root_functional_data_similarity(functional_data_similarity, overlap_root_pairs)[source]

Extracts and returns a list of functional similarity scores for gene pairs that are not in the same community, based on a filtered upper-triangle extraction of the similarity matrix (which ensures that only unique, non-redundant gene pair comparisons are considered).

Parameters:
  • functional_data_similarity (pandas.DataFrame) – A DataFrame of scaled cosine similarity scores for overlapping genes in communities direct to root and Perturb-seq data.

  • overlap_root_pairs (pandas.DataFrame) – A DataFrame of root-associated similarity scores, filtered to only include overlapping genes. A score of 0 indicates a direct relation (same community) and scores greater than 0 indicate no direct relation

Returns:

A list of non-NaN similarity scores for gene pairs that are not directly related.

Return type:

list

get_root_gene_pair_similarities()[source]

Calculates similarity scores between gene pairs in the root node of a hierarchy. Genes in the same community linked to the root node are marked with a similarity of 0, indicating they are directly related, while all other pairs are set to 1, suggesting no direct relation.

Returns:

A DataFrame with genes as both rows and columns, populated with similarity scores.

Return type:

pandas.DataFrame
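The 0/1 encoding described above can be sketched with a nested dict standing in for the DataFrame. The function name and the input layout (one gene list per community attached to the root) are assumptions, as is the treatment of the diagonal, which this sketch sets to 0.

```python
def root_pair_matrix(communities):
    """Build a gene-by-gene matrix: 0 for same community, 1 otherwise.

    communities is a list of gene lists, one per community under the root.
    """
    membership = {}
    for index, genes in enumerate(communities):
        for gene in genes:
            membership.setdefault(gene, set()).add(index)
    all_genes = sorted(membership)
    return {
        a: {b: 0 if membership[a] & membership[b] else 1 for b in all_genes}
        for a in all_genes
    }


matrix = root_pair_matrix([['A', 'B'], ['C']])
```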

static get_root_overlapping_pair_similarities(root_pairs, perturbseq_df)[source]

Get similarity scores from perturbseq_df that match genes attached to the root node of the hierarchy

Parameters:
  • root_pairs (pandas.DataFrame) – A DataFrame representing similarity scores between all genes in the root node, where genes within the same community connected to the root have a score of 0, indicating direct relation, and all other pairs have a score of 1, indicating no direct relation.

  • perturbseq_df (pandas.DataFrame)

Returns:

A tuple containing:

  • A DataFrame of scaled cosine similarity scores for overlapping genes in communities direct to root and Perturb-seq data.

  • A DataFrame of root-associated similarity scores, filtered to only include overlapping genes.

Return type:

tuple

cellmaps_hierarchyeval.cellmaps_hierarchyevalcmd module

cellmaps_hierarchyeval.cellmaps_hierarchyevalcmd.get_model_prompt_from_string(o_prompt)[source]

Given the argument from the --ollama_prompts flag, extract the model and prompt, which can be in either of the following formats:

<MODEL> or <MODEL>,<PROMPT>

Where <MODEL> will always just be a string, but <PROMPT> can be a string or a path to a file

Parameters:

o_prompt (str) – argument passed to --ollama_prompts

Returns:

model, prompt

Return type:

tuple
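The two accepted formats can be parsed by splitting on the first comma and then checking whether the prompt part names an existing file. This is a hedged sketch, not the package's code: parse_model_prompt is a hypothetical name, and returning None for a missing prompt is an assumption.

```python
import os
import tempfile


def parse_model_prompt(o_prompt):
    """Split '<MODEL>' or '<MODEL>,<PROMPT>' into (model, prompt)."""
    if ',' not in o_prompt:
        return o_prompt, None  # None here stands for "use the default prompt"
    # Split only once so commas inside the prompt text are preserved
    model, prompt = o_prompt.split(',', 1)
    if os.path.isfile(prompt):
        with open(prompt, 'r') as f:
            prompt = f.read()
    return model, prompt


model_only = parse_model_prompt('llama2:latest')
inline = parse_model_prompt('llama2:latest,Summarize @@GENE_SET@@, briefly')

with tempfile.TemporaryDirectory() as tmpdir:
    prompt_path = os.path.join(tmpdir, 'prompt.txt')
    with open(prompt_path, 'w') as f:
        f.write('Describe @@GENE_SET@@')
    from_file = parse_model_prompt('llama2:latest,' + prompt_path)
```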

cellmaps_hierarchyeval.cellmaps_hierarchyevalcmd.get_ollama_geneset_agents(ollama='/usr/local/bin/ollama', ollama_prompts=None, username=None, password=None)[source]

Parses ollama_prompts from argparse and creates geneset agents

Parameters:
  • ollama (str) – Path to ollama binary or REST service

  • ollama_prompts (list)

Returns:

cellmaps_hierarchyeval.cellmaps_hierarchyevalcmd.main(args)[source]

Main entry point for program

Parameters:

args (list) – arguments passed to command line, usually sys.argv[1:]

Returns:

return value of cellmaps_hierarchyeval.runner.CellmapshierarchyevalRunner.run() or 2 if an exception is raised

Return type:

int
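The documented return contract (the runner's return value on success, 2 when an exception is raised) follows a common entry-point pattern, sketched below. The run_main wrapper and the callables passed to it are hypothetical stand-ins for main and the runner.

```python
import sys


def run_main(args, run):
    """Return run(args), or 2 if it raises, mirroring the documented contract."""
    try:
        return run(args)
    except Exception as e:
        sys.stderr.write(f'Caught exception: {e}\n')
        return 2


def failing_run(args):
    raise ValueError('bad input')


ok_code = run_main(['outdir'], lambda args: 0)
error_code = run_main(['outdir'], failing_run)
```

An entry point following this pattern is typically wired up as sys.exit(main(sys.argv[1:])), so the integer becomes the process exit status.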

cellmaps_hierarchyeval.exceptions module

exception cellmaps_hierarchyeval.exceptions.CellmapshierarchyevalError[source]

Bases: Exception

Base exception for cellmaps_hierarchyeval

cellmaps_hierarchyeval.runner module

class cellmaps_hierarchyeval.runner.BaseNetworkHelper(hierarchy_path)[source]

Bases: object

Base class for network helpers.

Constructor.

Parameters:

hierarchy_path (str) – File system path where the hierarchy network data is stored.

get_hierarchy_input_file()[source]

Creates file path prefix for hierarchy

Example path: /tmp/foo/hierarchy

Returns:

Prefix path on filesystem where Hierarchy Network resides

Return type:

str

class cellmaps_hierarchyeval.runner.CORUM_EnrichmentTerms(terms=None, term_name=None, hierarchy_genes=None, min_comp_size=4)[source]

Bases: EnrichmentTerms

This class extends the EnrichmentTerms class to handle terms specific to CORUM.

Constructor. Sets the parameters and initializes the term genes.

Parameters:
  • terms (NiceCXNetwork or None) – The terms to be processed.

  • term_name (str or None) – Name of the term.

  • hierarchy_genes (list or None) – Genes in the hierarchy.

  • min_comp_size (int) – Minimum number of genes in a term for it to be considered.

class cellmaps_hierarchyeval.runner.CX2NetworkHelper(hierarchy_path)[source]

Bases: BaseNetworkHelper

Helper class for CX2 network data manipulation that extends the BaseNetworkHelper class with CX2-specific logic.

Constructor.

Parameters:

hierarchy_path (str) – File system path where the CX2 hierarchy network data is stored.

static dump_to_file(hierarchy, hierarchy_out_file)[source]

Save the hierarchy to a CX2 formatted JSON file.

Parameters:
  • hierarchy (CX2Network) – The hierarchy to save.

  • hierarchy_out_file (str) – The file path where the hierarchy should be written.

static get_format()[source]

Get string format identifier for CX2 network data.

Returns:

The format identifier for CX2.

Return type:

str

get_hierarchy()[source]

Create and return a CX2 network object from the hierarchy path.

Returns:

An instance of the CX2Network class.

Return type:

CX2Network

static get_hierarchy_real_ids(hierarchy=None, hierarchy_size=None)[source]

Retrieve the real identifiers of nodes within the hierarchy.

Parameters:
  • hierarchy (CX2Network) – The hierarchy from which to extract node IDs.

  • hierarchy_size – Not used, but specified for compatibility.

Returns:

A list of node identifiers.

Return type:

list

static get_node_genes(_, node=None)[source]

Extract the gene identifiers from a given node.

Parameters:
  • _ – Placeholder, not used.

  • node (dict) – The node from which to extract gene identifiers.

Returns:

A list of gene identifiers.

Return type:

list

static get_nodes(hierarchy)[source]

Retrieve the nodes from the hierarchy.

Parameters:

hierarchy (CX2Network) – The hierarchy from which to retrieve nodes.

Returns:

A dictionary of nodes.

Return type:

dict

static get_suffix()[source]

Get the file suffix associated with CX2 files.

Returns:

The suffix for CX2 file types.

Return type:

str

static write_as_nodelist(hierarchy, dest_path)[source]

Write the nodes of the hierarchy to a specified file path as a tab-delimited list.

Parameters:
  • hierarchy (CX2Network) – The hierarchy containing the nodes to write.

  • dest_path (str) – The destination file path for the nodelist.

class cellmaps_hierarchyeval.runner.CellmapshierarchyevalRunner(outdir=None, hierarchy_dir=None, min_comp_size=4, max_fdr=0.05, min_jaccard_index=0.1, corum='633291aa-6e1d-11ef-a7fd-005056ae23aa', go_cc='6722d74d-6e20-11ef-a7fd-005056ae23aa', hpa='68c2f2c0-6e20-11ef-a7fd-005056ae23aa', ndex_server='http://www.ndexbio.org', geneset_agents=None, name=None, organization_name=None, project_name=None, input_data_dict=None, skip_term_enrichment=False, skip_logging=True, provenance_utils=<cellmaps_utils.provenance.ProvenanceUtil object>, geneset_annotator=<cellmaps_hierarchyeval.runner.GeneSetAgentAnnotator object>, provenance=None)[source]

Bases: object

Class to run Hierarchy evaluation

Constructor

Parameters:
  • outdir (str) – Output directory where results will be written

  • hierarchy_dir (str) – Directory containing the hierarchy network (output of cellmaps_generate_hierarchy)

  • min_comp_size (int) – Minimum number of genes required to evaluate a node or term (default: 4)

  • max_fdr (float) – Maximum adjusted p-value (FDR) to consider an enrichment result significant (default: 0.05)

  • min_jaccard_index (float) – Minimum Jaccard index required for an enrichment result to be accepted (default: 0.1)

  • corum (str) – UUID of the CORUM dataset on NDEx for enrichment comparison

  • go_cc (str) – UUID of the GO Cellular Component dataset on NDEx

  • hpa (str) – UUID of the Human Protein Atlas dataset on NDEx

  • ndex_server (str) – NDEx server URL to fetch enrichment datasets from (default: http://www.ndexbio.org)

  • geneset_agents (list or None) – Optional list of GeneSetAgent instances for gene set annotation

  • name (str) – Optional name to assign to this evaluation run

  • organization_name (str) – Optional name of the organization running the tool

  • project_name (str) – Optional name of the project to associate with this analysis

  • input_data_dict (dict) – Dictionary of input arguments, used for provenance tracking and command-line logging

  • skip_term_enrichment (bool) – If True, disables built-in CORUM, GO_CC, and HPA term enrichment

  • skip_logging (bool) – If True disables logging, otherwise writes logs to output directory

  • provenance_utils (cellmaps_utils.provenance.ProvenanceUtil) – ProvenanceUtil object to use for FAIRSCAPE registration

  • geneset_annotator (GeneSetAgentAnnotator) – Object for applying GeneSetAgent annotations to hierarchy nodes

  • provenance (dict) –

    Optional provenance dictionary used if RO-Crate metadata is unavailable. Example:

    {
        'name': 'Example input dataset',
        'organization-name': 'CM4AI',
        'project-name': 'Example'
    }
    

CORUM = '633291aa-6e1d-11ef-a7fd-005056ae23aa'
GO_CC = '6722d74d-6e20-11ef-a7fd-005056ae23aa'
HPA = '68c2f2c0-6e20-11ef-a7fd-005056ae23aa'
MAX_FDR = 0.05
MIN_COMP_SIZE = 4
MIN_JACCARD_INDEX = 0.1
NDEX_SERVER = 'http://www.ndexbio.org'

generate_readme()[source]

get_annotated_hierarchy_as_nodelist_dest_file()[source]

Creates file path prefix for hierarchy

Example path: /tmp/foo/hierarchy

Returns:

Prefix path on filesystem to write Hierarchy Network

Return type:

str

get_annotated_hierarchy_dest_file()[source]

Creates file path prefix for hierarchy

Example path: /tmp/foo/hierarchy

Returns:

Prefix path on filesystem to write Hierarchy Network

Return type:

str

get_hierarchy_parent_network_dest_file()[source]

Creates file path prefix for hierarchy parent network

Example path: /tmp/foo/hierarchy_parent

initialize_hierarchy_helper()[source]

Initializes hierarchy helper which will be used to call custom methods depending on whether the input was in CX or CX2 format.

Returns:

run()[source]

Evaluates CM4AI Hierarchy

Returns:

class cellmaps_hierarchyeval.runner.EnrichmentResult(term=None, pval=None, jaccard_index=None, overlap_genes=None)[source]

Bases: object

Base class for representing the results of enrichment analysis. It generates a hierarchy that is output in the CX format following the CDAPS style.

Constructor

Parameters:
  • term (str) – The term name.

  • pval (float) – P-value of the enrichment result.

  • jaccard_index (float) – Jaccard index of the enrichment result.

  • overlap_genes (list) – List of overlapping genes.

set_accepted(min_jaccard_index, max_fdr)[source]

Sets the accepted status of the enrichment result based on Jaccard index and FDR criteria.

Parameters:
  • min_jaccard_index (float) – Minimum required Jaccard index for the result to be accepted.

  • max_fdr (float) – Maximum allowed adjusted p-value (FDR) for the result to be accepted.
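The acceptance rule combines the two thresholds. The sketch below computes a Jaccard index for two gene lists and applies both cutoffs with the documented default values. The function names are hypothetical, and treating both comparisons as inclusive is an assumption.

```python
def jaccard_index(node_genes, term_genes):
    """Size of the intersection divided by the size of the union."""
    node, term = set(node_genes), set(term_genes)
    union = node | term
    return len(node & term) / len(union) if union else 0.0


def is_accepted(jaccard, adjusted_pval, min_jaccard_index=0.1, max_fdr=0.05):
    """Inclusive thresholds are an assumption of this sketch."""
    return jaccard >= min_jaccard_index and adjusted_pval <= max_fdr


# Two shared genes (C, D) out of five distinct genes overall
ji = jaccard_index(['A', 'B', 'C', 'D'], ['C', 'D', 'E'])
accepted = is_accepted(ji, adjusted_pval=0.01)
rejected = is_accepted(ji, adjusted_pval=0.2)
```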

set_adjusted_pval(adjusted_pval)[source]

Sets the adjusted p-value for the enrichment result.

Parameters:

adjusted_pval (float) – Adjusted p-value.
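This reference does not say which multiple-testing correction produces the adjusted p-value. Assuming the widely used Benjamini-Hochberg procedure for FDR control, the adjustment could be sketched as follows; the function name is hypothetical.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvals[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Each adjusted value would then be passed to set_adjusted_pval on the matching result before the acceptance thresholds are applied.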

set_description(description)[source]

Sets the description of the enrichment term results.

Parameters:

description (str) – Description for the term results.

class cellmaps_hierarchyeval.runner.EnrichmentTerms(terms=None, term_name=None, hierarchy_genes=None, min_comp_size=4)[source]

Bases: object

Base class for implementations that generate term databases for enrichment (e.g., HPA, CORUM, GO)

Constructor

Parameters:
  • terms (NiceCXNetwork or None) – The terms to be processed.

  • term_name (str or None) – Name of the term.

  • hierarchy_genes (list or None) – Genes in the hierarchy.

  • min_comp_size (int) – Minimum number of genes in a term for it to be considered.

class cellmaps_hierarchyeval.runner.GO_EnrichmentTerms(terms=None, term_name=None, hierarchy_genes=None, min_comp_size=4)[source]

Bases: EnrichmentTerms

This class extends the EnrichmentTerms class to handle terms specific to Gene Ontology (GO).

Constructor. Sets the parameters and initializes the term genes and term description.

Parameters:
  • terms (NiceCXNetwork or None) – The terms to be processed.

  • term_name (str or None) – Name of the term.

  • hierarchy_genes (list or None) – Genes in the hierarchy.

  • min_comp_size (int) – Minimum number of genes in a term for it to be considered.

class cellmaps_hierarchyeval.runner.GeneSetAgentAnnotator[source]

Bases: object

Annotates hierarchy with results from one or more GeneSetAgent objects

Constructor

annotate_hierarchy(geneset_agent=None, hierarchy=None)[source]

Annotates hierarchy with GeneSetAgent by adding new node attributes

Parameters:
  • geneset_agent

  • hierarchy

set_hierarchy_helper(hierarchy_helper)[source]

Sets HierarchyHelper

Parameters:

hierarchy_helper

Returns:

set_minimum_comparison_size(val)[source]

Only examine genesets of size val or larger

Parameters:

val (int)

class cellmaps_hierarchyeval.runner.HPA_EnrichmentTerms(terms=None, term_name=None, hierarchy_genes=None, min_comp_size=4)[source]

Bases: EnrichmentTerms

This class extends the EnrichmentTerms class to handle terms specific to the Human Protein Atlas (HPA).

Constructor

Parameters:
  • terms (NiceCXNetwork or None) – The terms to be processed.

  • term_name (str or None) – Name of the term.

  • hierarchy_genes (list or None) – Genes in the hierarchy.

  • min_comp_size (int) – Minimum number of genes in a term for it to be considered.

class cellmaps_hierarchyeval.runner.HiDeF_EnrichmentTerms(terms=None, term_name=None, hierarchy_genes=None, min_comp_size=4)[source]

Bases: EnrichmentTerms

This class extends the EnrichmentTerms class to handle terms specific to HiDeF output.

Constructor. Sets the parameters and initializes the term genes.

Parameters:
  • terms (NiceCXNetwork or None) – The terms to be processed.

  • term_name (str or None) – Name of the term.

  • hierarchy_genes (list or None) – Genes in the hierarchy.

  • min_comp_size (int) – Minimum number of genes in a term for it to be considered.

class cellmaps_hierarchyeval.runner.NiceCXNetworkHelper(hierarchy_path)[source]

Bases: BaseNetworkHelper

Helper class for NiceCX network data manipulation that extends the BaseNetworkHelper class with CX-specific logic.

Constructor.

Parameters:

hierarchy_path (str) – File system path where the NiceCX hierarchy network data is stored.

static dump_to_file(hierarchy, hierarchy_out_file)[source]

Save the hierarchy to a CX formatted JSON file.

Parameters:

static get_format()[source]

Get the string format identifier for CX data.

Returns:

The format identifier for NiceCX.

Return type:

str

get_hierarchy()[source]

Create and return a NiceCXNetwork object from the hierarchy path.

Returns:

An instance of the NiceCX network class.

Return type:

ndex2.nice_cx_network.NiceCXNetwork

static get_hierarchy_real_ids(hierarchy=None, hierarchy_size=None)[source]

Generate a list of real IDs for a given hierarchy size.

Parameters:
  • hierarchy – Not used, provided for compatibility.

  • hierarchy_size (int) – The size of the hierarchy to generate IDs for.

Returns:

A list of sequential integers representing node IDs.

Return type:

list

static get_node_genes(hierarchy=None, node=None)[source]

Extract the set of gene identifiers from a given node in the hierarchy.

Parameters:

Returns:

A set of gene identifiers.

Return type:

set

static get_nodes(hierarchy)[source]

Retrieve the nodes from the hierarchy.

Parameters:

hierarchy (ndex2.nice_cx_network.NiceCXNetwork) – The hierarchy from which to retrieve nodes.

Returns:

A dictionary of nodes.

Return type:

dict

static get_suffix()[source]

Get the file suffix associated with CX files.

Returns:

The suffix for NiceCX file types.

Return type:

str

static write_as_nodelist(hierarchy, dest_path)[source]

Write the nodes of the hierarchy to a specified file path as a tab-delimited list.

Parameters:

Module contents

Top-level package for cellmaps_hierarchyeval.