Metascape Gene List Analysis Report

metascape.org¹

Bar Graph Summary

Figure 1. Bar graph of enriched terms across input gene lists, colored by p-values.

Metascape only visualizes the top 20 clusters. Up to 100 enriched clusters can be viewed here.

Gene Lists

User-provided gene identifiers are first converted into their corresponding H. sapiens Entrez gene IDs using the latest version of the database (last updated on 2020-03-19). If multiple identifiers correspond to the same Entrez gene ID, they will be considered as a single Entrez gene ID in downstream analyses. The gene lists are summarized in Table 1.
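The counting behind Table 1 can be sketched as follows. This is a minimal illustration, not Metascape's code: the function name `summarize_gene_list` and the mapping `id_to_entrez` are hypothetical, and it assumes "Total" counts every submitted identifier while "Unique" counts distinct Entrez gene IDs after conversion.

```python
def summarize_gene_list(identifiers, id_to_entrez):
    """Return (total, unique) for one input list.

    identifiers  : identifiers as submitted by the user (may contain synonyms)
    id_to_entrez : hypothetical lookup from identifier to Entrez gene ID
    """
    total = len(identifiers)
    # Identifiers that map to the same Entrez gene ID collapse into one entry;
    # identifiers with no mapping are ignored (an assumption of this sketch).
    entrez_ids = {id_to_entrez[i] for i in identifiers if i in id_to_entrez}
    return total, len(entrez_ids)


# Illustrative mapping only: two synonyms pointing at the same gene.
example_map = {"TP53": 7157, "P53": 7157, "NKX2-5": 1482}
total, unique = summarize_gene_list(["TP53", "P53", "NKX2-5", "UNKNOWN"], example_map)
```

In this toy input, four identifiers collapse into two distinct genes, which is the same kind of reduction Table 1 reports for the real list.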

Table 1. Statistics of input gene lists.

Name	Total	Unique
Input ID	522	305

Gene Annotation

The following is the list of annotations retrieved from the latest version of the database (last updated on 2020-03-19) (Table 2).

Table 2. Gene annotations extracted

Name	Type	Description
Gene Symbol	Description	Primary HUGO gene symbol.
Description	Description	Short description.
Biological Process (GO)	Function/Location	Descriptions summarized based on gene ontology database, where up to three most informative GO terms are kept.
Cellular Component (GO)	Function/Location	Descriptions summarized based on gene ontology database, where up to three most informative GO terms are kept.
Molecular Function (GO)	Function/Location	Descriptions summarized based on gene ontology database, where up to three most informative GO terms are kept.
GWAS (NHGRI-EBI)	Genotype/Phenotype/Disease	Genome-wide association study (NHGRI)
Developmental Disorders (DDG2P)	Genotype/Phenotype/Disease	Developmental Disorders (DDG2P)
Disease and gene associations (DisGeNET)	Genotype/Phenotype/Disease	Large collections of genes and variants associated to human disease
Canonical Pathways	Ontology	Canonical Pathways
KEGG Pathway	Ontology	KEGG Pathway
Hallmark Gene Sets	Ontology	Hallmark Gene Sets

Pathway and Process Enrichment Analysis

For each given gene list, pathway and process enrichment analysis has been carried out with the following ontology sources: GO Biological Processes, GO Cellular Components, GO Molecular Functions and DisGeNET. All genes in the genome have been used as the enrichment background. Terms with a p-value < 0.01, a minimum count of 3, and an enrichment factor > 1.5 (the enrichment factor is the ratio between the observed counts and the counts expected by chance) are collected and grouped into clusters based on their membership similarities. More specifically, p-values are calculated based on the cumulative hypergeometric distribution², and q-values are calculated using the Benjamini-Hochberg procedure to account for multiple testing³. Kappa scores⁴ are used as the similarity metric when performing hierarchical clustering on the enriched terms, and sub-trees with a similarity of > 0.3 are considered a cluster. The most statistically significant term within a cluster is chosen to represent the cluster.
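The statistics named above can be sketched in a few lines of standard-library Python. This is an illustrative reimplementation under stated assumptions, not Metascape's code: the function names are hypothetical, N is the number of background genes, K the number of background genes annotated with the term, n the size of the input list, and k the number of input genes annotated with the term.

```python
from math import comb


def hypergeom_tail(k, N, K, n):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which carry the annotation (upper tail of the hypergeometric)."""
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i)
                     for i in range(k, min(K, n) + 1))
    return favourable / total


def enrichment_factor(k, N, K, n):
    """Observed count divided by the count expected by chance (n * K / N)."""
    return k / (n * K / N)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        qvalues[idx] = running_min
    return qvalues


def passes_filters(p, count, factor):
    """Thresholds quoted in the text: p < 0.01, count >= 3, factor > 1.5."""
    return p < 0.01 and count >= 3 and factor > 1.5
```

For example, `hypergeom_tail(2, 10, 4, 3)` evaluates the chance of seeing at least 2 annotated genes in a draw of 3 from a background of 10 with 4 annotated. For genome-scale backgrounds a library routine such as SciPy's hypergeometric survival function would normally be preferred over exact integer arithmetic.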

Table 3. Top 20 clusters with their representative enriched terms (one per cluster). "Count" is the number of genes in the user-provided lists with membership in the given ontology term. "%" is the percentage of all of the user-provided genes that are found in the given ontology term (only input genes with at least one ontology term annotation are included in the calculation). "Log10(P)" is the p-value in log base 10. "Log10(q)" is the multi-test adjusted p-value in log base 10.

GO	Category	Description	Count	%	Log10(P)	Log10(q)
GO:0007507	GO Biological Processes	heart development	58	27.23	-43.31	-38.76
GO:0003007	GO Biological Processes	heart morphogenesis	35	16.43	-30.36	-26.42
GO:0060485	GO Biological Processes	mesenchyme development	31	14.55	-24.38	-20.94
GO:0001501	GO Biological Processes	skeletal system development	36	16.90	-20.92	-17.80
GO:0001568	GO Biological Processes	blood vessel development	31	29.52	-20.90	-17.80
GO:0031012	GO Cellular Components	extracellular matrix	34	15.96	-18.82	-15.78
GO:0008015	GO Biological Processes	blood circulation	34	15.96	-18.59	-15.58
GO:0070848	GO Biological Processes	response to growth factor	38	17.84	-17.97	-15.05
GO:0045596	GO Biological Processes	negative regulation of cell differentiation	37	17.37	-16.67	-13.86
GO:0001503	GO Biological Processes	ossification	21	20.00	-16.20	-13.43
GO:0003197	GO Biological Processes	endocardial cushion development	13	6.10	-15.84	-13.13
GO:0055123	GO Biological Processes	digestive system development	18	8.45	-15.21	-12.57
GO:2000027	GO Biological Processes	regulation of animal organ morphogenesis	16	15.24	-13.70	-11.21
GO:0043408	GO Biological Processes	regulation of MAPK cascade	23	21.90	-12.88	-10.47
GO:0051890	GO Biological Processes	regulation of cardioblast differentiation	6	5.71	-12.79	-10.39
GO:0003207	GO Biological Processes	cardiac chamber formation	7	3.29	-11.54	-9.26
GO:2000826	GO Biological Processes	regulation of heart morphogenesis	10	4.69	-11.47	-9.19
GO:0048608	GO Biological Processes	reproductive structure development	17	16.19	-11.14	-8.89
GO:0061383	GO Biological Processes	trabecula morphogenesis	10	4.69	-10.85	-8.63
GO:0030029	GO Biological Processes	actin filament-based process	30	14.08	-10.83	-8.61

To further capture the relationships between the terms, a subset of enriched terms has been selected and rendered as a network plot, where terms with a similarity > 0.3 are connected by edges. We select the terms with the best p-values from each of the 20 clusters, with the constraint that there are no more than 15 terms per cluster and no more than 250 terms in total. The network is visualized using Cytoscape⁵, where each node represents an enriched term and is colored first by its cluster ID (Figure 2.a) and then by its p-value (Figure 2.b). These networks can be interactively viewed in Cytoscape through the .cys files (contained in the Zip package, which also contains a publication-quality version as a PDF) or within a browser by clicking on the web icon. For clarity, term labels are only shown for one term per cluster, so it is recommended to use Cytoscape or a browser to visualize the network in order to inspect all node labels. The network can also be exported to a PDF file from within Cytoscape, and the labels can then be edited in Adobe Illustrator for publication purposes. To switch off all labels, delete the "Label" mapping under the "Style" tab within Cytoscape, and then export the network view.
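The similarity-based edges can be illustrated with Cohen's kappa computed on gene membership. The sketch below is an assumption-laden illustration rather than Metascape's implementation: the function names are hypothetical, and the choice of `universe` (the set of genes over which agreement is measured) is an assumption, since the report does not state it.

```python
from itertools import combinations


def kappa_score(genes_a, genes_b, universe):
    """Cohen's kappa for the agreement between two terms' gene memberships,
    measured over the genes in `universe`."""
    a = set(genes_a) & universe
    b = set(genes_b) & universe
    n = len(universe)
    both = len(a & b)
    only_a = len(a - b)
    only_b = len(b - a)
    neither = n - both - only_a - only_b
    observed = (both + neither) / n
    # Chance agreement from the marginal "member" and "non-member" rates.
    expected = ((both + only_a) * (both + only_b)
                + (neither + only_b) * (neither + only_a)) / n ** 2
    if expected == 1.0:
        return 0.0  # degenerate case: no variation to agree on
    return (observed - expected) / (1.0 - expected)


def term_edges(term_genes, universe, threshold=0.3):
    """Return (term1, term2, kappa) for every pair with kappa > threshold."""
    edges = []
    for t1, t2 in combinations(sorted(term_genes), 2):
        score = kappa_score(term_genes[t1], term_genes[t2], universe)
        if score > threshold:
            edges.append((t1, t2, score))
    return edges


# Toy data: terms A and B share three of four genes; term C is disjoint.
universe = {f"g{i}" for i in range(1, 11)}
terms = {
    "A": {"g1", "g2", "g3", "g4"},
    "B": {"g1", "g2", "g3", "g5"},
    "C": {"g9", "g10"},
}
edges = term_edges(terms, universe)
```

With this toy data only the pair (A, B) clears the 0.3 threshold, so the network would contain a single edge between those two terms.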

Figure 2. Network of enriched terms: (a) colored by cluster ID, where nodes that share the same cluster ID are typically close to each other; (b) colored by p-value, where terms containing more genes tend to have a more significant p-value.

Protein-protein Interaction Enrichment Analysis

For each given gene list, protein-protein interaction enrichment analysis has been carried out with the following databases: BioGRID⁶, InWeb_IM⁷, OmniPath⁸. The resultant network contains the subset of proteins that form physical interactions with at least one other member in the list. If the network contains between 3 and 500 proteins, the Molecular Complex Detection (MCODE) algorithm⁹ has been applied to identify densely connected network components. The MCODE networks identified for individual gene lists have been gathered and are shown in Figure 3.
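The network construction step described above can be sketched as a simple filter over an interaction table. The code is illustrative only: `ppi_subnetwork` and `mcode_applicable` are hypothetical names, the interaction pairs stand in for records pulled from the listed databases, and treating the 3 to 500 bounds as inclusive is an assumption. MCODE itself is not implemented here.

```python
def ppi_subnetwork(gene_list, interactions):
    """Keep only proteins from gene_list that physically interact with at
    least one other member of the same list.

    interactions : iterable of (protein_a, protein_b) pairs
    Returns (nodes, edges) with edges stored as unordered pairs.
    """
    genes = set(gene_list)
    edges = {
        frozenset((a, b))
        for a, b in interactions
        if a in genes and b in genes and a != b  # drop self-interactions
    }
    nodes = set().union(*edges) if edges else set()
    return nodes, edges


def mcode_applicable(nodes, low=3, high=500):
    """Size window quoted in the text for running MCODE (assumed inclusive)."""
    return low <= len(nodes) <= high


# Toy input: D has only a self-interaction and X is not in the gene list.
nodes, edges = ppi_subnetwork(
    ["A", "B", "C", "D"],
    [("A", "B"), ("B", "C"), ("C", "X"), ("D", "D")],
)
```

In the toy input, D and X drop out, leaving a three-protein network that falls inside the size window.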

Pathway and process enrichment analysis has been applied to each MCODE component independently, and the three best-scoring terms by p-value have been retained as the functional description of the corresponding components, shown in the tables underneath corresponding network plots within Figure 3.

Figure 3. Protein-protein interaction network and MCODE components identified in the gene lists.

GO	Description	Log10(P)
GO:0007507	heart development	-27.4
GO:0003205	cardiac chamber development	-24.3
GO:0003206	cardiac chamber morphogenesis	-24.0

MCODE	GO	Description	Log10(P)
MCODE_1	GO:0007200	phospholipase C-activating G protein-coupled receptor signaling pathway	-11.0
MCODE_1	GO:0051482	positive regulation of cytosolic calcium ion concentration involved in phospholipase C-activating G	-10.1
MCODE_1	GO:0051480	regulation of cytosolic calcium ion concentration	-8.3
MCODE_2	GO:0005104	fibroblast growth factor receptor binding	-8.0
MCODE_2	GO:0051897	positive regulation of protein kinase B signaling	-7.9
MCODE_2	GO:0007169	transmembrane receptor protein tyrosine kinase signaling pathway	-7.6
MCODE_3	GO:0003208	cardiac ventricle morphogenesis	-12.6
MCODE_3	GO:0003215	cardiac right ventricle morphogenesis	-11.7
MCODE_3	GO:0003231	cardiac ventricle development	-11.4
MCODE_5	GO:0062009	secondary palate development	-9.1
MCODE_5	GO:0060412	ventricular septum morphogenesis	-8.2
MCODE_5	GO:0003281	ventricular septum development	-7.5

Gene Prioritization by Evidence Counting (GPEC) (beta feature)

GPEC is an effective way to identify a subset of genes that are more likely to be higher-quality hits. As we are still working on a GPEC publication, please be advised of the risk when using GPEC analysis results. A gene receives an evidence token whenever it is identified as a hit within an input gene list, falls into at least one enriched pathway (pathway size is no more than 100) derived from an input gene list, or is part of the protein-protein interaction network formed by a given input gene list. The evidence count of a gene is the total number of evidence tokens it receives. Given n input gene lists, the maximum possible evidence count is 3n. Our research suggests that the likelihood of a gene being a true biological hit increases as its total evidence count increases. Therefore, all input genes can be ranked in descending order by their evidence counts, and then various evidence count cutoffs can be applied to generate new gene lists of varying quality. Shorter gene lists containing most of the top-ranked genes are of higher quality (i.e., higher precision and lower false discovery rate), and longer gene lists obtained under relaxed cutoffs are more comprehensive (i.e., higher recall and sensitivity). Term enrichment analysis is carried out for different cutoffs, and the best p-value per term is retained. For protein network analysis, an evidence cutoff that yields a _FINAL network of approximately 250 protein nodes is selected, so that the network remains rich yet visually interpretable. Conceptually, the processes and network components identified by the GPEC algorithm are more robust, as they are based on the subset of genes that have higher evidence counts, compared to relying on the list in which all gene lists are merged into one. Additional information regarding GPEC can be found on the menu page.
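The evidence-counting rule can be sketched as follows. This is a minimal illustration based only on the description above, not the GPEC implementation: the function names and the three input dictionaries are hypothetical, and the sketch assumes that each of the three evidence types is checked independently for every input list.

```python
from collections import Counter


def evidence_counts(gene_lists, pathway_members, network_members):
    """Count evidence tokens per gene.

    gene_lists      : {list_name: genes that are hits in that list}
    pathway_members : {list_name: genes in an enriched pathway from that list}
    network_members : {list_name: genes in that list's PPI network}
    Each list can contribute up to three tokens, so the maximum is 3n.
    """
    hits = {name: set(genes) for name, genes in gene_lists.items()}
    all_genes = set().union(*hits.values()) if hits else set()
    counts = Counter()
    for gene in all_genes:
        for name in hits:
            counts[gene] += gene in hits[name]
            counts[gene] += gene in pathway_members.get(name, set())
            counts[gene] += gene in network_members.get(name, set())
    return counts


def rank_and_cut(counts, cutoff):
    """Genes with at least `cutoff` tokens, best-supported first."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gene for gene, count in ranked if count >= cutoff]


# Toy example with two input lists.
counts = evidence_counts(
    {"L1": ["A", "B", "C"], "L2": ["A", "D"]},
    {"L1": {"A", "B"}, "L2": {"A"}},
    {"L1": {"A"}, "L2": set()},
)
shortlist = rank_and_cut(counts, cutoff=2)
```

Raising the cutoff shortens the list and keeps only the best-supported genes, while lowering it trades precision for recall, as described above.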

References

Zhou et al., Metascape provides a biologist-oriented resource for the analysis of systems-level datasets. Nature Communications (2019) 10(1):1523.
Zar, J.H. Biostatistical Analysis, 4th edn. Prentice Hall, NJ (1999) p. 523.
Hochberg Y., Benjamini Y. More powerful procedures for multiple significance testing. Statistics in Medicine (1990) 9:811-818.
Cohen, J. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. (1960) 20:37-46.
Shannon P. et al., Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res (2003) 13:2498-2504.
Stark C. et al. BioGRID: a general repository for interaction datasets. Nucleic Acids Res. (2006) 34:D535-539.
Li T. et al. A scored human protein-protein interaction network to catalyze genomic interpretation. Nat. Methods. (2017) 14:61-64.
Turei D. et al. OmniPath: guidelines and gateway for literature-curated signaling pathway resources. Nat. Methods. (2016) 13:966-967.
Bader, G.D. et al. An automated method for finding molecular complexes in large protein interaction networks. BMC bioinformatics (2003) 4:2.