Metascape Gene List Analysis Report

metascape.org¹

Bar Graph Summary

Figure 1. Bar graph of enriched terms across input gene lists, colored by p-values.

Metascape only visualizes the top 20 clusters. Up to 100 enriched clusters can be viewed here.

Gene Lists

User-provided gene identifiers are first converted into their corresponding H. sapiens Entrez gene IDs using the latest version of the database (last updated on 2021-05-01). If multiple identifiers correspond to the same Entrez gene ID, they will be considered as a single Entrez gene ID in downstream analyses. The gene lists are summarized in Table 1.
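The collapsing of identifiers described above can be sketched as follows. This is a minimal illustration, assuming that "Total" in Table 1 counts the input identifiers and "Unique" counts the distinct Entrez gene IDs they map to; the function name and the toy mapping are hypothetical and are not Metascape's own code or data.

```python
def summarize_gene_list(identifiers, to_entrez):
    """Return (total, unique): input identifiers vs. distinct Entrez gene IDs."""
    # Identifiers without a mapping are skipped here; how Metascape
    # handles unmapped identifiers is an assumption of this sketch.
    entrez_ids = {to_entrez[i] for i in identifiers if i in to_entrez}
    return len(identifiers), len(entrez_ids)


# Toy mapping for illustration only (not taken from the report's gene list).
toy_mapping = {"POU5F1": 5460, "OCT4": 5460, "SOX2": 6657, "TP53": 7157, "P53": 7157}
total, unique = summarize_gene_list(["POU5F1", "OCT4", "SOX2", "TP53", "P53"], toy_mapping)
print(total, unique)  # 5 3
```

Aliases such as OCT4 and POU5F1 resolve to one Entrez gene ID, which is why "Unique" can be smaller than "Total".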

Table 1. Statistics of input gene lists.

Name	Total	Unique
MyList	238	187

Gene Annotation

The following annotations were retrieved from the latest version of the database (last updated on 2021-05-01) (Table 2).

Table 2. Gene annotations extracted

Name	Type	Description
Gene Symbol	Description	Primary HUGO gene symbol.
Description	Description	Short description.
Biological Process (GO)	Function/Location	Descriptions summarized based on gene ontology database, where up to three most informative GO terms are kept.
Kinase Class (UniProt)	Function/Location	Detailed kinase classes.
Protein Function (Protein Atlas)	Function/Location	Protein Function (Protein Atlas)
Subcellular Location (Protein Atlas)	Function/Location	Subcellular Location (Protein Atlas)
Drug (DrugBank)	Genotype/Phenotype/Disease	Drug information for the given gene as target.
Canonical Pathways	Ontology	Canonical Pathways
Hallmark Gene Sets	Ontology	Hallmark Gene Sets

Pathway and Process Enrichment Analysis

For each given gene list, pathway and process enrichment analysis has been carried out with the following ontology sources: KEGG Pathway, GO Biological Processes, Reactome Gene Sets, Canonical Pathways, CORUM, TRRUST, DisGeNET, PaGenBase, Transcription Factor Targets, WikiPathways, PANTHER Pathway and COVID. All genes in the genome have been used as the enrichment background. Terms with a p-value < 0.01, a minimum count of 3, and an enrichment factor > 1.5 (the enrichment factor is the ratio between the observed counts and the counts expected by chance) are collected and grouped into clusters based on their membership similarities. More specifically, p-values are calculated based on the cumulative hypergeometric distribution², and q-values are calculated using the Benjamini-Hochberg procedure to account for multiple testing³. Kappa scores⁴ are used as the similarity metric when performing hierarchical clustering on the enriched terms, and sub-trees with a similarity of > 0.3 are considered a cluster. The most statistically significant term within a cluster is chosen to represent the cluster.
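The statistics named above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not Metascape's implementation: the function names, the toy counts (a background of 20000 genes, a term with 500 members) and the way the three cutoffs are combined into one filter are assumptions made for the example.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n genes are drawn from N, of which K belong to the term."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def enrichment_factor(k, N, K, n):
    """Observed count divided by the count expected by chance (n * K / N)."""
    return k / (n * K / N)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg q-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        qvalues[idx] = running_min
    return qvalues


def passes_filter(p, count, ef, max_p=0.01, min_count=3, min_ef=1.5):
    """Apply the three cutoffs quoted in the text."""
    return p < max_p and count >= min_count and ef > min_ef


# Toy numbers (hypothetical): 40 of 183 input genes hit a 500-gene term.
p = hypergeom_sf(40, 20000, 500, 183)
ef = enrichment_factor(40, 20000, 500, 183)
print(passes_filter(p, 40, ef))
```

Exact integer arithmetic via `math.comb` keeps the tail sum accurate for very small p-values; a statistics library would normally be used for large-scale runs.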

Table 3. Top 20 clusters with their representative enriched terms (one per cluster). "Count" is the number of genes in the user-provided lists with membership in the given ontology term. "%" is the percentage of all of the user-provided genes that are found in the given ontology term (only input genes with at least one ontology term annotation are included in the calculation). "Log10(P)" is the base-10 logarithm of the p-value. "Log10(q)" is the base-10 logarithm of the multi-test adjusted p-value (q-value).

GO	Category	Description	Count	%	Log10(P)	Log10(q)
GO:0048598	GO Biological Processes	embryonic morphogenesis	40	21.86	-28.95	-24.59
GO:0003002	GO Biological Processes	regionalization	31	16.94	-26.29	-22.23
WP2064	WikiPathways	Neural Crest Differentiation	20	10.93	-23.49	-19.60
GO:0007423	GO Biological Processes	sensory organ development	33	18.03	-22.01	-18.35
GO:0021953	GO Biological Processes	central nervous system neuron differentiation	20	10.93	-18.35	-14.77
GO:0045165	GO Biological Processes	cell fate commitment	21	11.48	-16.69	-13.37
GO:0045665	GO Biological Processes	negative regulation of neuron differentiation	10	5.46	-11.00	-8.11
GO:0048729	GO Biological Processes	tissue morphogenesis	23	12.57	-10.47	-7.62
GO:0048736	GO Biological Processes	appendage development	13	7.10	-10.11	-7.28
GO:0007517	GO Biological Processes	muscle organ development	16	8.74	-9.49	-6.72
GO:0051301	GO Biological Processes	cell division	21	11.48	-9.36	-6.60
GO:0009953	GO Biological Processes	dorsal/ventral pattern formation	9	4.92	-8.73	-6.01
GO:0019827	GO Biological Processes	stem cell population maintenance	11	6.01	-8.48	-5.80
GO:0051216	GO Biological Processes	cartilage development	12	6.56	-8.46	-5.79
GO:0035270	GO Biological Processes	endocrine system development	10	5.46	-8.11	-5.47
GO:1904888	GO Biological Processes	cranial skeletal system development	8	4.37	-8.00	-5.38
GO:0048483	GO Biological Processes	autonomic nervous system development	7	3.83	-7.72	-5.14
GO:0030902	GO Biological Processes	hindbrain development	10	5.46	-7.57	-5.01
ko04550	KEGG Pathway	Signaling pathways regulating pluripotency of stem cells	10	5.46	-7.57	-5.01
WP2855	WikiPathways	Dopaminergic Neurogenesis	6	3.28	-7.44	-4.91
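The "%" and "Log10(P)" columns of Table 3 can be read back with simple arithmetic, sketched below. The denominator of 183 annotated genes is not stated in the report; it is inferred here from the table itself (for example, 40 / 183 = 21.86%) and should be treated as an assumption. The helper names are hypothetical.

```python
def percent_of_annotated(count, annotated_genes):
    """Share of annotated input genes that fall in a term, in percent."""
    return 100.0 * count / annotated_genes


def p_from_log10(log10_p):
    """Convert a Log10(P) table entry back to a plain p-value."""
    return 10.0 ** log10_p


# Assumption: 183 of the 187 unique genes carry at least one annotation
# (back-calculated from the Count and % columns, not given in the report).
ANNOTATED_GENES = 183

print(round(percent_of_annotated(40, ANNOTATED_GENES), 2))  # 21.86
print(p_from_log10(-28.95))  # roughly 1.1e-29
```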

To further capture the relationships between the terms, a subset of enriched terms has been selected and rendered as a network plot, where terms with a similarity > 0.3 are connected by edges. We select the terms with the best p-values from each of the 20 clusters, with the constraint that there are no more than 15 terms per cluster and no more than 250 terms in total. The network is visualized using Cytoscape⁵, where each node represents an enriched term and is colored first by its cluster ID (Figure 2.a) and then by its p-value (Figure 2.b). These networks can be viewed interactively in Cytoscape through the .cys files (contained in the Zip package, which also contains a publication-quality version as a PDF) or within a browser by clicking on the web icon. For clarity, term labels are shown for only one term per cluster, so it is recommended to use Cytoscape or a browser to inspect all node labels. The network can also be exported to a PDF file from within Cytoscape, and the labels can then be edited in Adobe Illustrator for publication purposes. To switch off all labels, delete the "Label" mapping under the "Style" tab within Cytoscape, and then export the network view.
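The similarity rule used for the network edges can be illustrated with Cohen's kappa computed on gene membership. This is a minimal sketch, not Metascape's code: the report does not say which gene universe the kappa score is computed over, so the universe is passed in explicitly, and the term names and gene labels are hypothetical.

```python
from itertools import combinations


def kappa(term_a, term_b, universe):
    """Cohen's kappa for two terms, treating each gene as in or out of a term."""
    n = len(universe)
    a = term_a & universe
    b = term_b & universe
    both = len(a & b)
    neither = n - len(a | b)
    observed = (both + neither) / n
    pa, pb = len(a) / n, len(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    if expected == 1.0:  # degenerate case: both terms empty or both universal
        return 0.0
    return (observed - expected) / (1 - expected)


def similarity_edges(terms, universe, threshold=0.3):
    """Return term pairs whose kappa exceeds the threshold."""
    return [
        (x, y)
        for x, y in combinations(sorted(terms), 2)
        if kappa(terms[x], terms[y], universe) > threshold
    ]


# Hypothetical toy data: ten genes and three terms.
universe = {f"g{i}" for i in range(10)}
terms = {
    "term_A": {"g0", "g1", "g2", "g3"},
    "term_B": {"g0", "g1", "g2", "g4"},
    "term_C": {"g7", "g8", "g9"},
}
print(similarity_edges(terms, universe))  # [('term_A', 'term_B')]
```

Terms A and B share three of four genes (kappa of about 0.58) and are linked, while term C shares none with either and stays unconnected.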

Figure 2. Network of enriched terms: (a) colored by cluster ID, where nodes that share the same cluster ID are typically close to each other; (b) colored by p-value, where terms containing more genes tend to have a more significant p-value.

Protein-protein Interaction Enrichment Analysis

For each given gene list, protein-protein interaction enrichment analysis has been carried out with the following databases: STRING⁶, BioGrid⁷, OmniPath⁸, and InWeb_IM⁹. Only physical interactions in STRING (physical score > 0.132) and BioGrid are used. The resultant network contains the subset of proteins that form physical interactions with at least one other member in the list. If the network contains between 3 and 500 proteins, the Molecular Complex Detection (MCODE) algorithm¹⁰ has been applied to identify densely connected network components. The MCODE networks identified for individual gene lists have been gathered and are shown in Figure 3.
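The network construction and the size gate for MCODE can be sketched as below. This is an illustration under stated assumptions, not Metascape's pipeline: the function names, the input formats (scored triples for STRING, plain pairs for BioGrid) and the inclusive reading of "between 3 and 500" are hypothetical, and MCODE itself is not implemented here.

```python
def physical_subnetwork(genes, string_edges, biogrid_edges, min_string_score=0.132):
    """Keep interactions whose two endpoints are both in the gene list.

    string_edges: iterable of (gene_a, gene_b, physical_score)
    biogrid_edges: iterable of (gene_a, gene_b)
    """
    genes = set(genes)
    pairs = [(a, b) for a, b, score in string_edges if score > min_string_score]
    pairs += list(biogrid_edges)
    edges = {
        frozenset((a, b))
        for a, b in pairs
        if a in genes and b in genes and a != b
    }
    # Only proteins with at least one retained interaction stay in the network.
    nodes = set().union(*edges)
    return nodes, edges


def eligible_for_mcode(nodes, min_size=3, max_size=500):
    """Size gate quoted in the text; bounds are assumed to be inclusive."""
    return min_size <= len(nodes) <= max_size


# Hypothetical toy input.
genes = ["G1", "G2", "G3", "G4", "G5"]
string_edges = [("G1", "G2", 0.9), ("G2", "G3", 0.10), ("G3", "G4", 0.5)]
biogrid_edges = [("G2", "G3"), ("G5", "X9")]
nodes, edges = physical_subnetwork(genes, string_edges, biogrid_edges)
print(sorted(nodes), eligible_for_mcode(nodes))  # ['G1', 'G2', 'G3', 'G4'] True
```

G5 drops out because its only partner lies outside the gene list, which mirrors the "at least one other member in the list" condition.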

Pathway and process enrichment analysis has been applied to each MCODE component independently, and the three best-scoring terms by p-value have been retained as the functional description of the corresponding components, shown in the tables underneath corresponding network plots within Figure 3.

Figure 3. Protein-protein interaction network and MCODE components identified in the gene lists.

GO	Description	Log10(P)
GO:0048598	embryonic morphogenesis	-26.4
WP2064	Neural Crest Differentiation	-22.1
GO:0007423	sensory organ development	-19.8

MCODE	GO	Description	Log10(P)
MCODE_1	GO:0007423	sensory organ development	-7.4
MCODE_1	GO:0001708	cell fate specification	-7.4
MCODE_1	GO:0048598	embryonic morphogenesis	-7.3
MCODE_2	GO:0051301	cell division	-14.1
MCODE_2	GO:0140014	mitotic nuclear division	-11.8
MCODE_2	R-HSA-983189	Kinesins	-11.1
MCODE_3	R-HSA-2559584	Formation of Senescence-Associated Heterochromatin Foci (SAHF)	-8.8
MCODE_3	R-HSA-2559583	Cellular Senescence	-7.9
MCODE_3	R-HSA-2559586	DNA Damage/Telomere Stress Induced Senescence	-6.7
MCODE_4	GO:0048598	embryonic morphogenesis	-8.5
MCODE_4	GO:0009954	proximal/distal pattern formation	-7.9
MCODE_4	GO:0048562	embryonic organ morphogenesis	-5.0
MCODE_5	R-HSA-8948216	Collagen chain trimerization	-11.3
MCODE_5	M3005	NABA COLLAGENS	-11.3
MCODE_5	R-HSA-1442490	Collagen degradation	-10.6
MCODE_6	R-HSA-452723	Transcriptional regulation of pluripotent stem cells	-9.3
MCODE_6	GO:0035019	somatic stem cell population maintenance	-7.8
MCODE_6	GO:0019827	stem cell population maintenance	-6.9
MCODE_7	GO:0034653	retinoic acid catabolic process	-11.6
MCODE_7	GO:0016103	diterpenoid catabolic process	-11.6
MCODE_7	GO:0016115	terpenoid catabolic process	-11.0

Quality Control and Association Analysis

Gene list enrichments are identified in the following ontology categories: Transcription_Factor_Targets, TRRUST, DisGeNET, PaGenBase, COVID. All genes in the genome have been used as the enrichment background. Terms with a p-value < 0.01, a minimum count of 3, and an enrichment factor > 1.5 (the enrichment factor is the ratio between the observed counts and the counts expected by chance) are collected and grouped into clusters based on their membership similarities. The top few enriched clusters (one term per cluster) are shown in Figures 4-8. The algorithm used here is the same as that used for pathway and process enrichment analysis.

Figure 4. Summary of enrichment analysis in Transcription_Factor_Targets¹¹.

GO	Description	Count	%	Log10(P)	Log10(q)
M30019	HSD17B8 TARGET GENES	40	22.00	-27.00	-22.00
M8172	HNF6 Q6	17	9.30	-12.00	-8.50
M4815	WTTGKCTG UNKNOWN	19	10.00	-8.80	-5.40
M18169	YATGNWAAT OCT C	16	8.70	-8.60	-5.20
M6325	SMAD3 Q6	13	7.10	-8.10	-4.80
M946	GGATTA PITX2 Q2	19	10.00	-7.90	-4.70
M6490	IPF1 Q4	13	7.10	-7.80	-4.60
M11587	CHX10 01	12	6.60	-7.40	-4.30
M3902	POU1F1 Q6	12	6.60	-7.20	-4.10
M5708	OCT1 05	12	6.60	-6.90	-3.90
M1328	WTGAAAT UNKNOWN	18	9.80	-6.80	-3.90
M4225	NFY 01	12	6.60	-6.80	-3.90
M9868	NFY Q6	12	6.60	-6.80	-3.80
M14960	POU3F2 02	12	6.60	-6.70	-3.80
M19088	TST1 01	12	6.60	-6.70	-3.80
M15623	SOX5 01	12	6.60	-6.70	-3.80
M4764	TGTTTGY HNF3 Q6	19	10.00	-6.30	-3.50
M16345	NKX3A 01	11	6.00	-6.30	-3.50
M6135	POU6F1 01	11	6.00	-6.30	-3.50
M1224	TCF4 Q5	11	6.00	-6.20	-3.40

Figure 5. Summary of enrichment analysis in TRRUST.

GO	Description	Count	%	Log10(P)	Log10(q)
TRR00230	Regulated by: E2F1	6	3.30	-3.50	-1.40
TRR00233	Regulated by: E2F4	3	1.60	-3.30	-1.20
TRR01419	Regulated by: TP53	6	3.30	-3.00	-1.00
TRR01259	Regulated by: SP3	5	2.70	-2.80	-0.92
TRR01277	Regulated by: STAT3	5	2.70	-2.40	-0.57
TRR00270	Regulated by: EP300	3	1.60	-2.10	-0.38

Figure 6. Summary of enrichment analysis in DisGeNET¹².

GO	Description	Count	%	Log10(P)	Log10(q)
C0376634	Craniofacial Abnormalities	25	14.00	-22.00	-18.00
C0008925	Cleft Palate	32	17.00	-19.00	-15.00
C0026010	Microphthalmos	18	9.80	-11.00	-7.30
C0000846	Agenesis	13	7.10	-10.00	-6.60
C0027794	Neural Tube Defects	16	8.70	-9.80	-6.20
C0038379	Strabismus	23	13.00	-9.50	-5.90
C1876203	Frontonasal dysplasia	6	3.30	-9.30	-5.80
C0432106	Midline facial cleft - Tessier cleft 0	5	2.70	-9.20	-5.80
C0152423	Congenital small ears	11	6.00	-8.80	-5.40
C4049796	Abnormality of cardiovascular system morphology	12	6.60	-8.10	-4.80
C0018784	Sensorineural Hearing Loss (disorder)	22	12.00	-8.00	-4.80
C0020534	Orbital separation excessive	19	10.00	-7.90	-4.70
C0740404	Limb defects	8	4.40	-7.90	-4.70
C1836542	Depressed nasal bridge	16	8.70	-7.70	-4.50
C0266589	Congenital ear anomaly NOS (disorder)	10	5.50	-7.60	-4.50
C1306710	Facial asymmetry	9	4.90	-7.40	-4.30
C2981150	Uranostaphyloschisis	11	6.00	-7.30	-4.20
C0040588	Tracheoesophageal Fistula	8	4.40	-7.30	-4.20
C0005745	Blepharoptosis	18	9.80	-7.10	-4.10
C0033377	Ptosis	18	9.80	-7.00	-4.00

Figure 7. Summary of enrichment analysis in PaGenBase¹³.

GO	Description	Count	%	Log10(P)	Log10(q)
PGB:00126	Cell-specific: Testis cell	5	2.70	-6.20	-3.40
PGB:00007	Tissue-specific: colon	11	6.00	-4.80	-2.50
PGB:00107	Cell-specific: SHSYSY-RA	4	2.20	-3.70	-1.50
PGB:00060	Tissue-specific: retinoblastoma	5	2.70	-3.40	-1.30
PGB:00101	Tissue-specific: Colorectal adenocarcinoma	3	1.60	-2.60	-0.75
PGB:00019	Cell-specific: H1-hesc	8	4.40	-2.00	-0.30

Figure 8. Summary of enrichment analysis in COVID¹⁴.

GO	Description	Count	%	Log10(P)	Log10(q)
COVID054	RNA_Xiong_PBMC_Up	18	9.80	-12.00	-8.10
COVID007	RNA_Blanco-Melo_A549_Down	13	7.10	-7.00	-4.00
COVID015	RNA_Blanco-Melo_Calu-3_Down	11	6.00	-5.30	-2.80
COVID134	Proteome_Stukalov_A549-ACE2_24h_Down	10	5.50	-4.60	-2.20
COVID147	Proteome_Stukalov_A549_72h_NSP13_Up	6	3.30	-3.90	-1.70
COVID071	Proteome_Bouhaddou_Vero_E6_24h_Down	5	2.70	-3.40	-1.30
COVID315	RNA_Wilk_CD8+T-cells_patient-C6_Up	3	1.60	-2.60	-0.75
COVID313	RNA_Wilk_CD8+T-cells_patient-C5_Up	3	1.60	-2.60	-0.70
COVID049	RNA_Wyler_Calu-3_24h_Down	7	3.80	-2.40	-0.61
COVID347	RNA_Wilk_B-cells_patient-C6_Up	4	2.20	-2.20	-0.45
COVID185	Proteome_Stukalov_A549_72h_S_Up	3	1.60	-2.10	-0.35

References

1. Zhou Y. et al. Metascape provides a biologist-oriented resource for the analysis of systems-level datasets. Nature Communications (2019) 10(1):1523.
2. Zar, J.H. Biostatistical Analysis, 4th edn. Prentice Hall, NJ (1999), p. 523.
3. Hochberg Y., Benjamini Y. More powerful procedures for multiple significance testing. Statistics in Medicine (1990) 9:811-818.
4. Cohen, J. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. (1960) 20:27-46.
5. Shannon P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. (2003) 13:2498-2504.
6. Szklarczyk D. et al. STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. (2019) 47:D607-D613.
7. Stark C. et al. BioGRID: a general repository for interaction datasets. Nucleic Acids Res. (2006) 34:D535-D539.
8. Turei D. et al. OmniPath: guidelines and gateway for literature-curated signaling pathway resources. Nat. Methods (2016) 13:966-967.
9. Li T. et al. A scored human protein-protein interaction network to catalyze genomic interpretation. Nat. Methods (2017) 14:61-64.
10. Bader G.D. et al. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics (2003) 4:2.
11. Subramanian A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U.S.A. (2005) 102:15545-15550.
12. Piñero J. et al. DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. Nucleic Acids Res. (2017) 45:D833-D839.
13. Pan J.B. et al. PaGenBase: a pattern gene database for the global and dynamic understanding of gene function. PLoS One (2013) 8:e80747.
14. https://metascape.org/COVID.