Metascape Gene List Analysis Report

metascape.org [1]

Bar Graph Summary

Figure 1. Bar graph of enriched terms across input gene lists, colored by p-values.
Metascape only visualizes the top 20 clusters. Up to 100 enriched clusters can be viewed here.

Gene Lists

User-provided gene identifiers are first converted into their corresponding H. sapiens Entrez gene IDs using the latest version of the database (last updated on 2021-05-01). If multiple identifiers correspond to the same Entrez gene ID, they will be considered as a single Entrez gene ID in downstream analyses. The gene lists are summarized in Table 1.

Table 1. Statistics of input gene lists.
Name | Total | Unique
MyList | 238 | 187
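
The identifier-collapsing step described above can be sketched as follows. This is a minimal illustration, not Metascape's implementation: the function name `to_unique_entrez`, the toy lookup table, and the choice to skip unmapped identifiers silently are all assumptions made for the example.

```python
from typing import Dict, Iterable, List


def to_unique_entrez(identifiers: Iterable[str],
                     id_to_entrez: Dict[str, int]) -> List[int]:
    """Map identifiers to Entrez gene IDs and keep each ID once, in input order."""
    seen = set()
    unique = []
    for ident in identifiers:
        entrez = id_to_entrez.get(ident)
        if entrez is None:
            # Assumption: identifiers with no mapping are dropped.
            continue
        if entrez not in seen:
            seen.add(entrez)
            unique.append(entrez)
    return unique


# Hypothetical lookup table; a real run would query the annotation database.
toy_map = {"TP53": 7157, "P53": 7157, "ENSG00000141510": 7157, "SOX2": 6657}
input_ids = ["TP53", "P53", "SOX2", "ENSG00000141510", "NOT_A_GENE"]

unique_ids = to_unique_entrez(input_ids, toy_map)
print("Total:", len(input_ids), "Unique:", len(unique_ids))
```

In this toy input, three different identifiers point at the same gene, so they count once toward the unique total, which is the relationship between the "Total" and "Unique" columns of Table 1.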

Gene Annotation

The following annotations are retrieved from the latest version of the database (last updated on 2021-05-01) (Table 2).

Table 2. Gene annotations extracted
Name | Type | Description
Gene Symbol | Description | Primary HUGO gene symbol.
Description | Description | Short description.
Biological Process (GO) | Function/Location | Descriptions summarized based on gene ontology database, where up to three most informative GO terms are kept.
Kinase Class (UniProt) | Function/Location | Detailed kinase classes.
Protein Function (Protein Atlas) | Function/Location | Protein Function (Protein Atlas)
Subcellular Location (Protein Atlas) | Function/Location | Subcellular Location (Protein Atlas)
Drug (DrugBank) | Genotype/Phenotype/Disease | Drug information for the given gene as target.
Canonical Pathways | Ontology | Canonical Pathways
Hallmark Gene Sets | Ontology | Hallmark Gene Sets

Pathway and Process Enrichment Analysis

For each given gene list, pathway and process enrichment analysis has been carried out with the following ontology sources: KEGG Pathway, GO Biological Processes, Reactome Gene Sets, Canonical Pathways, CORUM, TRRUST, DisGeNET, PaGenBase, Transcription Factor Targets, WikiPathways, PANTHER Pathway and COVID. All genes in the genome have been used as the enrichment background. Terms with a p-value < 0.01, a minimum count of 3, and an enrichment factor > 1.5 (the enrichment factor is the ratio between the observed counts and the counts expected by chance) are collected and grouped into clusters based on their membership similarities. More specifically, p-values are calculated based on the cumulative hypergeometric distribution [2], and q-values are calculated using the Benjamini-Hochberg procedure to account for multiple testing [3]. Kappa scores [4] are used as the similarity metric when performing hierarchical clustering on the enriched terms, and sub-trees with a similarity of > 0.3 are considered a cluster. The most statistically significant term within a cluster is chosen to represent the cluster.
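
The three statistics named in this paragraph can be illustrated with a short standard-library sketch. It is not Metascape's code: the function names are invented for the example, and the background size, term size, list size and overlap in the usage lines are hypothetical numbers, not values from this report.

```python
from math import comb


def hypergeom_sf(N: int, K: int, n: int, k: int) -> float:
    """P(X >= k) when n genes are drawn from N, of which K belong to the term."""
    upper = min(K, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


def enrichment_factor(N: int, K: int, n: int, k: int) -> float:
    """Observed overlap divided by the overlap expected by chance (n * K / N)."""
    return k / (n * K / N)


def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        qvals[idx] = running_min
    return qvals


# Hypothetical numbers: 20000 background genes, a 200-gene term,
# a 100-gene input list, and 5 list genes found in the term.
p = hypergeom_sf(20000, 200, 100, 5)
ef = enrichment_factor(20000, 200, 100, 5)
q = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(p, ef, q)
```

Exact integer binomials are used so that the tail probability stays accurate for very small p-values; a statistics library would normally be used instead for large-scale runs.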

Table 3. Top 20 clusters with their representative enriched terms (one per cluster). "Count" is the number of genes in the user-provided lists with membership in the given ontology term. "%" is the percentage of all of the user-provided genes that are found in the given ontology term (only input genes with at least one ontology term annotation are included in the calculation). "Log10(P)" is the p-value in log base 10. "Log10(q)" is the multi-test adjusted p-value in log base 10.
GO | Category | Description | Count | % | Log10(P) | Log10(q)
GO:0048598 | GO Biological Processes | embryonic morphogenesis | 40 | 21.86 | -28.95 | -24.59
GO:0003002 | GO Biological Processes | regionalization | 31 | 16.94 | -26.29 | -22.23
WP2064 | WikiPathways | Neural Crest Differentiation | 20 | 10.93 | -23.49 | -19.60
GO:0007423 | GO Biological Processes | sensory organ development | 33 | 18.03 | -22.01 | -18.35
GO:0021953 | GO Biological Processes | central nervous system neuron differentiation | 20 | 10.93 | -18.35 | -14.77
GO:0045165 | GO Biological Processes | cell fate commitment | 21 | 11.48 | -16.69 | -13.37
GO:0045665 | GO Biological Processes | negative regulation of neuron differentiation | 10 | 5.46 | -11.00 | -8.11
GO:0048729 | GO Biological Processes | tissue morphogenesis | 23 | 12.57 | -10.47 | -7.62
GO:0048736 | GO Biological Processes | appendage development | 13 | 7.10 | -10.11 | -7.28
GO:0007517 | GO Biological Processes | muscle organ development | 16 | 8.74 | -9.49 | -6.72
GO:0051301 | GO Biological Processes | cell division | 21 | 11.48 | -9.36 | -6.60
GO:0009953 | GO Biological Processes | dorsal/ventral pattern formation | 9 | 4.92 | -8.73 | -6.01
GO:0019827 | GO Biological Processes | stem cell population maintenance | 11 | 6.01 | -8.48 | -5.80
GO:0051216 | GO Biological Processes | cartilage development | 12 | 6.56 | -8.46 | -5.79
GO:0035270 | GO Biological Processes | endocrine system development | 10 | 5.46 | -8.11 | -5.47
GO:1904888 | GO Biological Processes | cranial skeletal system development | 8 | 4.37 | -8.00 | -5.38
GO:0048483 | GO Biological Processes | autonomic nervous system development | 7 | 3.83 | -7.72 | -5.14
GO:0030902 | GO Biological Processes | hindbrain development | 10 | 5.46 | -7.57 | -5.01
ko04550 | KEGG Pathway | Signaling pathways regulating pluripotency of stem cells | 10 | 5.46 | -7.57 | -5.01
WP2855 | WikiPathways | Dopaminergic Neurogenesis | 6 | 3.28 | -7.44 | -4.91
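
As a reading aid for the "%" and "Log10(P)" columns, the sketch below recomputes a few entries of Table 3. The denominator of 183 annotated genes is not stated in the report; it is inferred here because it reproduces the reported percentages (for example 40/183 = 21.86%), so treat it as an assumption. The helper names are invented for the example.

```python
def percent_of_annotated(count: int, n_annotated: int) -> float:
    """Share of annotated input genes that fall in a term, as a percentage."""
    return round(100.0 * count / n_annotated, 2)


def p_from_log10(log10_p: float) -> float:
    """Convert a Log10(P) table entry back to a plain p-value."""
    return 10.0 ** log10_p


# Assumption: 183 of the 187 unique genes carry at least one ontology annotation.
N_ANNOTATED = 183

# (Count, reported %) pairs copied from the first rows of Table 3.
rows = [(40, 21.86), (31, 16.94), (20, 10.93), (33, 18.03)]
for count, reported in rows:
    print(count, percent_of_annotated(count, N_ANNOTATED), reported)

# The top term has Log10(P) = -28.95.
print(p_from_log10(-28.95))
```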

To further capture the relationships between the terms, a subset of enriched terms has been selected and rendered as a network plot, where terms with a similarity > 0.3 are connected by edges. We select the terms with the best p-values from each of the 20 clusters, with the constraint that there are no more than 15 terms per cluster and no more than 250 terms in total. The network is visualized using Cytoscape [5], where each node represents an enriched term and is colored first by its cluster ID (Figure 2.a) and then by its p-value (Figure 2.b). These networks can be interactively viewed in Cytoscape through the .cys files (contained in the Zip package, which also contains a publication-quality version as a PDF) or within a browser by clicking on the web icon. For clarity, term labels are only shown for one term per cluster, so it is recommended to use Cytoscape or a browser to visualize the network in order to inspect all node labels. We can also export the network into a PDF file within Cytoscape, and then edit the labels using Adobe Illustrator for publication purposes. To switch off all labels, delete the "Label" mapping under the "Style" tab within Cytoscape, and then export the network view.
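
The edge rule (connect two terms when their similarity exceeds 0.3) can be sketched with Cohen's kappa computed over gene membership. This is an illustration under assumptions: the report does not say which gene universe the kappa score is computed over, so the sketch uses a small hypothetical universe, and the function names and toy terms are invented.

```python
from itertools import combinations


def cohen_kappa(set_a, set_b, universe):
    """Cohen's kappa for the agreement of two gene sets over a gene universe."""
    a = set(set_a) & set(universe)
    b = set(set_b) & set(universe)
    n = len(set(universe))
    both = len(a & b)
    only_a = len(a - b)
    only_b = len(b - a)
    neither = n - both - only_a - only_b
    observed = (both + neither) / n
    expected = ((both + only_a) * (both + only_b)
                + (only_b + neither) * (only_a + neither)) / n ** 2
    if expected == 1.0:
        # Kappa is undefined here; returning 0.0 is a convention of this sketch.
        return 0.0
    return (observed - expected) / (1.0 - expected)


def similarity_edges(term_genes, universe, threshold=0.3):
    """List term pairs whose kappa similarity is above the threshold."""
    edges = []
    for name_a, name_b in combinations(sorted(term_genes), 2):
        if cohen_kappa(term_genes[name_a], term_genes[name_b], universe) > threshold:
            edges.append((name_a, name_b))
    return edges


# Hypothetical universe of ten genes and three toy terms.
universe = set(range(10))
term_genes = {"A": {0, 1, 2, 3}, "B": {0, 1, 2, 4}, "C": {7, 8, 9}}
edges = similarity_edges(term_genes, universe)
print(edges)
```

Terms A and B share three of their four genes and end up connected, while C shares none with either and stays isolated.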

Figure 2. Network of enriched terms: (a) colored by cluster ID, where nodes that share the same cluster ID are typically close to each other; (b) colored by p-value, where terms containing more genes tend to have a more significant p-value.

Protein-protein Interaction Enrichment Analysis

For each given gene list, protein-protein interaction enrichment analysis has been carried out with the following databases: STRING [6], BioGrid [7], OmniPath [8], InWeb_IM [9]. Only physical interactions in STRING (physical score > 0.132) and BioGrid are used. The resultant network contains the subset of proteins that form physical interactions with at least one other member in the list. If the network contains between 3 and 500 proteins, the Molecular Complex Detection (MCODE) algorithm [10] has been applied to identify densely connected network components. The MCODE networks identified for individual gene lists have been gathered and are shown in Figure 3.
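
The network-construction rule in this paragraph can be sketched as below. The sketch covers only the two filters the text spells out (the STRING physical score cut-off and BioGrid interactions) and the 3 to 500 protein size check; it does not implement MCODE itself. The function names, the edge tuple layout and the toy data are assumptions for illustration.

```python
def physical_subnetwork(gene_list, string_edges, biogrid_edges, min_score=0.132):
    """Keep physical interactions whose two partners are both in the gene list.

    string_edges: iterable of (gene_a, gene_b, physical_score) tuples.
    biogrid_edges: iterable of (gene_a, gene_b) tuples.
    """
    genes = set(gene_list)
    candidates = [(a, b) for a, b, score in string_edges if score > min_score]
    candidates.extend(biogrid_edges)
    edges = set()
    for a, b in candidates:
        # Drop self-interactions and partners outside the input list.
        if a != b and a in genes and b in genes:
            edges.add(tuple(sorted((a, b))))
    # Only proteins with at least one interaction remain in the network.
    nodes = {gene for edge in edges for gene in edge}
    return nodes, edges


def size_allows_mcode(nodes, lower=3, upper=500):
    """True when the network size falls in the range where MCODE is applied."""
    return lower <= len(nodes) <= upper


# Toy data with placeholder gene names.
genes = ["A", "B", "C", "D", "E"]
string_edges = [("A", "B", 0.9), ("B", "C", 0.1), ("C", "X", 0.8)]
biogrid_edges = [("C", "D"), ("D", "D")]

nodes, edges = physical_subnetwork(genes, string_edges, biogrid_edges)
print(sorted(nodes), sorted(edges), size_allows_mcode(nodes))
```

Gene E has no qualifying interaction in the toy data, so it drops out of the network even though it is in the input list.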

Pathway and process enrichment analysis has been applied to each MCODE component independently, and the three best-scoring terms by p-value have been retained as the functional description of the corresponding components, shown in the tables underneath corresponding network plots within Figure 3.

Figure 3. Protein-protein interaction network and MCODE components identified in the gene lists.
GO | Description | Log10(P)
GO:0048598 | embryonic morphogenesis | -26.4
WP2064 | Neural Crest Differentiation | -22.1
GO:0007423 | sensory organ development | -19.8

MCODE | GO | Description | Log10(P)
MCODE_1 | GO:0007423 | sensory organ development | -7.4
MCODE_1 | GO:0001708 | cell fate specification | -7.4
MCODE_1 | GO:0048598 | embryonic morphogenesis | -7.3
MCODE_2 | GO:0051301 | cell division | -14.1
MCODE_2 | GO:0140014 | mitotic nuclear division | -11.8
MCODE_2 | R-HSA-983189 | Kinesins | -11.1
MCODE_3 | R-HSA-2559584 | Formation of Senescence-Associated Heterochromatin Foci (SAHF) | -8.8
MCODE_3 | R-HSA-2559583 | Cellular Senescence | -7.9
MCODE_3 | R-HSA-2559586 | DNA Damage/Telomere Stress Induced Senescence | -6.7
MCODE_4 | GO:0048598 | embryonic morphogenesis | -8.5
MCODE_4 | GO:0009954 | proximal/distal pattern formation | -7.9
MCODE_4 | GO:0048562 | embryonic organ morphogenesis | -5.0
MCODE_5 | R-HSA-8948216 | Collagen chain trimerization | -11.3
MCODE_5 | M3005 | NABA COLLAGENS | -11.3
MCODE_5 | R-HSA-1442490 | Collagen degradation | -10.6
MCODE_6 | R-HSA-452723 | Transcriptional regulation of pluripotent stem cells | -9.3
MCODE_6 | GO:0035019 | somatic stem cell population maintenance | -7.8
MCODE_6 | GO:0019827 | stem cell population maintenance | -6.9
MCODE_7 | GO:0034653 | retinoic acid catabolic process | -11.6
MCODE_7 | GO:0016103 | diterpenoid catabolic process | -11.6
MCODE_7 | GO:0016115 | terpenoid catabolic process | -11.0

Quality Control and Association Analysis

Gene list enrichments are identified in the following ontology categories: Transcription_Factor_Targets, TRRUST, DisGeNET, PaGenBase, COVID. All genes in the genome have been used as the enrichment background. Terms with a p-value < 0.01, a minimum count of 3, and an enrichment factor > 1.5 (the enrichment factor is the ratio between the observed counts and the counts expected by chance) are collected and grouped into clusters based on their membership similarities. The top few enriched clusters (one term per cluster) are shown in Figures 4-8. The algorithm used here is the same as that used for pathway and process enrichment analysis.
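
The selection rule (apply the three thresholds, then show one term per cluster) can be sketched as follows. The record layout, function names and toy terms are hypothetical, and the cluster assignments are taken as given rather than computed; the sketch only illustrates the filtering and the choice of the most significant term in each cluster.

```python
def passes_thresholds(term, max_p=0.01, min_count=3, min_enrichment=1.5):
    """Apply the p-value, count and enrichment-factor cut-offs from the text."""
    return (term["p"] < max_p
            and term["count"] >= min_count
            and term["enrichment"] > min_enrichment)


def cluster_representatives(terms):
    """Return the lowest-p term of each cluster, most significant first."""
    best = {}
    for term in terms:
        if not passes_thresholds(term):
            continue
        cluster_id = term["cluster"]
        if cluster_id not in best or term["p"] < best[cluster_id]["p"]:
            best[cluster_id] = term
    return sorted(best.values(), key=lambda t: t["p"])


# Toy records; the names and numbers are placeholders, not report values.
toy_terms = [
    {"name": "term_1", "p": 1e-6, "count": 8, "enrichment": 4.0, "cluster": 1},
    {"name": "term_2", "p": 1e-4, "count": 6, "enrichment": 3.0, "cluster": 1},
    {"name": "term_3", "p": 5e-3, "count": 2, "enrichment": 9.0, "cluster": 2},
    {"name": "term_4", "p": 2e-3, "count": 4, "enrichment": 2.0, "cluster": 2},
    {"name": "term_5", "p": 5e-2, "count": 9, "enrichment": 1.2, "cluster": 3},
]

representatives = cluster_representatives(toy_terms)
print([t["name"] for t in representatives])
```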

Figure 4. Summary of enrichment analysis in Transcription_Factor_Targets [11].

GO | Description | Count | % | Log10(P) | Log10(q)
M30019 | HSD17B8 TARGET GENES | 40 | 22.00 | -27.00 | -22.00
M8172 | HNF6 Q6 | 17 | 9.30 | -12.00 | -8.50
M4815 | WTTGKCTG UNKNOWN | 19 | 10.00 | -8.80 | -5.40
M18169 | YATGNWAAT OCT C | 16 | 8.70 | -8.60 | -5.20
M6325 | SMAD3 Q6 | 13 | 7.10 | -8.10 | -4.80
M946 | GGATTA PITX2 Q2 | 19 | 10.00 | -7.90 | -4.70
M6490 | IPF1 Q4 | 13 | 7.10 | -7.80 | -4.60
M11587 | CHX10 01 | 12 | 6.60 | -7.40 | -4.30
M3902 | POU1F1 Q6 | 12 | 6.60 | -7.20 | -4.10
M5708 | OCT1 05 | 12 | 6.60 | -6.90 | -3.90
M1328 | WTGAAAT UNKNOWN | 18 | 9.80 | -6.80 | -3.90
M4225 | NFY 01 | 12 | 6.60 | -6.80 | -3.90
M9868 | NFY Q6 | 12 | 6.60 | -6.80 | -3.80
M14960 | POU3F2 02 | 12 | 6.60 | -6.70 | -3.80
M19088 | TST1 01 | 12 | 6.60 | -6.70 | -3.80
M15623 | SOX5 01 | 12 | 6.60 | -6.70 | -3.80
M4764 | TGTTTGY HNF3 Q6 | 19 | 10.00 | -6.30 | -3.50
M16345 | NKX3A 01 | 11 | 6.00 | -6.30 | -3.50
M6135 | POU6F1 01 | 11 | 6.00 | -6.30 | -3.50
M1224 | TCF4 Q5 | 11 | 6.00 | -6.20 | -3.40

Figure 5. Summary of enrichment analysis in TRRUST.

GO | Description | Count | % | Log10(P) | Log10(q)
TRR00230 | Regulated by: E2F1 | 6 | 3.30 | -3.50 | -1.40
TRR00233 | Regulated by: E2F4 | 3 | 1.60 | -3.30 | -1.20
TRR01419 | Regulated by: TP53 | 6 | 3.30 | -3.00 | -1.00
TRR01259 | Regulated by: SP3 | 5 | 2.70 | -2.80 | -0.92
TRR01277 | Regulated by: STAT3 | 5 | 2.70 | -2.40 | -0.57
TRR00270 | Regulated by: EP300 | 3 | 1.60 | -2.10 | -0.38

Figure 6. Summary of enrichment analysis in DisGeNET [12].

GO | Description | Count | % | Log10(P) | Log10(q)
C0376634 | Craniofacial Abnormalities | 25 | 14.00 | -22.00 | -18.00
C0008925 | Cleft Palate | 32 | 17.00 | -19.00 | -15.00
C0026010 | Microphthalmos | 18 | 9.80 | -11.00 | -7.30
C0000846 | Agenesis | 13 | 7.10 | -10.00 | -6.60
C0027794 | Neural Tube Defects | 16 | 8.70 | -9.80 | -6.20
C0038379 | Strabismus | 23 | 13.00 | -9.50 | -5.90
C1876203 | Frontonasal dysplasia | 6 | 3.30 | -9.30 | -5.80
C0432106 | Midline facial cleft - Tessier cleft 0 | 5 | 2.70 | -9.20 | -5.80
C0152423 | Congenital small ears | 11 | 6.00 | -8.80 | -5.40
C4049796 | Abnormality of cardiovascular system morphology | 12 | 6.60 | -8.10 | -4.80
C0018784 | Sensorineural Hearing Loss (disorder) | 22 | 12.00 | -8.00 | -4.80
C0020534 | Orbital separation excessive | 19 | 10.00 | -7.90 | -4.70
C0740404 | Limb defects | 8 | 4.40 | -7.90 | -4.70
C1836542 | Depressed nasal bridge | 16 | 8.70 | -7.70 | -4.50
C0266589 | Congenital ear anomaly NOS (disorder) | 10 | 5.50 | -7.60 | -4.50
C1306710 | Facial asymmetry | 9 | 4.90 | -7.40 | -4.30
C2981150 | Uranostaphyloschisis | 11 | 6.00 | -7.30 | -4.20
C0040588 | Tracheoesophageal Fistula | 8 | 4.40 | -7.30 | -4.20
C0005745 | Blepharoptosis | 18 | 9.80 | -7.10 | -4.10
C0033377 | Ptosis | 18 | 9.80 | -7.00 | -4.00

Figure 7. Summary of enrichment analysis in PaGenBase [13].

GO | Description | Count | % | Log10(P) | Log10(q)
PGB:00126 | Cell-specific: Testis cell | 5 | 2.70 | -6.20 | -3.40
PGB:00007 | Tissue-specific: colon | 11 | 6.00 | -4.80 | -2.50
PGB:00107 | Cell-specific: SHSYSY-RA | 4 | 2.20 | -3.70 | -1.50
PGB:00060 | Tissue-specific: retinoblastoma | 5 | 2.70 | -3.40 | -1.30
PGB:00101 | Tissue-specific: Colorectal adenocarcinoma | 3 | 1.60 | -2.60 | -0.75
PGB:00019 | Cell-specific: H1-hesc | 8 | 4.40 | -2.00 | -0.30

Figure 8. Summary of enrichment analysis in COVID [14].

GO | Description | Count | % | Log10(P) | Log10(q)
COVID054 | RNA_Xiong_PBMC_Up | 18 | 9.80 | -12.00 | -8.10
COVID007 | RNA_Blanco-Melo_A549_Down | 13 | 7.10 | -7.00 | -4.00
COVID015 | RNA_Blanco-Melo_Calu-3_Down | 11 | 6.00 | -5.30 | -2.80
COVID134 | Proteome_Stukalov_A549-ACE2_24h_Down | 10 | 5.50 | -4.60 | -2.20
COVID147 | Proteome_Stukalov_A549_72h_NSP13_Up | 6 | 3.30 | -3.90 | -1.70
COVID071 | Proteome_Bouhaddou_Vero_E6_24h_Down | 5 | 2.70 | -3.40 | -1.30
COVID315 | RNA_Wilk_CD8+T-cells_patient-C6_Up | 3 | 1.60 | -2.60 | -0.75
COVID313 | RNA_Wilk_CD8+T-cells_patient-C5_Up | 3 | 1.60 | -2.60 | -0.70
COVID049 | RNA_Wyler_Calu-3_24h_Down | 7 | 3.80 | -2.40 | -0.61
COVID347 | RNA_Wilk_B-cells_patient-C6_Up | 4 | 2.20 | -2.20 | -0.45
COVID185 | Proteome_Stukalov_A549_72h_S_Up | 3 | 1.60 | -2.10 | -0.35

References

  1. Zhou Y. et al. Metascape provides a biologist-oriented resource for the analysis of systems-level datasets. Nat. Commun. (2019) 10:1523.
  2. Zar J.H. Biostatistical Analysis, 4th edn. Prentice Hall, NJ (1999), p. 523.
  3. Hochberg Y., Benjamini Y. More powerful procedures for multiple significance testing. Stat. Med. (1990) 9:811-818.
  4. Cohen J. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. (1960) 20:27-46.
  5. Shannon P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. (2003) 13:2498-2504.
  6. Szklarczyk D. et al. STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. (2019) 47:D607-D613.
  7. Stark C. et al. BioGRID: a general repository for interaction datasets. Nucleic Acids Res. (2006) 34:D535-D539.
  8. Turei D. et al. OmniPath: guidelines and gateway for literature-curated signaling pathway resources. Nat. Methods (2016) 13:966-967.
  9. Li T. et al. A scored human protein-protein interaction network to catalyze genomic interpretation. Nat. Methods (2017) 14:61-64.
  10. Bader G.D. et al. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics (2003) 4:2.
  11. Subramanian A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U.S.A. (2005) 102:15545-15550.
  12. Pinero J. et al. DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. Nucleic Acids Res. (2017) 45:D833-D839.
  13. Pan J.B. et al. PaGenBase: a pattern gene database for the global and dynamic understanding of gene function. PLoS One (2013) 8:e80747.
  14. https://metascape.org/COVID.