The Exploration page allows users to explore data in the GDC using advanced filters/facets, including those at the gene and mutation level. Users choose filters on specific cases, genes, and/or mutations on the left of this page and can then visualize the results on the right. The gene and mutation data for these visualizations comes from the Open-Access MAF files on the GDC Portal.
Filters / Facets
On the left of this page, users can create advanced filters to narrow down results and build synthetic cohorts.
The first tab of filters is for cases in the GDC.
These criteria limit the results only to specific cases within the GDC. The default filters available are:
- Case: Specify individual cases using submitter ID (barcode), UUID, or list of Cases ('Case Set')
- Case Submitter ID: Search for cases using a part (prefix) of the submitter ID (barcode).
- Primary Site: Anatomical site of the cancer under investigation or review.
- Program: A cancer research program, typically consisting of multiple focused projects.
- Project: A cancer research project, typically part of a larger cancer research program.
- Disease Type: Type of cancer studied.
- Gender: Gender of the patient.
- Age at Diagnosis: Patient age at the time of diagnosis.
- Vital Status: Indicator of whether the patient was living or deceased at the date of last contact.
- Days to Death: Number of days from date of diagnosis to death of the patient.
- Race: Race of the patient.
- Ethnicity: Ethnicity of the patient.
In addition to the defaults, users can add more case filters by clicking on the link titled 'Add a Case Filter'.
Upload Case Set
In the Cases filters panel, instead of supplying cases one-by-one, users can supply a list of cases. Clicking on the Upload Case Set button will launch a dialog as shown below, where users can supply a list of cases or upload a comma-separated text file of cases.
After supplying a list of cases, a table will appear below the list indicating whether each case was found.
Clicking Submit will filter the results in the Exploration Page by those cases.
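The matching step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the portal's implementation: the function names, the choice to split on both commas and whitespace, and the submitter IDs are all assumptions made for the example.

```python
import re

def parse_identifiers(text):
    """Split pasted or uploaded text into unique identifiers, keeping input order.

    Assumption: identifiers are separated by commas and/or whitespace.
    """
    tokens = re.split(r"[,\s]+", text.strip())
    seen = set()
    result = []
    for token in tokens:
        if token and token not in seen:
            seen.add(token)
            result.append(token)
    return result

def match_report(identifiers, known_ids):
    """Return (identifier, found) pairs, similar to the found / not found table."""
    return [(ident, ident in known_ids) for ident in identifiers]

# Hypothetical submitter IDs, used only for illustration.
known_cases = {"TCGA-AA-0001", "TCGA-AA-0002"}
uploaded = "TCGA-AA-0001, TCGA-AA-0002,\nTCGA-ZZ-9999"

report = match_report(parse_identifiers(uploaded), known_cases)
for ident, found in report:
    print(f"{ident}\t{'found' if found else 'not found'}")
```

The same pattern would apply to gene and mutation lists, with a different set of known identifiers.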
The second tab of filters is for genes affected by mutations in the GDC. Users can filter by:
- Gene - A specific Gene Symbol, ID, or list of Genes ('Gene Set')
- Biotype - Classification of the type of gene according to Ensembl. The biotypes can be grouped into protein coding, pseudogene, long noncoding and short noncoding. Examples of biotypes in each group are as follows:
- Protein coding: IGC gene, IGD gene, IG gene, IGJ gene, IGLV gene, IGM gene, IGV gene, IGZ gene, nonsense mediated decay, nontranslating CDS, non stop decay, polymorphic pseudogene, TRC gene, TRD gene, TRJ gene.
- Pseudogene: disrupted domain, IGC pseudogene, IGJ pseudogene, IG pseudogene, IGV pseudogene, processed pseudogene, transcribed processed pseudogene, transcribed unitary pseudogene, transcribed unprocessed pseudogene, translated processed pseudogene, TRJ pseudogene, unprocessed pseudogene
- Long noncoding: 3prime overlapping ncrna, ambiguous orf, antisense, antisense RNA, lincRNA, ncrna host, processed transcript, sense intronic, sense overlapping
- Short noncoding: miRNA, miRNA_pseudogene, miscRNA, miscRNA pseudogene, Mt rRNA, Mt tRNA, rRNA, scRNA, snlRNA, snoRNA, snRNA, tRNA, tRNA_pseudogene
- Is Cancer Gene Census - Whether or not a gene is part of The Cancer Gene Census
Upload Gene Set
In the Genes filters panel, instead of supplying genes one-by-one, users can supply a list of genes. Clicking on the Upload Gene Set button will launch a dialog as shown below, where users can supply a list of genes or upload a comma-separated text file of genes.
After supplying a list of genes, a table will appear below the list indicating whether each gene was found.
Clicking Submit will filter the results in the Exploration Page by those genes.
The final tab of filters is for specific mutations.
Users can filter by:
- Mutation - Unique ID for that mutation. Users can use the following:
- UUID - c7c0aeaa-29ed-5a30-a9b6-395ba4133c63
- DNA Change - chr12:g.121804752delC
- COSMIC ID - COSM202522
- List of any mutation UUIDs or DNA Change IDs ('Mutation Set')
- Consequence Type - Consequence type of this variation; sequence ontology terms
- Impact - A subjective classification of the severity of the variant consequence. This information comes from the Ensembl VEP.
- Type - A general classification of the mutation
- Variant Caller - The variant caller used to identify the mutation
- COSMIC ID - The identifier of the gene or mutation maintained in COSMIC, the Catalogue Of Somatic Mutations In Cancer
- dbSNP rs ID - The reference SNP identifier maintained in dbSNP
Upload Mutation Set
In the Mutations filters panel, instead of supplying mutation IDs one-by-one, users can supply a list of mutations. Clicking on the Upload Mutation Set button will launch a dialog as shown below, where users can supply a list of mutations or upload a comma-separated text file of mutations.
After supplying a list of mutations, a table will appear below the list indicating whether each mutation was found.
Clicking Submit will filter the results in the Exploration Page by those mutations.
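Because the Mutation filter accepts several identifier formats, it can help to check which format each entry uses before uploading a list. The sketch below classifies entries with regular expressions derived only from the three examples given above (UUID, DNA Change, COSMIC ID). The patterns and the function name are assumptions and may not cover every identifier the portal accepts.

```python
import re

# Patterns inferred from the example identifiers in this section (assumptions).
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)
DNA_CHANGE_RE = re.compile(r"^chr[0-9A-Za-z]+:g\.\d+\S+$")
COSMIC_RE = re.compile(r"^COSM\d+$")

def classify_mutation_id(value):
    """Return 'uuid', 'dna_change', 'cosmic_id', or 'unknown' for one entry."""
    value = value.strip()
    if UUID_RE.match(value):
        return "uuid"
    if DNA_CHANGE_RE.match(value):
        return "dna_change"
    if COSMIC_RE.match(value):
        return "cosmic_id"
    return "unknown"

examples = [
    "c7c0aeaa-29ed-5a30-a9b6-395ba4133c63",
    "chr12:g.121804752delC",
    "COSM202522",
]
for entry in examples:
    print(entry, "->", classify_mutation_id(entry))
```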
As users add filters to the data on the Exploration Page, the Results section will automatically be updated. Results are divided into different tabs: Cases, Genes, Mutations, and OncoGrid.
To illustrate these tabs, Case, Gene, and Mutation filters have been chosen (genes in the Cancer Gene Census that have HIGH VEP Impact for the TCGA-BRCA project), and a description of what each tab displays follows.
The Cases tab gives an overview of all the cases/patients that correspond to the chosen filters (Cohort).
The top of this section contains a few pie charts with categorical information regarding the Primary Site, Project, Disease Type, Gender, and Vital Status.
Below these pie charts is a tabular view of cases (which can be exported, sorted, and saved using the buttons on the right) that includes the following information:
- Case ID (Submitter ID): The Case ID / submitter ID of that case/patient (i.e. TCGA Barcode)
- Project: The study name for the project to which the case belongs
- Primary Site: The primary site of the cancer/project
- Gender: The gender of the case
- Files: The total number of files available for that case
- Available Files per Data Category: Five columns displaying the number of files available in each of the five data categories. These link to the files for the specific case.
- # Mutations: The number of SSMs (simple somatic mutations) detected in that case
- # Genes: The number of genes affected by mutations in that case
Note: By default, the Case UUID is not displayed. You can display the UUID of the case by clicking on the icon with 3 parallel lines and choosing to display the Case UUID.
The Genes tab gives an overview of all the genes that match the criteria of the filters (Cohort).
The top of this section contains a survival plot of all the cases within the specified Exploration page search, in addition to a bar graph of the most frequently mutated genes. Hovering over each bar in the plot will display information about the percentage of cases affected. Users may choose to download the underlying data in JSON or TSV format, or an image of the graph in SVG or PNG format, by clicking the download icon at the top of each graph.
Below these graphs is a tabular view of the genes affected, which includes the following information:
- Symbol: The gene symbol, which links to the Gene Summary Page
- Name: Full name of the gene
- Cytoband: The location of the gene on the chromosome in terms of Giemsa-stained bands.
- Type: The type of gene
- # Affected Cases in Cohort: The number of cases affected in the Cohort
- # Affected Cases Across all Projects: The number of cases within all the projects in the GDC that contain a mutation on this gene. Clicking the red arrow will display the cases broken down by project
- # Mutations: The number of SSMs (simple somatic mutations) detected in that gene
- Annotations: Includes a COSMIC symbol if the gene belongs to The Cancer Gene Census
- Survival Analysis: An icon that, when clicked, will plot the survival rate between cases in the project with mutated and non-mutated forms of the gene
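The "# Affected Cases in Cohort" column and the bar graph of most frequently mutated genes both come down to counting, per gene, how many cases in the cohort carry at least one mutation in that gene. A minimal sketch of that calculation follows. The input layout (a mapping from case ID to a list of mutated genes) and all identifiers are hypothetical, and the portal may compute these values differently.

```python
from collections import Counter

def gene_frequencies(case_to_genes):
    """Return (gene, affected_cases, percent_of_cohort) rows, most frequent first."""
    total = len(case_to_genes)
    counts = Counter()
    for genes in case_to_genes.values():
        # A case counts once per gene, however many mutations it has in that gene.
        counts.update(set(genes))
    rows = [(gene, n, 100.0 * n / total) for gene, n in counts.items()]
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows

# Hypothetical cohort of four cases; case-4 has no mutated genes.
cohort = {
    "case-1": ["GENE_A", "GENE_B"],
    "case-2": ["GENE_A"],
    "case-3": ["GENE_B", "GENE_A", "GENE_A"],
    "case-4": [],
}

for gene, affected, percent in gene_frequencies(cohort):
    print(f"{gene}: {affected} / {len(cohort)} ({percent:.2f}%)")
```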
Survival analysis is used to analyze the occurrence of event data over time. In the GDC, survival analysis is performed on the mortality of the cases. Survival analysis requires:
- Data on the time to a particular event (days to death or last follow up)
- Fields: diagnoses.days_to_death and diagnoses.days_to_last_follow_up
- Information on whether the event has occurred (alive/deceased)
- Fields: diagnoses.vital_status
- Data split into different categories or groups (e.g. gender)
- Fields: demographic.gender
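As a rough illustration of how these fields feed a survival analysis, the sketch below turns one case record into a (time, event) pair. It assumes the record has already been flattened into a dictionary keyed by the last part of each field name, and that vital status is reported as "alive" or "deceased". Both are assumptions for the example, and the function name is hypothetical.

```python
def to_survival_record(case):
    """Return (time_in_days, event_occurred), or None if the data is incomplete.

    event_occurred is True for a death and False for a censored case
    (alive at last follow-up).
    """
    status = str(case.get("vital_status", "")).strip().lower()
    if status == "deceased":
        time = case.get("days_to_death")
        event = True
    elif status == "alive":
        time = case.get("days_to_last_follow_up")
        event = False
    else:
        return None
    if time is None:
        return None
    return (time, event)

# Hypothetical records, for illustration only.
print(to_survival_record({"vital_status": "Deceased", "days_to_death": 420}))
print(to_survival_record({"vital_status": "Alive", "days_to_last_follow_up": 900}))
print(to_survival_record({"vital_status": "Alive"}))
```

Cases that return None would simply be left out of the plot in this sketch.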
The survival analysis in the GDC uses a Kaplan-Meier estimator:
- S(t_i) is the estimated survival probability for any particular one of the t time periods
- n_i is the number of subjects at risk at the beginning of time period t_i
- d_i is the number of subjects who die during time period t_i
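For reference, the standard product-limit form of the Kaplan-Meier estimator, written with the symbols defined above, is shown here. It is the textbook formula and is assumed, not confirmed, to be the exact expression used by the GDC.

```latex
S(t_i) = S(t_{i-1}) \left( 1 - \frac{d_i}{n_i} \right), \qquad S(t_0) = 1
```

In words, the survival probability at each time period is the probability from the previous period multiplied by the fraction of at-risk subjects who survive the current period.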
The table below is an example data set to calculate survival for a set of seven cases:
The calculated cumulated survival probability can be plotted against the interval to obtain a survival plot like the one shown below.
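The cumulative survival calculation can be sketched as follows. This is a minimal, standard Kaplan-Meier implementation written for illustration. The seven (time, event) pairs are hypothetical values, not the data set referred to above, and the function is not taken from the GDC code base.

```python
def kaplan_meier(records):
    """Compute Kaplan-Meier survival estimates.

    records: list of (time, event) pairs, where event is True for a death
    and False for a censored observation.
    Returns a list of (time, n_at_risk, deaths, survival) tuples, one per
    distinct time at which at least one death occurred.
    """
    ordered = sorted(records)
    n_at_risk = len(ordered)
    survival = 1.0
    curve = []
    i = 0
    while i < len(ordered):
        t = ordered[i][0]
        deaths = 0
        removed = 0
        # Group all subjects that share the same time.
        while i < len(ordered) and ordered[i][0] == t:
            if ordered[i][1]:
                deaths += 1
            removed += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, n_at_risk, deaths, survival))
        # Deaths and censored subjects both leave the risk set.
        n_at_risk -= removed
    return curve

# Seven hypothetical cases: (days, death observed?).
cases = [(1, True), (3, True), (4, False), (5, True),
         (8, False), (9, True), (10, False)]

for t, n, d, s in kaplan_meier(cases):
    print(f"day {t}: at risk={n}, deaths={d}, S(t)={s:.3f}")
```

Plotting the survival value as a step function against time gives the familiar survival curve. Censored cases lower the number at risk without lowering the survival estimate.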
The Mutations tab gives an overview of all the mutations that match the criteria of the filters (Cohort).
At the top of this tab is a survival plot of all the cases within the specified exploration page filters.
A table is displayed below that lists information about each mutation:
- DNA Change: The chromosome and starting coordinates of the mutation are displayed along with the nucleotide differences between the reference and tumor allele
- Type: A general classification of the mutation
- Consequences: The effects the mutation has on the gene coding for a protein (e.g. synonymous, missense, non-coding transcript). A link to the Gene Summary Page for the gene affected by the mutation is included
- # Affected Cases in Cohort: The number of affected cases in the Cohort as a fraction and as a percentage
- # Affected Cases Across all Projects: The number of affected cases across all projects. Choosing the arrow next to the percentage will display a breakdown of each affected project
- Impact (VEP): A subjective classification of the severity of the variant consequence. This information comes from the Ensembl VEP. The categories are:
- HIGH (H): The variant is assumed to have high (disruptive) impact in the protein, probably causing protein truncation, loss of function, or triggering nonsense mediated decay
- MODERATE (M): A non-disruptive variant that might change protein effectiveness
- LOW (L): Assumed to be mostly harmless or unlikely to change protein behavior
- MODIFIER (MO): Usually non-coding variants or variants affecting non-coding genes, where predictions are difficult or there is no evidence of impact
- Survival Analysis: An icon that, when clicked, will plot the survival rate between the gene's mutated and non-mutated cases
Note: By default, the Mutation UUID is not displayed. You can display the UUID of the mutation by clicking on the icon with 3 parallel lines and choosing to display the Mutation UUID.
The Exploration page includes an OncoGrid plot of the cases with the most mutations, for the top 50 mutated genes affected by high impact mutations. Genes displayed on the left of the grid (Y-axis) correspond to individual cases on the bottom of the grid (X-axis).
The grid is color-coded with a legend at the top left which describes what type of mutation consequence is observed for each gene/case combination. Clinical information and the available data for each case are available at the bottom of the grid.
The right side of the grid displays additional information about the genes:
- Gene Sets: Describes whether a gene is part of The Cancer Gene Census. (The Cancer Gene Census is an ongoing effort to catalogue those genes for which mutations have been causally implicated in cancer)
- GDC: Identifies all cases in the GDC affected with a mutation in this gene
To facilitate readability and comparisons, drag-and-drop can be used to reorder the gene rows. Double clicking a row in the "# Cases Affected" bar at the right side of the graphic launches the respective Gene Summary Page. Hovering over a cell will display information about the mutation such as its ID, affected case, and biological consequence. Clicking on the cell will bring the user to the respective Mutation Summary page.
A tool bar at the top right of the graphic allows the user to export the data as a JSON object, PNG image, or SVG image. Seven buttons are available in this toolbar:
- Download: Users can choose to export the contents either to a static image file (PNG or SVG format) or the underlying data in JSON format
- Reload Grid: Sets all OncoGrid rows, columns, and zoom levels back to their initial positions
- Cluster Data: Clusters the rows and columns to place mutated genes with the same cases and cases with the same mutated genes together
- Toggle Heatmap: The view can be toggled between cells representing mutation consequences or number of mutations in each gene
- Toggle Gridlines: Turn the gridlines on and off
- Toggle Crosshairs: Turns crosshairs on, so that users can zoom into specific sections of the OncoGrid
- Fullscreen: Turns Fullscreen mode on/off
After using the Exploration Page to narrow down a specific cohort, users can find the specific files that relate to this group by clicking on the View Files in Repository button as shown in the image below.
Clicking this button will navigate the users to the Repository Page, filtered by the cases within the cohort.
The filters chosen on the Exploration page are displayed as an input set on the Repository page. Additional filters may be added on top of this input set, but the original set cannot be modified; to change it, the set must be created again from scratch.