Repository

Summary

The Repository Page is the primary method of accessing data in the GDC Data Portal. It provides an overview of all cases and files available in the GDC and offers users a variety of filters for identifying and browsing cases and files of interest. Users can access the Repository Page from the GDC Data Portal front page, from the Data Portal toolbar, or directly at https://portal.gdc.cancer.gov/repository.

Filters / Facets

On the left, a panel of data facets allows users to filter cases and files using a variety of criteria. If facet filters are applied, the tabs on the right will display information about matching cases and files. If no filters are applied, the tabs on the right will display information about all available data.

On the right, two tabs contain information about available data:

  • Files tab provides a list of files, select information about each file, and links to individual file detail pages.
  • Cases tab provides a list of cases, select information about each case, and links to individual case summary pages.

The banner above the tabs on the right displays any active facet filters and provides access to advanced search.

The top of the Repository Page contains a few summary pie charts for Primary Sites, Projects, Disease Type, Gender, and Vital Status. These reflect all available data or, if facet filters are applied, only the data that matches the filters. Clicking on a specific slice in a pie chart, or on a number in a table, applies corresponding facet filters.

Data View

Facets Panel

Facets represent properties of the data that can be used for filtering. The facets panel on the left allows users to filter the cases and files presented in the tabs on the right.

The facets panel is divided into two tabs: the Files tab contains facets pertaining to data files and experimental strategies, and the Cases tab contains facets pertaining to cases and biospecimen information. Users can apply filters in both tabs simultaneously. The applied filters are displayed in the banner above the tabs on the right, with the option to open the filter in Advanced Search to further refine the query.

The Getting Started section provides instructions on using facet filters. In the following example, a filter from the Cases tab ("primary site") and filters from the Files tab ("data category", "experimental strategy") are both applied:

Facet Filters Applied in Data View
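For readers who prefer to see the same selection as data, the sketch below expresses the three facet filters from the example as a nested structure in which each facet is an "in" condition and the facets are combined with "and". The field names (`cases.primary_site`, `files.data_category`, `files.experimental_strategy`), the example values, and the helper names `in_filter` and `combine` are assumptions made for illustration; they are not taken from this guide.

```python
import json


def in_filter(field, values):
    """One facet selection: `field` must take one of `values`."""
    return {"op": "in", "content": {"field": field, "value": list(values)}}


def combine(*facets):
    """Facets applied together narrow the results, so they are AND-ed."""
    return {"op": "and", "content": list(facets)}


# Field names and values are illustrative assumptions, not confirmed by this guide.
filters = combine(
    in_filter("cases.primary_site", ["Kidney"]),                    # Cases tab facet
    in_filter("files.data_category", ["Transcriptome Profiling"]),  # Files tab facet
    in_filter("files.experimental_strategy", ["RNA-Seq"]),          # Files tab facet
)

print(json.dumps(filters, indent=2))
```

Selecting several values within one facet would correspond to a longer `value` list in a single "in" condition, while adding another facet adds another entry to the "and" list.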

The default set of facets is listed below.

Files facets tab:

  • File: Specify individual files using filename or UUID.
  • Data Category: A high-level data file category, such as "Raw Sequencing Data" or "Transcriptome Profiling".
  • Data Type: Data file type, such as "Aligned Reads" or "Gene Expression Quantification". Data Type is more granular than Data Category.
  • Experimental Strategy: Experimental strategies used for molecular characterization of the cancer.
  • Workflow Type: Bioinformatics workflow used to generate or harmonize the data file.
  • Data Format: Format of the data file.
  • Platform: Technological platform on which experimental data was produced.
  • Access Level: Indicator of whether access to the data file is open or controlled.

Cases facets tab:

  • Case: Specify individual cases using submitter ID (barcode) or UUID.
  • Case Submitter ID Prefix: Search for cases using a part (prefix) of the submitter ID (barcode).
  • Primary Site: Anatomical site of the cancer under investigation or review.
  • Cancer Program: A cancer research program, typically consisting of multiple focused projects.
  • Project: A cancer research project, typically part of a larger cancer research program.
  • Disease Type: Type of cancer studied.
  • Gender: Gender of the patient.
  • Age at Diagnosis: Patient age at the time of diagnosis.
  • Vital Status: Indicator of whether the patient was living or deceased at the date of last contact.
  • Days to Death: Number of days from date of diagnosis to death of the patient.
  • Race: Race of the patient.
  • Ethnicity: Ethnicity of the patient.

Adding Custom Facets

The Repository Page provides access to additional data facets beyond those listed above. Facets corresponding to additional properties listed in the GDC Data Dictionary can be added using the "add a filter" links available at the top of the Cases and Files facet tabs:

Add a Facet

The links open a search window that allows the user to find an additional facet by name or description. Not all facets have values available for filtering; checking the "Only show fields with values" checkbox will limit the search results to only those that do. Selecting a facet from the list of search results below the search box will add it to the facets panel.

Search for a Facet
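The behavior of the facet search window can be approximated with a short sketch: match the query against each facet's name or description, and optionally keep only facets that have values. The `Facet` class, the `search_facets` function, and the sample catalog are hypothetical and exist only to illustrate the logic described above; the portal's actual matching rules may differ.

```python
from dataclasses import dataclass


@dataclass
class Facet:
    name: str
    description: str
    has_values: bool  # True when at least one value is available for filtering


def search_facets(facets, query, only_with_values=False):
    """Return facets whose name or description contains `query` (case-insensitive)."""
    needle = query.strip().lower()
    hits = []
    for facet in facets:
        if only_with_values and not facet.has_values:
            continue  # mimics the "Only show fields with values" checkbox
        if needle in facet.name.lower() or needle in facet.description.lower():
            hits.append(facet)
    return hits


# Hypothetical catalog entries, for illustration only.
catalog = [
    Facet("example.tumor_stage", "Stage of the tumor at diagnosis", True),
    Facet("example.tumor_grade", "Grade of the tumor", False),
    Facet("example.year_of_birth", "Year the patient was born", True),
]

print([f.name for f in search_facets(catalog, "tumor")])
print([f.name for f in search_facets(catalog, "tumor", only_with_values=True)])
```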

Newly added facets will show up at the top of the facets panel and can be removed individually by clicking on the red cross to the right of the facet name. The default set of facets can be restored by clicking "Reset".

Customize Facet

Results

Files List

The Files tab on the right provides a list of available files and select information about each file. If facet filters are applied, the list includes only matching files. Otherwise, the list includes all data files available in the GDC Data Portal.

Files Tab

The File Name column includes links to file detail pages where the user can learn more about each file.

Users can add individual file(s) to the file cart using the cart button next to each file. Alternatively, all files that match the current facet filters can be added to the cart using the menu in the top left corner of the table:

Files Tab

Cases List

The Cases tab on the right provides a list of available cases and select information about each case. If facet filters are applied, the list includes only matching cases. Otherwise, the list includes all cases available in the GDC Data Portal.

Cases Tab

The list includes links to case summary pages in the Case UUID column, the Submitter ID (e.g., TCGA Barcode), and counts of the available file types for each case. Clicking on a count will apply facet filters to display the corresponding files.

The list also includes a shopping cart button, allowing the user to add all files associated with a case to the file cart for downloading at a later time:

Cases Tab, Add to Cart

After utilizing the Repository Page to narrow down a specific set of cases, users can continue to explore the mutations and genes affected by these cases by clicking the View Files in Repository button as shown in the image below.

Exploration File Navigation

Clicking this button takes users to the Exploration Page, filtered by the cases within the cohort.

Case Summary Page

The Case Summary page displays case details including the project and disease information, data files that are available for that case, and the experimental strategies employed. A button in the top-right corner of the page allows the user to add all files associated with the case to the file cart.

Case Page

Clinical and Biospecimen Information

The page also provides clinical and biospecimen information about that case. Links to export clinical and biospecimen information in JSON format are provided.

Case Page, Clinical and Biospecimen

For clinical records that support multiple records of the same type (Diagnoses, Family Histories, or Exposures), a UUID of the record is provided on the left-hand side of the corresponding tab, allowing the user to select the entry of interest.

A search filter just below the biospecimen section can be used to find and filter biospecimen data. The wildcard search highlights entities in the tree that match the characters typed. It searches both the case submitter ID and the additional metadata for each entity. For example, searching 'Primary Tumor' will highlight samples that match that type.

Biospecimen Search
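The search described above can be pictured as a walk over the biospecimen tree that checks each entity's submitter ID and metadata for the typed characters. The following is a minimal sketch under assumed data shapes: the dictionary keys (`submitter_id`, `metadata`, `children`), the sample identifiers, and the function name `matching_entities` are all hypothetical, and the portal's real matching may be more elaborate.

```python
def matching_entities(node, query, path=()):
    """Yield the path to every entity whose ID or metadata contains `query`."""
    needle = query.lower()
    here = path + (node["submitter_id"],)
    # Search the submitter ID together with every metadata value.
    fields = [node["submitter_id"], *map(str, node.get("metadata", {}).values())]
    if any(needle in field.lower() for field in fields):
        yield here
    for child in node.get("children", []):
        yield from matching_entities(child, query, here)


# Hypothetical biospecimen tree for one case.
case = {
    "submitter_id": "CASE-0001",
    "children": [
        {
            "submitter_id": "CASE-0001-S1",
            "metadata": {"sample_type": "Primary Tumor"},
            "children": [
                {"submitter_id": "CASE-0001-S1-P1", "metadata": {"entity": "portion"}},
            ],
        },
        {
            "submitter_id": "CASE-0001-S2",
            "metadata": {"sample_type": "Blood Derived Normal"},
        },
    ],
}

for hit in matching_entities(case, "primary tumor"):
    print(" > ".join(hit))
```

Returning the full path to each hit, rather than only the entity, makes it straightforward to expand the tree down to the highlighted node.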

Most Frequent Somatic Mutations

The Case Summary page also lists the mutations found in that particular case.

Case Page

The table lists the following information for each mutation:

  • DNA Change: The chromosome and starting coordinates of the mutation are displayed along with the nucleotide differences between the reference and tumor allele.
  • Type: A general classification of the mutation.
  • Consequences: The effects the mutation has on the gene coding for a protein (e.g., synonymous, missense, non-coding transcript).
  • # Affected Cases in Project: The number of affected cases, expressed as a number across all mutations within the Project.
  • # Affected Cases Across GDC: The number of affected cases, expressed as a number across all projects. Choosing the arrow next to the percentage will expand the selection with a breakdown of each affected project.
  • Impact (VEP): A subjective classification of the severity of the variant consequence. This information comes from the Ensembl VEP. The categories are:
      • HIGH (H): The variant is assumed to have high (disruptive) impact in the protein, probably causing protein truncation, loss of function, or triggering nonsense-mediated decay.
      • MODERATE (M): A non-disruptive variant that might change protein effectiveness.
      • LOW (L): Assumed to be mostly harmless or unlikely to change protein behavior.
      • MODIFIER (MO): Usually non-coding variants or variants affecting non-coding genes, where predictions are difficult or there is no evidence of impact.
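When working with rows copied from this table, the impact abbreviations can be mapped back to their categories with a small enumeration. The sketch below assumes the order in which the categories are listed above is also a sensible severity order (with MODIFIER last), which is an assumption made for sorting purposes; the row dictionaries and the names `VepImpact`, `parse_impact`, and `sort_by_impact` are hypothetical.

```python
from enum import Enum


class VepImpact(Enum):
    HIGH = "H"
    MODERATE = "M"
    LOW = "L"
    MODIFIER = "MO"


# Assumed ordering for sorting: the order in which the categories are listed.
SEVERITY_ORDER = [VepImpact.HIGH, VepImpact.MODERATE, VepImpact.LOW, VepImpact.MODIFIER]


def parse_impact(label):
    """Accept either the full category name or its abbreviation, in any case."""
    text = label.strip().upper()
    for impact in VepImpact:
        if text in (impact.name, impact.value):
            return impact
    raise ValueError(f"unknown impact label: {label!r}")


def sort_by_impact(rows):
    """Order mutation rows from HIGH to MODIFIER (stable for ties)."""
    return sorted(rows, key=lambda row: SEVERITY_ORDER.index(parse_impact(row["impact"])))


# Hypothetical rows; only the "impact" key is used by the sketch.
rows = [
    {"mutation": "example-a", "impact": "MO"},
    {"mutation": "example-b", "impact": "H"},
    {"mutation": "example-c", "impact": "moderate"},
]

print([row["mutation"] for row in sort_by_impact(rows)])
```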

Clicking on the Open in Exploration button at the top right of this section will navigate the user to the Exploration page, filtered on this case.

File Summary Page

The File Summary page provides information about a data file, including file properties like size, md5 checksum, and data format; information on the type of data included; links to the associated case and biospecimen; and information about how the data file was generated or processed.
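One practical use of the md5 checksum shown on this page is confirming that a downloaded copy is intact. The sketch below uses Python's standard `hashlib` module to compute the checksum of a local file in chunks and compare it with the value copied from the page; the function names and the example file name and checksum are placeholders.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the md5 hex digest of a file without loading it all into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Compare a local file against the checksum copied from the File Summary page."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `matches_checksum("downloaded_file.bam", "<md5 from the page>")` returns `True` only when the local file's digest equals the listed value. Reading in chunks matters here because sequencing files can be far larger than available memory.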

The page also includes buttons to download the file, add it to the file cart, or (for BAM files) utilize the BAM slicing function.

Files Detail Page

In the lower section of the screen, the following tables provide more details about the file and its characteristics:

  • Associated Cases / Biospecimen: List of Cases or biospecimen the file is directly attached to.
  • Analysis and Reference Genome: Information on the workflow and reference genome used for file generation.
  • Read Groups: Information on the read groups associated with the file.
  • Metadata Files: Experiment metadata, run metadata, and analysis metadata associated with the file.
  • Downstream Analysis Files: List of downstream analysis files generated from the file.

Files Entity Page

Note: The Legacy Archive will not display the "Workflow, Reference Genome and Read Groups" sections (these sections are applicable to the GDC harmonization pipeline only). However, it may provide information on Archives and metadata files like MAGE-TABs and SRA XMLs. For more information, please refer to the section Legacy Archive.

BAM Slicing

BAM file detail pages have a "BAM Slicing" button. This function allows the user to specify a region of a BAM file for download. Clicking on it will open the BAM slicing window:

BAM Slicing Window

While the slice is being prepared, the icon on the BAM Slicing button spins; the file is offered for download as soon as it is ready.
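Before entering a region in the BAM slicing window, it can help to check that the region string is well formed. This guide does not state the exact syntax the window accepts, so the sketch below assumes the common `chromosome:start-end` convention with 1-based coordinates, where the coordinates are optional; the pattern and the function name `parse_region` are illustrative only.

```python
import re

# Assumed syntax: "chr7", "chr7:1000", or "chr7:1000-2000" (1-based coordinates).
REGION = re.compile(r"^(?P<chrom>[\w.]+)(?::(?P<start>\d+)(?:-(?P<end>\d+))?)?$")


def parse_region(text):
    """Split a region string into (chromosome, start, end); missing parts are None."""
    match = REGION.match(text.strip())
    if match is None:
        raise ValueError(f"not a valid region: {text!r}")
    chrom = match["chrom"]
    start = int(match["start"]) if match["start"] else None
    end = int(match["end"]) if match["end"] else None
    if start is not None and start < 1:
        raise ValueError("start must be 1 or greater")
    if end is not None and end < start:
        raise ValueError("end must not be smaller than start")
    return chrom, start, end


print(parse_region("chr7:140415750-140456100"))
```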