Submission Best Practices

Because of the data types and relationships included in the GDC, data submission can become a complex procedure. The purpose of this section is to present guidelines that will aid in the incorporation and harmonization of submitters' data. Please contact the GDC Help Desk at support@nci-gdc.datacommons.io if you have any questions or concerns regarding a submission project.

Date Obfuscation

The GDC is committed to providing accurate and useful information as well as protecting the privacy of patients if necessary. Following this, the GDC accepts time intervals that were transformed to remove information that could identify an individual but preserve clinically useful timelines. The GDC recommends following a series of HIPAA regulations regarding the reporting of age-related information, which can be downloaded here as a PDF.

General Guidelines

Actual calendar dates are not reported in GDC clinical fields but the lengths of time between events are preserved. Time points are reported based on the number of days since the patient's initial diagnosis. Events that occurred after the initial diagnosis are reported as positive and events that occurred before are reported as negative. Dates are not automatically obfuscated by the GDC validation system and submitters are required to make these changes in their clinical data. This affects these fields: days_to_birth, days_to_death, days_to_last_follow_up, days_to_last_known_disease_status, days_to_recurrence, days_to_treatment

Note: The day-based fields take leap years into account.

Patients Older than 90 Years and Clinical Events

Because of the low population number within the demographic of patients over 90 years old, it becomes more likely that patients can potentially be identified by a combination of their advanced age and publicly available clinical data. Because of this, patients over 90 years old are reported as exactly 90 years or 32,872 days old.

Following this, clinical events that occur over 32,872 days are also capped at 32,872 days. When timelines are capped, the priority should be to shorten the post-diagnosis values to preserve the accuracy of the age of the patient (except for patients who were diagnosed at over 90 years old). Values such as days_to_death and days_to_recurrence should be compressed before days_to_birth is compressed.

Examples Timelines

Example 1: An 88 year old patient is diagnosed with cancer and dies 13 years later. The days_to_birth value is less than 32,872 days, so it can be accurately reported. However, between the initial diagnosis and death, the patient turned 90 years old. Since 32,872 is the maximum, days_to_death would be calculated as 32872 - 32142 = 730.

Dates

Date of Birth: 01-01-1900

Date of Initial Diagnosis: 01-01-1988

Date of Death: 01-01-2001

Actual-Values

days_to_birth: -32142

days_to_death: 4748

Obfuscated-Values

days_to_birth: -32142

days_to_death: 730

Example 2: A 98 year old patient is diagnosed with cancer and dies three years later. Because days_to_X values are counted from initial diagnosis, days will be at their maximum value of 32,872 upon initial diagnosis. This will compress the later dates and reduce days_to_birth to -32,872 and days_to_death to zero.

Dates

Date of Birth: 01-01-1900

Date of Initial Diagnosis: 01-01-1998

Date of Death: 01-01-2001

Actual-Values

days_to_birth: -35794

days_to_death: 1095

Obfuscated-Values

days_to_birth: -32872

days_to_death: 0

Array Submission

Certain fields in the GDC, such as diagnosis.sites_of_involvement, are of type "array". This allows multiple values to be submitted on one property. These values need to be uploaded in a |-delimited format for TSV formatted uploads and a JSON-type array for JSON formatted uploads. See the example below.

Example (TSV): Kidney, Upper Pole|Kidney, Middle (would appear under sites_of_involvement header)
Example (JSON): "sites_of_involvement" : ["Kidney, Upper Pole", "Kidney, Middle"]

Submitting Complex Data Model Relationships

The GDC Data Model includes relationships in which more than one entity of one type can be associated with one entity of another type. For example, more than one read_group entity can be associated with a submitted_aligned_reads entity. JSON-formatted files, in which a list object can be used, are well-suited to represent this type of relationship. Tab-delimited (TSV) files require additional syntax to demonstrate these relationships. For example, associating a submitted_aligned_reads entity to three read groups would require three read_groups.submitter_id columns, each with the # symbol and a number appended to them. See the two files below:

TSVJSON

type    submitter_id    data_category   data_format data_type   experimental_strategy   file_name   file_size   md5sum  read_groups.submitter_id#1 read_groups.submitter_id#2  read_groups.submitter_id#3
submitted_aligned_reads Alignment.bam  Raw Sequencing Data BAM Aligned Reads   WGS test_alignment.bam    123456789  aa6e82d11ccd8452f813a15a6d84faf1    READ_GROUP_1  READ_GROUP_2  READ_GROUP_3

{
    "type": "submitted_aligned_reads",
    "submitter_id": "Alignment.bam",
    "data_category": "Raw Sequencing Data",
    "data_format": "BAM",
    "data_type": "Aligned Reads",
    "experimental_strategy": "WGS",
    "file_name": "test_alignment.bam",
    "file_size": 123456789,
    "md5sum": "aa6e82d11ccd8452f813a15a6d84faf1",
    "read_groups": [
        {"submitter_id": "READ_GROUP_1"},
        {"submitter_id": "READ_GROUP_2"},
        {"submitter_id": "READ_GROUP_3"}
    ]
}

Read groups

Submitting Read Group Names

The read_group entity requires a read_group_name field for submission. If the read_group entity is associated with a BAM file, the submitter should use the @RG ID present in the BAM header as the read_group_name. This is important for the harmonization process and will reduce the possibility of errors.

Multiple FASTQs from One Read Group

To align reads according to their direction and pair, the GDC requires that unaligned forward and reverse reads are submitted as "submitted_unaligned_reads." When more than one FASTQ exists for a read group direction, the GDC requires that the FASTQ files are concatenated for each direction. In other words, each paired-end read group should be associated with exactly two FASTQ files (submitted_unaligned_reads).

Minimal and Recommended Read Group Information

In addition to the required properties on read_group we also recommend submitting flow_cell_barcode, lane_number and multiplex_barcode. This information can be used by our bioinformatics team and data downloaders to construct a Platform Unit (PU), which is a universally unique identifier that can be used to model various sequencing technical artifacts. More information can be found in the SAM specification PDF.

For projects with library strategies of targeted sequencing or WXS we also require information on the target capture protocol included on target_capture_kit.

If this information is not provided it may cause a delay in the processing of submitted data.

Additional read group information will benefit data users. Such information can be used by bioinformatics pipelines and will aid understanding and mitigation of batch effects. If available, you should also provide as many of the remaining read group properties as possible.

Submission File Quality Control

The GDC harmonization pipelines include multiple quality control steps to ensure the integrity of harmonized files and the efficient use of computational resources. For fastest turnaround of data processing we recommend that submitters perform basic QC of their data files prior to upload to identify any underlying data quality issues. This may include such tests as verifying expected genome coverage levels and sequencing quality.

Except for miRNA data submission, sequencing data for all other experimental strategy types (i.e. whole exome sequencing, whole genome sequencing, targeted sequencing and mRNA sequencing) do not need to be trimmed by submitters prior to submission, as tools used in GDC alignment workflows are capable of handling adaptors and low quality bases correctly.

Target Capture Kit Q and A

What is a Target Capture Kit?
Target capture kits contain reagents designed to enrich for and thus increase sequencing depth in certain genomic regions before library preparation. Two of the major methods to enrich genomic regions are by Hybridization and by PCR (Amplicon).
Why do we need Target Capture Kit information?
Target region information is important for DNA-Seq variant calling and filtering, and essential for Copy Number Alternation and other analyses. This information is only needed for the Experimental Strategies of WXS or Targeted Sequencing.
How do submitters provide this information?
There are 3 steps
- Step 1. The submitter should contact GDC User Service about any new Target Capture Kits that do not already exist in the GDC Dictionary. The GDC Bioinformatics and User Services teams will work together with the submitter to create a meaningful name for the kit and import this name and Target Region Bed File into the GDC.
- Step 2. The submitter can then select one and only one GDC Target Capture Kit for each read group during molecular data submission.
- Step 3. The submitter should also select the appropriate library_strategy and library_selection on the read_group entity.
What is a Target Region Bed File?
A Target Region Bed File is tab-delimited file describing the kit target region in bed format. The first 3 columns of such files are chrom, chromStart, and chromEnd. Note that by definition, bed files are 0-based or "left-open, right-closed", which means bed interval "chr1 10 20" only contains 10 bases on chr1, from the 11th to the 20th. In addition, submitters should also let GDC know the genome build (hg18, hg19 or GRCh38) of their bed files.
Is a Target Capture Kit uniquely defined by its Target Region Bed File?
Not necessarily. Sometimes, users or manufactures may want to augment an existing kit with additional probes, in order to capture more regions or simply improve the quality of existing regions. In the latter case, the bed file stays the same, but it is now a different Target Capture Kit and should be registered separately as described in Step 3 above.

Specifying Tumor Normal Pairs for Analysis

It is critical for many cancer bioinformatics pipelines to specify which normal sample to use to factor out germline variation. In particular, this is a necessary specification for all tumor normal paired variant calling pipelines. The following details describe how the GDC determines which normal sample to use for variant calling.

Every tumor aliquot will be used for variant calling. For example, if 10 WXS tumor aliquots are submitted, the GDC will produce 10 alignments and 10 VCFs for each variant calling pipeline.
If there is only one normal we will use that normal for variant calling
If there are multiple normals of the same experimental_strategy for a case:
- Users can specify which normal to use by specifying on the aliquot. To do so one of the following should be set to TRUE for the specified experimental strategy: selected_normal_low_pass_wgs, selected_normal_targeted_sequencing, selected_normal_wgs, or selected_normal_wxs.
- Or if no normal is specified the GDC will select the best normal for that patient based on the following criteria. This same logic will also be used if multiple normal are selected.
  - If a case has blood cancer we will use sample type in the following priority order:
    
    Blood Derived Normal > Bone Marrow Normal > Mononuclear Cells from Bone Marrow Normal > Fibroblasts from Bone Marrow Normal > Lymphoid Normal > Buccal Cell Normal > Solid Tissue Normal > EBV Immortalized Normal
  - If a case does not have blood cancer we will use sample type in the following priority order:
    
    Solid Tissue Normal > Buccal Cell Normal > Lymphoid Normal > Fibroblasts from Bone Marrow Normal > Mononuclear Cells from Bone Marrow Normal > Bone Marrow Normal > Blood Derived Normal > EBV Immortalized Normal
  - If there are still ties, we will choose the aliquot submitted first.
  - If there are no normals.
    - The GDC will not run tumor only variant calling pipeline by default. The submitter must specify one of the following properties as TRUE: no_matched_normal_low_pass_wgs, no_matched_normal_targeted_sequencing, no_matched_normal_wgs, no_matched_normal_wxs.

Note that we will only run variant calling for a particular tumor aliquot per experimental strategy once. You must make sure that the appropriate normal control is uploaded to the GDC when Requesting Submission. Uploading a different normal sample later will not result in reanalysis by the GDC.

Submission of Single-Cell RNA-Seq Data

For any submitter that is uploading scRNA-Seq data, please follow these guidelines:

If the data is single-nuclei RNA-Seq, please populate the associated aliquot field analyte_type with Nuclei RNA.
Please only submit the molecular files as submitted_unaligned_reads in FASTQ format.
When submitting molecular files, please submit files with file names that are in the acceptable format for CellRanger input. The acceptable format follows this regular expression: SampleName_S[\d*]_L00[\d]_R{1,2}_001.fastq.gz, for example: SampleName_S1_L001_R1_001.fastq.gz

Clinical Data Requirements

For the GDC to release a project there is a minimum number of clinical properties that are required. Minimal cross-project GDC requirements include age, gender, and diagnosis information. Other requirements may be added when the submitter is approved for submission to the GDC.

miRNA Submission

The GDC requires that miRNA reads be adapter-trimmed before being uploaded to the GDC because miRNA datasets can have different trimming schemas. Uploading untrimmed miRNA reads will result in unusably low miRNA quantifications.

Slide Image Submission

To submit slide images to GDC, it is a requirement that images should not contain any label/captions as well as no macro views for compliance with patient information confidentiality.