Upload Data

This guide details step-by-step procedures for the GDC Data Submission process and explains how each step relates to the GDC Data Model and its structure. The first sections break down the submission process and associate each step with the Data Model. Later sections describe strategies for expediting data submission and for using features of the Submission Portal.

GDC Data Model Basics

Pictured below is the submittable subset of the GDC Data Model: a roadmap for GDC data submission. The entities that make up the completed (submitted) portion of the submission process will be highlighted in blue.

GDC Data Model 1

Each entity type is represented with an oval in the above graphic. All submitted entities require a connection to another entity type, based on the GDC Data Model, and a submitter_id as an identifier.

The Case Entity and Clinical Data

The case is the center of the GDC Data Model and usually describes a specific patient. Each case is connected to a project. Different types of clinical data, such as diagnoses and exposures, are connected to the case to describe the case's attributes and medical information.

Biospecimen Data

One of the main features of the GDC is the genomic data harmonization workflow. Genomic data is connected to the case through biospecimen entities. The sample entity describes a biological piece of matter that originated from a case. Subsets of the sample, such as portions and analytes, can optionally be described. The aliquot originates from a sample or analyte and describes the nucleic acid extract that was sequenced. The read_group entity describes the resulting set of reads from one sequencing lane.

Experiment Data

Several types of experiment data can be uploaded to the GDC. The submitted_aligned_reads and submitted_unaligned_reads files are associated with the read_group entity, while array-based files such as submitted_tangent_copy_number are associated with the aliquot entity. Each of these file types is described in its respective entity submission and is uploaded separately using the API or the GDC Data Transfer Tool.

Program and Project Registration

Before submission can begin, the Program and Project must be approved and set by the GDC.

Program and Project Approval

Each new project must request submission access from the GDC. Before submission can commence, the project must be registered at dbGaP along with the eRA Commons IDs of all users who will be uploading data. All cases (i.e., patients) must also be registered in dbGaP for that particular project. Once these steps are complete, the GDC will grant submission access and create program and project names in consultation with the user, based on the rules outlined below.

Program and Project Naming Conventions

The program is assigned a program.name, which uniquely identifies that program. Each program may have multiple projects, and each project is assigned a project.code, which uniquely identifies that project. The project_id is the main identifier for the project in the GDC system and consists of the program.name followed by a dash and the project.code. For example:

program.name = TCGA
project.code = BRCA
project.project_id = TCGA-BRCA
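
The naming rule above can be sketched in a few lines of Python. The function name make_project_id is hypothetical and not part of any GDC tool; it only illustrates how the identifier is composed from the two parts.

```python
def make_project_id(program_name: str, project_code: str) -> str:
    """Join program.name and project.code with a dash, as described above."""
    return f"{program_name}-{project_code}"


print(make_project_id("TCGA", "BRCA"))  # TCGA-BRCA
```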

Case Submission

GDC Data Model 2

The main entity of the GDC Data Model is the case, each of which must be registered beforehand with dbGaP under a unique submitter_id. The first step to submitting a case is to consult the Data Dictionary, which details the fields that are associated with a case, the fields that are required to submit a case, and the values that can populate each field. Dictionary entries are available for all entities in the GDC Data Model.

Dictionary Case

Submitting a Case entity requires:

  • submitter_id: A unique key to identify the case
  • projects.code: A link to the project

The submitter ID is different from the universally unique identifier (UUID), which is based on the UUID Version 4 Naming Convention. The UUID can be accessed under the <entity_type>_id field for each entity. For example, the case UUID can be accessed under the case_id field. The UUID is either assigned to each entity automatically or can be submitted by the user. Submitter-generated UUIDs cannot be uploaded in submittable_data_file entity types. See the Data Model Users Guide for more details about GDC identifiers.
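
As a rough illustration of the difference between a submitter ID and a UUID, the sketch below uses Python's standard uuid module to check whether a string parses as a version 4 UUID. The helper name is_uuid4 is an assumption made for this example; the GDC may apply additional checks that are not shown here.

```python
import uuid


def is_uuid4(value: str) -> bool:
    """Return True if value parses as an RFC 4122 version 4 UUID."""
    try:
        return uuid.UUID(value).version == 4
    except ValueError:
        # Raised when the string is not a well-formed UUID at all
        return False


print(is_uuid4("a053fad1-adc9-4f2d-8632-923579128985"))  # True
print(is_uuid4("PROJECT-INTERNAL-000055"))               # False
```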

The projects.code field connects the case entity to the project entity. The rest of the entity connections use the submitter_id field instead.

The case entity can be added in JSON or TSV format. A template for any entity in either of these formats can be found in the Data Dictionary at the top of each page. Templates populated with case metadata in both formats are displayed below.

{
    "type": "case",
    "submitter_id": "PROJECT-INTERNAL-000055",
    "projects": {
        "code": "INTERNAL"
    }
}
type  submitter_id  projects.code
case  PROJECT-INTERNAL-000055 INTERNAL   
Note: JSON and TSV formats handle links between entities (case and project) differently. JSON includes the code field nested within projects while TSV appends code to projects with a period.
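
The relationship between the two formats can be sketched as a small conversion. The helpers flatten_entity and to_tsv below are hypothetical and handle only the simple case shown above (one level of nesting, string or numeric values); links given as lists and boolean values would need extra handling, and the templates in the Data Dictionary remain the authoritative reference.

```python
def flatten_entity(entity: dict) -> dict:
    """Flatten one level of nested link objects into dotted column names."""
    flat = {}
    for key, value in entity.items():
        if isinstance(value, dict):
            # A link such as {"projects": {"code": ...}} becomes "projects.code"
            for sub_key, sub_value in value.items():
                flat[f"{key}.{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat


def to_tsv(entity: dict) -> str:
    """Render a single entity as a two-line, tab-separated table."""
    flat = flatten_entity(entity)
    header = "\t".join(flat.keys())
    row = "\t".join(str(value) for value in flat.values())
    return header + "\n" + row


case = {
    "type": "case",
    "submitter_id": "PROJECT-INTERNAL-000055",
    "projects": {"code": "INTERNAL"},
}
print(to_tsv(case))
```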

Uploading the Case Submission File

The file detailed above can be uploaded using the Data Submission Portal and the API as described below:

Upload - Data Submission Portal

1. Upload Files

An example of a case upload is detailed below. The GDC Data Submission Portal is equipped with a wizard window to facilitate the upload and validation of entities. The Upload Data Wizard comprises two stages:

  • Upload Entities: Upload an entity into the user's browser. At this point nothing is submitted to the project workspace.
  • Validate Entities: Send an entity to the GDC backend to validate its content (see below)

The Validate Entities stage acts as a safeguard against submitting incorrectly formatted data to the GDC Data Submission Portal. During the validation stage, the GDC API will validate the content of uploaded entities against the Data Dictionary to detect potential errors. Invalid entities will not be processed and must be corrected by the user and re-uploaded before being accepted. A validation error report provided by the system can be used to isolate and correct errors.

Choosing 'UPLOAD' from the project dashboard will open the Upload Data Wizard.

GDC Submission Wizard Upload Files

Files containing one or more entities can be added either by clicking on 'CHOOSE FILE(S)' or using drag and drop. Files can be removed from the Upload Data Wizard by clicking on the garbage can icon next to the file.

2. Validate Entities

When the first file is added, the wizard will move to the 'VALIDATE' section and the user can continue to add files.

GDC Submission Wizard Validate Files

When all files have been added, choosing 'VALIDATE' will run a test to check if the entities are valid for submission.

3. Commit or Discard Files

If the upload contains valid entities, a new transaction will appear in the latest transactions panel with the option to 'COMMIT' or 'DISCARD' the data. Entities contained in these files can be committed (applied) to the project or discarded using these two buttons.

If the upload contains invalid files, a transaction will appear with a FAILED status. Invalid files will need to be either corrected and re-uploaded or removed from the submission. If more than one file is uploaded and at least one is not valid, the validation step will fail for all files.

Commit_Discard

Upload - API

The API has a much broader range of functionality than the Data Wizard. Entities can be created, updated, and deleted through the API. See the API Submission User Guide for a more detailed explanation and for the rest of the functionalities of the API. Generally, uploading an entity through the API can be performed using a command similar to the following:

curl --header "X-Auth-Token: $token" --request POST --data @CASE.json "https://api.gdc.cancer.gov/v0/submission/GDC/INTERNAL/_dry_run?async=true"
CASE.json is detailed below.

{
    "type": "case",
    "submitter_id": "PROJECT-INTERNAL-000055",
    "projects": {
        "code": "INTERNAL"
    }
}
Note: Submission of TSV files is also supported by the GDC API.

Next, the file can either be committed (applied to the project) through the Data Submission Portal as before, or another API query can be performed that will commit the file to the project. The transaction number in the URL (467) is printed to the console during the first step of API submission and can also be retrieved from the 'Transactions' tab in the Data Submission Portal.

curl --header "X-Auth-Token: $token" --request POST "https://api.gdc.cancer.gov/v0/submission/GDC/INTERNAL/transactions/467/commit?async=true"
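
The same two-step flow (dry run, then commit) could be scripted. The following is a minimal Python sketch using only the standard library; the function names are hypothetical, the URLs mirror the curl commands above, and the response format is not assumed, so the raw bytes are returned for the caller to inspect.

```python
import urllib.request

API_ROOT = "https://api.gdc.cancer.gov/v0/submission"


def dry_run_url(program: str, project: str) -> str:
    """URL used above to validate entities without applying them."""
    return f"{API_ROOT}/{program}/{project}/_dry_run?async=true"


def commit_url(program: str, project: str, transaction_id: int) -> str:
    """URL used above to commit a previously validated transaction."""
    return f"{API_ROOT}/{program}/{project}/transactions/{transaction_id}/commit?async=true"


def post(url: str, token: str, body: bytes = b"") -> bytes:
    """POST body to url with the X-Auth-Token header; return the raw response."""
    request = urllib.request.Request(
        url, data=body, headers={"X-Auth-Token": token}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


print(dry_run_url("GDC", "INTERNAL"))
print(commit_url("GDC", "INTERNAL", 467))
```

A caller would first send the contents of CASE.json to the dry-run URL, read the transaction number from the response or from the Transactions tab, and then call post again with the commit URL.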

Clinical Submission

GDC Data Model Clinical

Typically a submission project will include additional information about a case such as demographic, diagnosis, or exposure data.

Submitting a Demographic Entity to a Case

The demographic entity contains information that characterizes the case entity.

Submitting a Demographic entity requires:

  • submitter_id: A unique key to identify the demographic entity
  • cases.submitter_id: The unique key that was used for the case that links the demographic entity to the case
  • ethnicity: An individual's self-described social and cultural grouping, specifically whether an individual describes themselves as Hispanic or Latino. The provided values are based on the categories defined by the U.S. Office of Management and Budget and used by the U.S. Census Bureau
  • gender: Text designations that identify gender. Gender is described as the assemblage of properties that distinguish people on the basis of their societal roles
  • race: An arbitrary classification of a taxonomic group that is a division of a species. It usually arises as a consequence of geographical isolation within a species and is characterized by shared heredity, physical attributes and behavior, and in the case of humans, by common history, nationality, or geographic distribution. The provided values are based on the categories defined by the U.S. Office of Management and Budget and used by the U.S. Census Bureau
  • year_of_birth: Numeric value to represent the calendar year in which an individual was born

{
    "type": "demographic",
    "submitter_id": "PROJECT-INTERNAL-000055-DEMOGRAPHIC-1",
    "cases": {
        "submitter_id": "PROJECT-INTERNAL-000055"
    },
    "ethnicity": "not hispanic or latino",
    "gender": "male",
    "race": "asian",
    "year_of_birth": 1946
}
type    submitter_id    cases.submitter_id  ethnicity   gender  race    year_of_birth
demographic PROJECT-INTERNAL-000055-DEMOGRAPHIC-1   PROJECT-INTERNAL-000055 not hispanic or latino  male    asian   1946
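
Before uploading, it can be useful to check locally that every required field is present. The sketch below is a hypothetical pre-check, not a replacement for GDC validation: the list of required fields is copied from the bullet list above, and the Data Dictionary remains the authoritative source for required fields and permitted values.

```python
# Required fields for a demographic entity, as listed above.
DEMOGRAPHIC_REQUIRED = [
    "submitter_id",
    "cases.submitter_id",
    "ethnicity",
    "gender",
    "race",
    "year_of_birth",
]


def missing_fields(entity: dict, required: list) -> list:
    """Return the required field names that are absent from a JSON-style entity."""
    missing = []
    for name in required:
        node = entity
        # Dotted names such as "cases.submitter_id" refer to nested link objects
        for part in name.split("."):
            if isinstance(node, dict) and part in node:
                node = node[part]
            else:
                missing.append(name)
                break
    return missing


demographic = {
    "type": "demographic",
    "submitter_id": "PROJECT-INTERNAL-000055-DEMOGRAPHIC-1",
    "cases": {"submitter_id": "PROJECT-INTERNAL-000055"},
    "ethnicity": "not hispanic or latino",
    "gender": "male",
    "race": "asian",
    "year_of_birth": 1946,
}
print(missing_fields(demographic, DEMOGRAPHIC_REQUIRED))  # []
```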

Submitting a Diagnosis Entity to a Case

Submitting a Diagnosis entity requires:

  • submitter_id: A unique key to identify the diagnosis entity
  • cases.submitter_id: The unique key that was used for the case that links the diagnosis entity to the case
  • age_at_diagnosis: Age at the time of diagnosis expressed in number of days since birth
  • classification_of_tumor: Text that describes the kind of disease present in the tumor specimen as related to a specific timepoint
  • days_to_last_follow_up: Time interval from the date of last follow up to the date of initial pathologic diagnosis, represented as a calculated number of days
  • days_to_last_known_disease_status: Time interval from the date of last follow up to the date of initial pathologic diagnosis, represented as a calculated number of days
  • days_to_recurrence: Time interval from the date of new tumor event including progression, recurrence and new primary malignancies to the date of initial pathologic diagnosis, represented as a calculated number of days
  • last_known_disease_status: The state or condition of an individual's neoplasm at a particular point in time
  • morphology: The third edition of the International Classification of Diseases for Oncology, published in 2000 used principally in tumor and cancer registries for coding the site (topography) and the histology (morphology) of neoplasms. The study of the structure of the cells and their arrangement to constitute tissues and, finally, the association among these to form organs. In pathology, the microscopic process of identifying normal and abnormal morphologic characteristics in tissues, by employing various cytochemical and immunocytochemical stains. A system of numbered categories for representation of data
  • primary_diagnosis: Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
  • progression_or_recurrence: Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
  • site_of_resection_or_biopsy: The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries for coding the site (topography) and the histology (morphology) of neoplasms. The description of an anatomical region or of a body part. Named locations of, or within, the body. A system of numbered categories for representation of data
  • tissue_or_organ_of_origin: Text term that describes the anatomic site of the tumor or disease
  • tumor_grade: Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
  • tumor_stage: The extent of a cancer in the body. Staging is usually based on the size of the tumor, whether lymph nodes contain cancer, and whether the cancer has spread from the original site to other parts of the body. The accepted values for tumor_stage depend on the tumor site, type, and accepted staging system. These items should accompany the tumor_stage value as associated metadata
  • vital_status: The survival state of the person registered on the protocol

{
    "type": "diagnosis",
    "submitter_id": "PROJECT-INTERNAL-000055-DIAGNOSIS-1",
    "cases": {
        "submitter_id": "GDC-INTERNAL-000099"
    },
    "age_at_diagnosis": 10256,
    "classification_of_tumor": "not reported",
    "days_to_last_follow_up": 34,
    "days_to_last_known_disease_status": 34,
    "days_to_recurrence": 45,
    "last_known_disease_status": "Tumor free",
    "morphology": "8260/3",
    "primary_diagnosis": "ACTH-producing tumor",
    "progression_or_recurrence": "no",
    "site_of_resection_or_biopsy": "Lung, NOS",
    "tissue_or_organ_of_origin": "Lung, NOS",
    "tumor_grade": "not reported",
    "tumor_stage": "stage i",
    "vital_status": "alive"
}
type    submitter_id    cases.submitter_id  age_at_diagnosis    classification_of_tumor days_to_last_follow_up  days_to_last_known_disease_status   days_to_recurrence  last_known_disease_status   morphology  primary_diagnosis   progression_or_recurrence   site_of_resection_or_biopsy tissue_or_organ_of_origin   tumor_grade tumor_stage vital_status
diagnosis   PROJECT-INTERNAL-000055-DIAGNOSIS-1 GDC-INTERNAL-000099 10256   not reported    34  34  45  Tumor free  8260/3  ACTH-producing tumor    no  Lung, NOS   Lung, NOS   not reported    stage i alive
Note: For information on submitting time-based data for patients over 90 years old see the GDC Submission Best Practices Guide.
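
Because age_at_diagnosis and the days_to_* fields are expressed in days, a quick conversion helps when sanity-checking values. The sketch below assumes an average year length of 365.25 days; that factor is an assumption made for illustration and is not stated by the GDC.

```python
DAYS_PER_YEAR = 365.25  # assumption: average year length including leap years


def days_to_years(days: float) -> float:
    """Convert a number of days to approximate years."""
    return days / DAYS_PER_YEAR


def years_to_days(years: float) -> int:
    """Convert years to a whole number of days (rounded)."""
    return round(years * DAYS_PER_YEAR)


# The age_at_diagnosis value used in the example above
print(round(days_to_years(10256), 1))  # 28.1
```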

Submitting an Exposure Entity to a Case

Submitting an Exposure entity does not require any information besides a link to the case and a submitter_id. The following fields are optionally included:

  • alcohol_history: A response to a question that asks whether the participant has consumed at least 12 drinks of any kind of alcoholic beverage in their lifetime
  • alcohol_intensity: Category to describe the patient's current level of alcohol use as self-reported by the patient
  • bmi: The body mass divided by the square of the body height expressed in units of kg/m^2
  • cigarettes_per_day: The average number of cigarettes smoked per day (number)
  • height: The height of the individual in cm (number)
  • weight: The weight of the individual in kg (number)
  • years_smoked: Numeric value (or unknown) to represent the number of years a person has been smoking

{
    "type": "exposure",
    "submitter_id": "PROJECT-INTERNAL-000055-EXPOSURE-1",
    "cases": {
        "submitter_id": "PROJECT-INTERNAL-000055"
    },
    "alcohol_history": "yes",
    "bmi": 27.5,
    "cigarettes_per_day": 20,
    "height": 190,
    "weight": 100,
    "years_smoked": 5
}
type    submitter_id    cases.submitter_id  alcohol_history bmi cigarettes_per_day  height  weight  years_smoked
exposure    PROJECT-INTERNAL-000055-EXPOSURE-1  PROJECT-INTERNAL-000055 yes 27.5    20  190 100 5
Note: Submitting a clinical entity uses the same conventions as submitting a case entity (detailed above).

Biospecimen Submission

Sample Submission

GDC Data Model 3

A sample submission has the same general structure as a case submission: it requires a unique key and a link to the case. However, sample entities require one additional value, sample_type, because the data cannot be interpreted without it. For example, an investigator using this data would need to know whether the sample came from tumor or normal tissue.

Dictionary Sample

Submitting a Sample entity requires:

  • submitter_id: A unique key to identify the sample
  • cases.submitter_id: The unique key that was used for the case that links the sample to the case
  • sample_type: Type of the sample. Named for its cellular source, molecular composition, and/or therapeutic treatment

Note: The case must be "committed" to the project before a sample can be linked to it. This also applies to all other links between entities.

{
    "type": "sample",
    "cases": {
        "submitter_id": "PROJECT-INTERNAL-000055"
    },
    "sample_type": "Blood Derived Normal",
    "submitter_id": "Blood-00001SAMPLE_55"
}
type    cases.submitter_id  submitter_id    sample_type
sample  PROJECT-INTERNAL-000055 Blood-00001SAMPLE_55    Blood Derived Normal  

Aliquot, Portion, and Analyte Submission

GDC Data Model 4

Submitting an Aliquot entity requires:

  • submitter_id: A unique key to identify the aliquot
  • samples.submitter_id or analytes.submitter_id: The unique key of the sample or analyte that the aliquot is linked to

{
    "type": "aliquot",
    "submitter_id": "Blood-00021-aliquot55",
    "samples": {
        "submitter_id": "Blood-00001SAMPLE_55"
    }
}

type    submitter_id    samples.submitter_id
aliquot Blood-00021-aliquot55   Blood-00001SAMPLE_55
Note: aliquot entities can be directly linked to sample entities. The portion and analyte entities described below are not required for submission.

Submitting a Portion entity requires:

  • submitter_id: A unique key to identify the portion
  • samples.submitter_id: The unique key that was used for the sample that links the portion to the sample

{
    "type": "portion",
    "submitter_id": "Blood-portion-000055",
    "samples": {
        "submitter_id": "Blood-00001SAMPLE_55"
    }
}

type    submitter_id    samples.submitter_id
portion Blood-portion-000055    Blood-00001SAMPLE_55
Submitting an Analyte entity requires:

  • submitter_id: A unique key to identify the analyte
  • portions.submitter_id: The unique key that was used for the portion that links the analyte to the portion
  • analyte_type: Protocol-specific molecular type of the specimen

{
    "type": "analyte",
    "portions": {
        "submitter_id": "Blood-portion-000055"
    },
    "analyte_type": "DNA",
    "submitter_id": "Blood-analyte-000055"
}

type    portions.submitter_id   analyte_type    submitter_id
analyte Blood-portion-000055    DNA Blood-analyte-000055

Read Group Submission

GDC Data Model 5

Because information about sequencing reads is necessary for downstream analysis, the read_group entity requires more fields than the other Biospecimen entities (sample, portion, analyte, aliquot).

Submitting a Read Group entity requires:

  • submitter_id: A unique key to identify the read_group
  • aliquots.submitter_id: The unique key that was used for the aliquot that links the read_group to the aliquot
  • experiment_name: Submitter-defined name for the experiment
  • is_paired_end: Are the reads paired end? (Boolean value: true or false)
  • library_name: Name of the library
  • library_strategy: Library strategy
  • platform: Name of the platform used to obtain data
  • read_group_name: The name of the read_group
  • read_length: The length of the reads (integer)
  • sequencing_center: Name of the center that provided the sequence files
  • library_selection: Library Selection Method
  • target_capture_kit: Description that can uniquely identify a target capture kit. Suggested value is a combination of vendor, kit name, and kit version.

{
    "type": "read_group",
    "submitter_id": "Blood-00001-aliquot_lane1_barcodeACGTAC_55",
    "experiment_name": "Resequencing",
    "is_paired_end": true,
    "library_name": "Solexa-34688",
    "library_strategy": "WXS",
    "platform": "Illumina",
    "read_group_name": "205DD.3-2",
    "read_length": 75,
    "sequencing_center": "BI",
    "library_selection": "Hybrid Selection",
    "target_capture_kit": "Custom MSK IMPACT Panel - 468 Genes",
    "aliquots":
        {
            "submitter_id": "Blood-00021-aliquot55"
        }    
}

type    submitter_id    experiment_name is_paired_end   library_name    library_strategy    platform    read_group_name read_length sequencing_center library_selection target_capture_kit  aliquots.submitter_id
read_group  Blood-00001-aliquot_lane1_barcodeACGTAC_55  Resequencing    true    Solexa-34688    WXS Illumina    205DD.3-2   75  BI  Hybrid Selection Custom MSK IMPACT Panel - 468 Genes Blood-00021-aliquot55
Note: Submitting a biospecimen entity uses the same conventions as submitting a case entity (detailed above).
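
Because a parent entity must be committed before a child can link to it, it can help to order entities by type before submitting them. The sketch below is one hypothetical way to do that; the order is inferred from the entity descriptions in this guide, and types that are not listed (for example clinical entities or data file entities) are simply placed after the listed ones.

```python
# Parent-before-child order, inferred from the sections above.
SUBMISSION_ORDER = ["case", "sample", "portion", "analyte", "aliquot", "read_group"]


def sort_for_submission(entities: list) -> list:
    """Return entities sorted so that parent types come before child types."""
    rank = {entity_type: i for i, entity_type in enumerate(SUBMISSION_ORDER)}
    # Unlisted types get the highest rank; sorted() is stable, so ties keep their order
    return sorted(entities, key=lambda e: rank.get(e["type"], len(SUBMISSION_ORDER)))


entities = [
    {"type": "read_group", "submitter_id": "Blood-00001-aliquot_lane1_barcodeACGTAC_55"},
    {"type": "aliquot", "submitter_id": "Blood-00021-aliquot55"},
    {"type": "sample", "submitter_id": "Blood-00001SAMPLE_55"},
    {"type": "case", "submitter_id": "PROJECT-INTERNAL-000055"},
]
for entity in sort_for_submission(entities):
    print(entity["type"], entity["submitter_id"])
```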

Experiment Data Submission

GDC Data Model 6

Before the experiment data file can be submitted, the GDC requires that the user provide information about the file as a submittable_data_file entity. This includes file-specific data needed to validate the file and assess which analyses should be performed. Sequencing data files can be submitted as submitted_aligned_reads or submitted_unaligned_reads.

Submitting a Submitted Aligned-Reads entity requires:

  • submitter_id: A unique key to identify the submitted_aligned_reads
  • read_groups.submitter_id: The unique key that was used for the read_group that links the submitted_aligned_reads to the read_group
  • data_category: Broad categorization of the contents of the data file
  • data_format: Format of the data files
  • data_type: Specific content type of the data file. (must be "Aligned Reads")
  • experimental_strategy: The sequencing strategy used to generate the data file
  • file_name: The name (or part of a name) of a file (of any type)
  • file_size: The size of the data file (object) in bytes
  • md5sum: The 128-bit hash value expressed as a 32 digit hexadecimal number used as a file's digital fingerprint

{
    "type": "submitted_aligned_reads",
    "submitter_id": "Blood-00001-aliquot_lane1_barcodeACGTAC_55.bam",
    "data_category": "Raw Sequencing Data",
    "data_format": "BAM",
    "data_type": "Aligned Reads",
    "experimental_strategy": "WGS",
    "file_name": "test.bam",
    "file_size": 38,
    "md5sum": "aa6e82d11ccd8452f813a15a6d84faf1",
    "read_groups": [
        {
            "submitter_id": "Primary_Tumor_RG_86-1"
        }
    ]
}
type    submitter_id    data_category   data_format data_type   experimental_strategy   file_name   file_size   md5sum  read_groups.submitter_id#1
submitted_aligned_reads Blood-00001-aliquot_lane1_barcodeACGTAC_55.bam  Raw Sequencing Data BAM Aligned Reads   WGS test.bam    38  aa6e82d11ccd8452f813a15a6d84faf1    Primary_Tumor_RG_86-1
Submitting a Submitted Unaligned-Reads entity requires:

  • submitter_id: A unique key to identify the submitted_unaligned_reads
  • read_groups.submitter_id: The unique key that was used for the read_group that links the submitted_unaligned_reads to the read_group
  • data_category: Broad categorization of the contents of the data file
  • data_format: Format of the data files
  • data_type: Specific content type of the data file. (must be "Unaligned Reads")
  • experimental_strategy: The sequencing strategy used to generate the data file
  • file_name: The name (or part of a name) of a file (of any type)
  • file_size: The size of the data file (object) in bytes
  • md5sum: The 128-bit hash value expressed as a 32 digit hexadecimal number used as a file's digital fingerprint

{
    "type": "submitted_unaligned_reads",
    "submitter_id": "Blood-00001-aliquot_lane2_barcodeACGTAC_55.fastq",
    "data_category": "Raw Sequencing Data",
    "data_format": "FASTQ",
    "data_type": "Unaligned Reads",
    "experimental_strategy": "WGS",
    "file_name": "test.fastq",
    "file_size": 38,
    "md5sum": "901d48b862ea5c2bcdf376da82f2d22f",
    "read_groups": [
        {
            "submitter_id": "Primary_Tumor_RG_86-1"
        }
    ]
}
type    submitter_id    data_category   data_format data_type   experimental_strategy   file_name   file_size   md5sum  read_groups.submitter_id
submitted_unaligned_reads   Blood-00001-aliquot_lane2_barcodeACGTAC_55.fastq    Raw Sequencing Data FASTQ   Unaligned Reads WGS test.fastq  38  901d48b862ea5c2bcdf376da82f2d22f    Primary_Tumor_RG_86-1
Note: For details on submitting experiment data associated with more than one read_group entity, see the GDC Submission Best Practices Guide.
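
The file_size and md5sum values must describe the actual file that will be uploaded. A minimal Python sketch for computing both with the standard library is shown below; the function name file_md5_and_size is hypothetical, and the file is read in chunks so that large BAM or FASTQ files do not need to fit in memory.

```python
import hashlib
import os


def file_md5_and_size(path: str, chunk_size: int = 1024 * 1024) -> tuple:
    """Return (md5 hex digest, size in bytes) for the file at path."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest(), os.path.getsize(path)
```

The two returned values can be copied into the md5sum and file_size fields of the submitted_aligned_reads or submitted_unaligned_reads entity.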

Uploading the Submittable Data File to the GDC

The submittable data file can be uploaded once it is registered with the GDC. A submittable data file is registered when its corresponding entity (e.g. submitted_unaligned_reads) is uploaded and committed. Uploading the file can be performed with either the GDC Data Transfer Tool or the API. Other types of data files such as clinical supplements, biospecimen supplements, and pathology reports are uploaded to the GDC in the same way. Supported data file formats are listed at the GDC Submitted Data Types and File Formats website.

GDC Data Transfer Tool: A file can be uploaded using its UUID (which can be retrieved from the portal or API) once it is registered. The following command can be used to upload the file:

gdc-client upload --project-id PROJECT-INTERNAL --identifier a053fad1-adc9-4f2d-8632-923579128985 -t $token -f $path_to_file
Additionally, a manifest can be downloaded from the Submission Portal and passed to the Data Transfer Tool, which allows more than one submittable_data_file to be uploaded:

gdc-client upload -m manifest.yml -t $token
API Upload: A submittable_data_file can be uploaded through the API by using the /submission/program/project/files endpoint. The following command would be typically used to upload a file:

curl --request PUT --header "X-Auth-Token: $token" https://api.gdc.cancer.gov/v0/submission/PROJECT/INTERNAL/files/6d45f2a0-8161-42e3-97e6-e058ac18f3f3 -d@$path_to_file

For more details on how to upload a submittable_data_file to a project see the API Users Guide and the Data Transfer Tool Users Guide.
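
When many files need to be uploaded one at a time, the Data Transfer Tool command shown above could be assembled by a script. The sketch below only builds the argument list, using exactly the flags that appear in the example command; build_upload_command is a hypothetical helper, and the token and path values are placeholders.

```python
import shlex


def build_upload_command(project_id: str, file_uuid: str, token: str, path: str) -> list:
    """Build the gdc-client upload argument list shown in the example above."""
    return [
        "gdc-client", "upload",
        "--project-id", project_id,
        "--identifier", file_uuid,
        "-t", token,
        "-f", path,
    ]


command = build_upload_command(
    "PROJECT-INTERNAL",
    "a053fad1-adc9-4f2d-8632-923579128985",
    "token.txt",  # placeholder value for $token
    "test.bam",   # placeholder value for $path_to_file
)
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run on a machine where gdc-client is installed.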

Metadata File Submission

GDC Data Model Metadata

The experiment_metadata entity contains information about the experiment that was performed to produce each read_group. Unlike the previous two entities outlined, only information about the experiment_metadata file itself (SRA XML) is applied to the entity (indexed) and the experiment_metadata file is submitted in the same way that a BAM file would be submitted.

Submitting an Experiment Metadata entity requires:

  • submitter_id: A unique key to identify the experiment_metadata entity
  • read_groups.submitter_id: The unique key that was used for the read_group that links the experiment_metadata entity to the read_group
  • data_category: Broad categorization of the contents of the data file
  • data_format: Format of the data files. (must be "SRA XML")
  • data_type: Specific contents of the data file. (must be "Experiment Metadata")
  • file_name: The name (or part of a name) of a file (of any type)
  • file_size: The size of the data file (object) in bytes
  • md5sum: The 128-bit hash value expressed as a 32 digit hexadecimal number used as a file's digital fingerprint

{
    "type": "experiment_metadata",
    "submitter_id": "Blood-00001-aliquot_lane1_barcodeACGTAC_55-EXPERIMENT-1",
    "read_groups": {
        "submitter_id": "Primary_Tumor_RG_86-1"
    },
    "data_category": "Sequencing Data",
    "data_format": "SRA XML",
    "data_type": "Experiment Metadata",
    "file_name": "Experimental-data.xml",
    "file_size": 65498,
    "md5sum": "d79997e4de03b5a0311f0f2fe608c11d"
}
type    submitter_id    read_groups.submitter_id    data_category   data_format data_type   file_name   file_size   md5sum
experiment_metadata Blood-00001-aliquot_lane1_barcodeACGTAC_55-EXPERIMENT-1 Primary_Tumor_RG_86-1   Sequencing Data SRA XML Experiment Metadata Experimental-data.xml   65498   d79997e4de03b5a0311f0f2fe608c11d

Annotation Submission

The GDC Data Portal supports the use of annotations for any submitted entity or file. An annotation entity may include comments about why particular patients or samples are not present or why they may exhibit critical differences from others. Annotations include information that cannot be submitted to the GDC through other existing nodes or properties.

If a submitter would like to create an annotation please contact the GDC Support Team (support@nci-gdc.datacommons.io).

Deleting Submitted Entities

The GDC Data Submission Portal allows users to delete submitted entities from the project when the project is in an "OPEN" state. Files cannot be deleted while in the 'SUBMITTED' state. This section applies to entities that have been committed to the project. Entities that have not been committed can be removed from the project by choosing the DISCARD button. Entities can also be deleted using the API. See the API Submission Documentation for specific instructions.

NOTE: Entities associated with files uploaded to the GDC object store cannot be deleted until the associated file has been deleted. Users must utilize the GDC Data Transfer Tool to delete these files first.

Simple Deletion

If an entity was uploaded and has no related entities, it can be deleted from the Browse tab. Once the entity to be deleted is selected, choose the DELETE button in the right panel under "ACTIONS".


GDC Delete Unassociated Case


A message will then appear asking if you are sure about deleting the entity. Choosing the YES, DELETE button will remove the entity from the project, whereas choosing the NO, CANCEL button will return the user to the previous screen.


GDC Yes or No


Deletion with Dependents

If an entity has related entities, such as a case with multiple samples and aliquots, deletion takes one extra step.


GDC Delete Associated Case


Follow the 'Simple Deletion' method until the end. This action will appear in the Transactions tab as "Delete" with a "FAILED" state.


GDC Delete Failed


Choose the failed transaction and the right panel will show the list of entities related to the entity that was going to be deleted.


GDC Error Related


Selecting the DELETE ALL button at the bottom of the list will delete all of the related entities, their descendants, and the original entity.
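
To get a sense of which entities DELETE ALL would cover, the set of related entities can be sketched as a walk over parent-to-child links. The code below is purely illustrative: deletion_order and the children mapping are hypothetical, and the portal's internal behavior is not documented here.

```python
def deletion_order(root: str, children: dict) -> list:
    """Return root and all of its descendants, deepest entities first."""
    order = []

    def visit(node: str) -> None:
        # Visit every child before recording the node itself
        for child in children.get(node, []):
            visit(child)
        order.append(node)

    visit(root)
    return order


# Hypothetical parent -> children links, using submitter IDs from this guide
children = {
    "PROJECT-INTERNAL-000055": ["Blood-00001SAMPLE_55"],
    "Blood-00001SAMPLE_55": ["Blood-00021-aliquot55"],
    "Blood-00021-aliquot55": ["Blood-00001-aliquot_lane1_barcodeACGTAC_55"],
}
print(deletion_order("PROJECT-INTERNAL-000055", children))
```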

Submitted Data File Deletion

Any submittable_data_file that was uploaded erroneously is deleted separately from its associated entity using the GDC Data Transfer Tool. See the section on Deleting Data Files in the Data Transfer Tool Users Guide for specific instructions.

Updating Uploaded Entities

Before harmonization occurs, entities can be modified to update, add, or delete information. Below, these methods are outlined.

Updating or Adding Fields

Updated or additional fields can be applied to entities by re-uploading them through the Submission Portal or API. See below for an example of a case upload with a primary_site field being added and a disease_type field being updated.

Existing Entity:

{
"type":"case",
"submitter_id":"GDC-INTERNAL-000043",
"projects":{
  "code":"INTERNAL"
},
"disease_type": "Neuroblastoma"
}

The following entity would be submitted to update the existing one:

{
"type":"case",
"submitter_id":"GDC-INTERNAL-000043",
"projects":{
  "code":"INTERNAL"
},
"disease_type": "Germ Cell Neoplasms",
"primary_site": "Pancreas"
}

Guidelines:

  • The newly uploaded entity must contain the submitter_id of the existing entity so that the system updates the correct one.
  • All newly updated entities will be validated by the GDC Dictionary. All required fields must be present in the newly updated entity.
  • Fields that are not required do not need to be re-uploaded and will remain unchanged in the entity unless they are updated.
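The guidelines above can be modeled locally before re-uploading. The following Python sketch is a simplified illustration, not the GDC's validation logic: `preview_update` is a hypothetical helper, and the `required` tuple passed to it is an assumed stand-in for the required fields that the GDC Dictionary actually defines for each entity type.

```python
def preview_update(existing, resubmitted, required):
    """Model how a re-uploaded entity would combine with the stored one.

    `required` is an assumed list of required field names; the real list
    comes from the GDC Dictionary for the entity type.
    """
    # The submitter_id identifies which existing entity is updated.
    if resubmitted.get("submitter_id") != existing.get("submitter_id"):
        raise ValueError("submitter_id must match the existing entity")

    # Every required field must be present in the re-uploaded entity.
    missing = [name for name in required if name not in resubmitted]
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))

    # Optional fields left out of the re-upload keep their stored values.
    merged = dict(existing)
    merged.update(resubmitted)
    return merged


existing = {
    "type": "case",
    "submitter_id": "GDC-INTERNAL-000043",
    "projects": {"code": "INTERNAL"},
    "disease_type": "Neuroblastoma",
}
resubmitted = {
    "type": "case",
    "submitter_id": "GDC-INTERNAL-000043",
    "projects": {"code": "INTERNAL"},
    "disease_type": "Germ Cell Neoplasms",
    "primary_site": "Pancreas",
}
print(preview_update(existing, resubmitted, ("type", "submitter_id", "projects")))
```

A preview like this only helps catch omissions early; the submission itself is still validated by the GDC Dictionary.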

Deleting Optional Fields

It may be necessary to delete fields from uploaded entities. This can be performed through the API and can only be applied to optional fields. It also requires the UUID of the entity, which can be retrieved from the submission portal or using a GraphQL query.

In the example below, the primary_site and disease_type fields are removed from a case entity:

curl --header "X-Auth-Token: $token_string" --request DELETE --header "Content-Type: application/json" "https://api.gdc.cancer.gov/v0/submission/EXAMPLE/PROJECT/entities/7aab7578-34ff-5651-89bb-57aefdc4c4f8?fields=primary_site,disease_type"

Before:

{
"type":"case",
"submitter_id":"GDC-INTERNAL-000043",
"projects":{
  "code":"INTERNAL"
},
"disease_type": "Germ Cell Neoplasms",
"primary_site": "Pancreas"
}

After:

{
"type":"case",
"submitter_id":"GDC-INTERNAL-000043",
"projects":{
  "code":"INTERNAL"
}
}
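For scripted use, the same request can be assembled in Python. This is a minimal sketch that mirrors the curl command in this section: the URL layout and headers are taken from that command, while the helper names `field_delete_url` and `field_delete_request` are hypothetical. The request object is only built here, not sent.

```python
import urllib.request
from urllib.parse import quote

# Submission endpoint root, as it appears in the curl example.
API_ROOT = "https://api.gdc.cancer.gov/v0/submission"


def field_delete_url(program, project, uuid, fields):
    """Build the URL that removes optional fields from one entity."""
    if not fields:
        raise ValueError("at least one field name is required")
    path = "/".join(
        quote(part, safe="") for part in (program, project, "entities", uuid)
    )
    return f"{API_ROOT}/{path}?fields={','.join(fields)}"


def field_delete_request(url, token):
    """Wrap the URL in a DELETE request carrying the auth token."""
    return urllib.request.Request(
        url,
        method="DELETE",
        headers={"X-Auth-Token": token, "Content-Type": "application/json"},
    )


url = field_delete_url(
    "EXAMPLE",
    "PROJECT",
    "7aab7578-34ff-5651-89bb-57aefdc4c4f8",
    ["primary_site", "disease_type"],
)
print(url)
```

Passing the resulting request to `urllib.request.urlopen` would send it; that step is left out so the sketch can be run without a token.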

Strategies for Submitting in Bulk

Each submission in the previous sections was broken down by component to demonstrate the GDC Data Model structure. However, the submission of multiple entities at once is supported and encouraged. Two strategies for submitting data efficiently are discussed here.

Registering a BAM File: One Step

Registering a BAM file (or any other data file type) can be performed in one step by including all of the entities, from case to submitted_aligned_reads, in one file. See the example below:

[{
    "type": "case",
    "submitter_id": "PROJECT-INTERNAL-000055",
    "projects": {
        "code": "INTERNAL"
    }
},
{
    "type": "sample",
    "cases": {
        "submitter_id": "PROJECT-INTERNAL-000055"
    },
    "sample_type": "Blood Derived Normal",
    "submitter_id": "Blood-00001_55"
},
{
    "type": "portion",
    "submitter_id": "Blood-portion-000055",
    "samples": {
        "submitter_id": "Blood-00001_55"
    }
},
{
    "type": "analyte",
    "portions": {
        "submitter_id": "Blood-portion-000055"
    },
    "analyte_type": "DNA",
    "submitter_id": "Blood-analyte-000055"
},
{
    "type": "aliquot",
    "submitter_id": "Blood-00021-aliquot55",
    "analytes": {
        "submitter_id": "Blood-analyte-000055"
    }
},
{
    "type": "read_group",
    "submitter_id": "Blood-00001-aliquot_lane1_barcodeACGTAC_55",
    "experiment_name": "Resequencing",
    "is_paired_end": true,
    "library_name": "Solexa-34688",
    "library_strategy": "WXS",
    "platform": "Illumina",
    "read_group_name": "205DD.3-2",
    "read_length": 75,
    "sequencing_center": "BI",
    "aliquots":
        {
            "submitter_id": "Blood-00021-aliquot55"
        }    
},
{
    "type": "submitted_aligned_reads",
    "submitter_id": "Blood-00001-aliquot_lane1_barcodeACGTAC_55.bam",
    "data_category": "Raw Sequencing Data",
    "data_format": "BAM",
    "data_type": "Aligned Reads",
    "experimental_strategy": "WGS",
    "file_name": "test.bam",
    "file_size": 38,
    "md5sum": "aa6e82d11ccd8452f813a15a6d84faf1",
    "read_groups": [
        {
            "submitter_id": "Blood-00001-aliquot_lane1_barcodeACGTAC_55"
        }
    ]
}]

All of the entities are placed into a JSON list object:

[{"type": "case","submitter_id": "PROJECT-INTERNAL-000055","projects": {"code": "INTERNAL"}}, entity-2, entity-3]

The entities need not be in any particular order as they are validated together.

Note: Tab-delimited format is not recommended for 'one-step' submissions due to an inability of the format to accommodate multiple 'types' in one row.
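Before uploading a one-step file, it can be useful to confirm that every link inside it points at a submitter_id that is also in the file. The Python sketch below is a hypothetical pre-check, not part of the GDC tooling. The link names in `LINK_KEYS` are the ones used in the example above, and the check assumes that all linked entities are in the same file; a link to an entity that already exists in the project would be reported here even though it may be valid on submission.

```python
# Link properties that reference another entity by submitter_id,
# as used in the one-step example. "projects" is skipped because it
# references a project code rather than a submitter_id.
LINK_KEYS = ("cases", "samples", "portions", "analytes", "aliquots", "read_groups")


def unresolved_links(entities):
    """Return (entity, link, target) tuples whose target is not in the bundle."""
    known = {entity["submitter_id"] for entity in entities}
    problems = []
    for entity in entities:
        for key in LINK_KEYS:
            if key not in entity:
                continue
            targets = entity[key]
            # A link may be a single object or a list of objects.
            if isinstance(targets, dict):
                targets = [targets]
            for target in targets:
                if target["submitter_id"] not in known:
                    problems.append(
                        (entity["submitter_id"], key, target["submitter_id"])
                    )
    return problems


# A trimmed bundle, deliberately listed child-first: order does not matter.
bundle = [
    {
        "type": "sample",
        "submitter_id": "Blood-00001_55",
        "cases": {"submitter_id": "PROJECT-INTERNAL-000055"},
        "sample_type": "Blood Derived Normal",
    },
    {
        "type": "case",
        "submitter_id": "PROJECT-INTERNAL-000055",
        "projects": {"code": "INTERNAL"},
    },
]
print(unresolved_links(bundle))
```

An empty result means every link in the file resolves within the file.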

Submitting Numerous Cases

The GDC understands that submitters will have projects that comprise more entities than would be reasonable to individually parse into JSON-formatted files. Additionally, many investigators store large amounts of data in a tab-delimited format (TSV). In cases like this, we recommend parsing all entities of the same type into separate TSVs and submitting them by type.

For example, a user may want to submit 100 cases associated with 100 samples, 100 portions, 100 analytes, 100 aliquots, and 100 read_groups. Constructing and submitting 100 JSON files would be tedious and difficult to organize. Submitting one case TSV containing 100 cases, one sample TSV containing 100 samples, and so on for the remaining types requires only six TSVs, which can be formatted in programs such as Microsoft Excel or Google Sheets.
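Splitting a mixed collection of records into one TSV per entity type can also be scripted. The following Python sketch is one possible approach under stated assumptions: `entities_to_tsv` is a hypothetical helper, each record is a flat dictionary, and link columns are written with dotted names such as `cases.submitter_id`, which is assumed here to match the column headers of the GDC TSV templates and should be checked against them.

```python
import csv
import io


def entities_to_tsv(entities):
    """Group flat entity records by type and return {type: TSV text}."""
    by_type = {}
    for entity in entities:
        by_type.setdefault(entity["type"], []).append(entity)

    tsv_by_type = {}
    for entity_type, rows in by_type.items():
        # Column order follows first appearance across the rows.
        columns = []
        for row in rows:
            for name in row:
                if name not in columns:
                    columns.append(name)
        buffer = io.StringIO()
        writer = csv.DictWriter(
            buffer,
            fieldnames=columns,
            delimiter="\t",
            lineterminator="\n",
            restval="",  # leave a cell empty when a row lacks the field
        )
        writer.writeheader()
        writer.writerows(rows)
        tsv_by_type[entity_type] = buffer.getvalue()
    return tsv_by_type


# Hypothetical records; dotted link columns are an assumed convention.
records = [
    {"type": "case", "submitter_id": "C1", "projects.code": "INTERNAL"},
    {"type": "sample", "submitter_id": "S1", "cases.submitter_id": "C1",
     "sample_type": "Blood Derived Normal"},
    {"type": "case", "submitter_id": "C2", "projects.code": "INTERNAL"},
]
for entity_type, text in entities_to_tsv(records).items():
    print(entity_type)
    print(text)
```

Each returned string can be written to its own file and submitted separately, one entity type at a time.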

See the following example TSV files:

Download Previously Uploaded Metadata Files

The transaction page lists all previous transactions in the project. The user can download metadata files uploaded to the GDC workspace in the details section of the screen by selecting one transaction and scrolling to the "DOCUMENTS" section.

Transaction Original Files

Download Previously Uploaded Data Files

The only supported method to download data files previously uploaded to the GDC Submission Portal that have not been released yet is to use the API or the Data Transfer Tool. To retrieve data previously uploaded to the Submission Portal, you will need the data file's UUID. The UUIDs for submitted data files are located in the Submission Portal under the file's Summary section.

Submission Portal Summary View

The UUIDs can also be found in the manifest file accessible from the manifest button located on the file's Summary page.

Once the UUID(s) have been retrieved, the download process is the same as it is for data on our main portal. If the process is unfamiliar, please refer to the section on Downloading Data Using GDC File UUIDs for instructions.
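As a rough illustration of the API route, the Python sketch below builds an authenticated download request for a single UUID. It is a sketch under assumptions rather than a definitive procedure: it assumes the GDC API's general `data` endpoint is the one used for these files, `build_download_request` is a hypothetical helper, and the UUID and token shown are placeholders. The request is constructed but not sent.

```python
import urllib.request

# Assumed download endpoint of the GDC API; confirm against the API guide.
DATA_ENDPOINT = "https://api.gdc.cancer.gov/data"


def build_download_request(file_uuid, token):
    """Return a GET request for one data file, authenticated by token."""
    if not file_uuid:
        raise ValueError("a file UUID is required")
    return urllib.request.Request(
        f"{DATA_ENDPOINT}/{file_uuid}",
        headers={"X-Auth-Token": token},
    )


# Placeholder values for illustration only.
request = build_download_request(
    "7aab7578-34ff-5651-89bb-57aefdc4c4f8", "example-token"
)
print(request.get_method(), request.full_url)
```

Sending the request with `urllib.request.urlopen` and writing the response body to disk would complete the download. For large files, the Data Transfer Tool is likely the more practical choice.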

Note: When submittable data files are uploaded through the Data Transfer Tool they are not displayed as transactions.