Data Release Notes

Version Date
v13.0 September 27, 2018
v12.0 June 13, 2018
v11.0 May 21, 2018
v10.1 February 15, 2018
v10.0 December 21, 2017
v9.0 October 24, 2017
v8.0 August 22, 2017
v7.0 June 29, 2017
v6.0 May 9, 2017
v5.0 March 16, 2017
v4.0 October 31, 2016
v3.0 September 16, 2016
v2.0 August 9, 2016
v1.0 June 6, 2016

Data Release 13.0

  • GDC Product: Data
  • Release Date: September 27, 2018

New updates

  1. Three new projects are released to the GDC: VAREPOP-APOLLO (phs001374), CTSP-DLBCL1 (phs001184), and NCICCR-DLBCL (phs001444)
  2. TARGET WGS alignments are released. VCFs will be provided in a later release
  3. Clinical data was harmonized with ICD-O-3 terminology for TCGA properties case.primary_site, case.disease_type, diagnosis.primary_diagnosis, diagnosis.site_of_resection_or_biopsy, diagnosis.tissue_or_organ_of_origin
  4. Redaction annotations applied to 11 aliquots in TCGA-DLBC
  5. Redaction annotations applied to incorrectly trimmed miRNA file in the Legacy Archive

Complete lists of files for DR13.0 in the GDC Data Portal and the GDC Legacy Archive are found below:

Bugs Fixed Since Last Release

  • 253 Copy Number Segment and Masked Copy Number Segment files were released. These were skipped in DR 12.0
  • 36 Diagnostic TCGA slides were released. They were skipped in DR 12.0

Known Issues and Workarounds

  • 506 Copy Number Segment and 36 Slide Image files are designated as controlled-access on the GDC Data Portal. These files are actually open-access and will be downloadable without a token using this manifest.
  • 2 Copy Number Segment files from TCGA-TGCT do not appear on the GDC Portal. They can be downloaded with the Data Transfer Tool using the following UUIDs:
    • 6cd4ef5e-324a-4ace-8779-7a33bd559c83 - RAMPS_p_TCGA_Batch_430_NSP_GenomeWideSNP_6_E07_1538238.nocnv_grch38.seg.v2.txt
    • dfa89ee9-6ee5-460b-bd58-b5ca0e9cb7ac - RAMPS_p_TCGA_Batch_430_NSP_GenomeWideSNP_6_E07_1538238.grch38.seg.v2.txt
  • TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
    • TARGET-20-PASJGZ-04A-02D
    • TARGET-30-PAPTLY-01A-01D
    • TARGET-20-PAEIKD-09A-01D
    • TARGET-20-PASMYS-14A-02D
    • TARGET-20-PAMYAS-14A-02D
    • TARGET-10-PAPZST-09A-01D
  • Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
  • 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how the original samples were handled.
  • 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data Portal
  • Two tissue slide images are unavailable for download from GDC Data Portal
  • The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
  • There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
  • There are two cases with identical submitter_id TARGET-10-PARUYU
  • TARGET-MDLS cases do not have disease_type or primary_site populated
  • Some TARGET cases are missing days_to_last_follow_up
  • Some TARGET cases are missing age_at_diagnosis
  • Some TARGET files are not connected to all related aliquots
  • Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow (e.g. TARGET-20-PAMYAS-04A-03R).
  • FM-AD clinical and biospecimen supplement files have incorrect data format. They are listed as XLSX, but are in fact TSV files.
  • Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
  • Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
  • The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
    • All microarray data and metadata
    • All sequencing analyzed data and metadata
    • 1180 of 12063 sequencing runs of raw data
  • Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
  • Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
  • Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
  • BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
  • No data from TARGET-MDLS is available.
  • Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
  • SDF Files are not linked to Project or Case in the Legacy Archive
  • Two biotab files are not linked to Project or Case in the Legacy Archive
  • SDRF files are not linked to Project or Case in the Legacy Archive
  • Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
  • Tumor grade property is not populated
  • Progression_or_recurrence property is not populated
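
The download of the two TCGA-TGCT Copy Number Segment files listed above can be scripted. The following is a minimal sketch, assuming the Data Transfer Tool is installed as `gdc-client` on the PATH and that its `download` subcommand accepts file UUIDs as positional arguments. The helper name `build_download_command` is hypothetical and only assembles the command; it does not run it.

```python
import shlex

# UUIDs copied from the TCGA-TGCT known issue above.
TGCT_SEGMENT_UUIDS = [
    "6cd4ef5e-324a-4ace-8779-7a33bd559c83",  # ...nocnv_grch38.seg.v2.txt
    "dfa89ee9-6ee5-460b-bd58-b5ca0e9cb7ac",  # ...grch38.seg.v2.txt
]


def build_download_command(uuids, client="gdc-client"):
    """Return the argument list for one Data Transfer Tool download call.

    Assumption: the client is invoked as `<client> download <uuid> [<uuid> ...]`.
    """
    if not uuids:
        raise ValueError("at least one UUID is required")
    return [client, "download", *uuids]


command = build_download_command(TGCT_SEGMENT_UUIDS)
# Print the command in a form that can be pasted into a shell.
print(shlex.join(command))
```

The list returned by `build_download_command` can be passed to `subprocess.run(command, check=True)` once the printed command has been reviewed.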

Data Release 12.0

  • GDC Product: Data
  • Release Date: June 13, 2018

New updates

  1. Updated clinical and biospecimen XML files for TCGA cases are available in the GDC Data Portal. Equivalent Legacy Archive files may no longer be up to date.
  2. All biospecimen and clinical supplement files for TCGA projects formerly only found in the Legacy Archive have been updated and transferred to the GDC Data Portal. Equivalent Legacy Archive files and metadata retrieved from the API may no longer be up to date.
  3. Diagnostic slides from TCGA are now available in the GDC Data Portal and Slide Image Viewer. They were formerly only available in the Legacy Archive.
  4. Updated Copy Number Segment and Masked Copy Number Segment files are now available. These were generated using an improved mapping of hg38 coordinates for the Affymetrix SNP6.0 probe set.
  5. VCF files containing SNVs produced from TARGET WGS CGI data are available. The variant calls were initially produced by CGI and lifted over to hg38.

Updated files for this release are listed here. Complete lists of files for DR12.0 are available for the GDC Data Portal here and the GDC Legacy Archive here.

Bugs Fixed Since Last Release

  • TARGET NBL RNA-Seq data is now associated with the correct aliquot.

Known Issues and Workarounds

  • Some Copy Number Segment and Masked Copy Number Segment files were not replaced in DR 12.0. 253 files remain to be swapped in a later release
  • Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
  • 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how the original samples were handled.
  • 36 Diagnostic TCGA slides are not yet available in the active GDC Portal. They are still available in the GDC Legacy Archive.
  • 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data Portal
  • Two tissue slide images are unavailable for download from GDC Data Portal
  • The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
  • There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
  • There are two cases with identical submitter_id TARGET-10-PARUYU
  • TARGET-MDLS cases do not have disease_type or primary_site populated
  • Some TARGET cases are missing days_to_last_follow_up
  • Some TARGET cases are missing age_at_diagnosis
  • Some TARGET files are not connected to all related aliquots
  • Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow (e.g. TARGET-20-PAMYAS-04A-03R).
  • FM-AD clinical and biospecimen supplement files have incorrect data format. They are listed as XLSX, but are in fact TSV files.
  • Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
  • Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
  • The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
    • All microarray data and metadata
    • All sequencing analyzed data and metadata
    • 1180 of 12063 sequencing runs of raw data
  • Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
  • There are 11 cases in project TCGA-DLBC that are known to have incorrect WXS data in the GDC Data Portal. Impacted cases are listed below. This affects the BAMs and VCFs associated with these cases in the GDC Data Portal. Corrected BAMs can be found in the GDC Legacy Archive. Variants from affected aliquots appear in the protected MAFs with GDC_FILTER=ContEst to indicate a sample contamination problem, but are removed during the generation of the Somatic MAF file. In a later release we will supply corrected BAM, VCF, and MAF files for these cases. In the meantime, we advise you not to use any of the WXS files associated with these cases in the GDC Data Portal. A list of these files can be found here. Download list of affected files.
    • TCGA-FF-8062
    • TCGA-FM-8000
    • TCGA-G8-6324
    • TCGA-G8-6325
    • TCGA-G8-6326
    • TCGA-G8-6906
    • TCGA-G8-6907
    • TCGA-G8-6909
    • TCGA-G8-6914
    • TCGA-GR-7351
    • TCGA-GR-7353
  • Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
  • Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
  • BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
  • No data from TARGET-MDLS is available.
  • Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
  • SDF Files are not linked to Project or Case in the Legacy Archive
  • Two biotab files are not linked to Project or Case in the Legacy Archive
  • SDRF files are not linked to Project or Case in the Legacy Archive
  • Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
  • Tumor grade property is not populated
  • Progression_or_recurrence property is not populated
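
The barcode workaround for mislabeled TARGET bone marrow samples can be applied programmatically. The following is a minimal sketch, assuming the barcode layout seen in the example TARGET-20-PAMYAS-04A-03R, where the fourth dash-separated field begins with the two-digit sample code. The function names `sample_type_code` and `corrected_sample_type` are hypothetical.

```python
BONE_MARROW = "Recurrent Blood Derived Cancer - Bone Marrow"


def sample_type_code(barcode):
    """Return the two-digit sample code from a TARGET sample barcode.

    Assumption: the code is the first two characters of the fourth
    dash-separated field, as in TARGET-20-PAMYAS-04A-03R.
    """
    fields = barcode.split("-")
    if len(fields) < 4 or fields[0] != "TARGET":
        raise ValueError(f"not a TARGET sample barcode: {barcode!r}")
    return fields[3][:2]


def corrected_sample_type(barcode, listed_sample_type):
    """Trust sample code 04 over the listed sample_type; otherwise keep it."""
    if sample_type_code(barcode) == "04":
        return BONE_MARROW
    return listed_sample_type


print(corrected_sample_type(
    "TARGET-20-PAMYAS-04A-03R",
    "Recurrent Blood Derived Cancer - Peripheral Blood",
))
```

Only the -04 code named in the workaround is handled here; any other code leaves the listed sample_type unchanged.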

Data Release 11.0

  • GDC Product: Data
  • Release Date: May 21, 2018

New updates

  1. Updated miRNA files to remove QCFail reads. This included all BAM and downstream count files.
  2. TCGA Tissue slide images are now available in the GDC Data Portal. Previously these were found only in the Legacy Archive

Updated files for this release are listed here. Complete lists of files for DR11.0 are available for the GDC Data Portal here and the GDC Legacy Archive here.

Bugs Fixed Since Last Release

  • N/A

Known Issues and Workarounds

  • Two tissue slide images are unavailable for download from GDC Data Portal
  • RNA-Seq files for TARGET-NBL are attached to the incorrect aliquot. The BAM files contain the correct information in their header but the connection in the GDC to read groups and aliquots is incorrect. The linked file contains a mapping between the aliquots where files are currently associated and the aliquots where they should instead be associated (mapping file).
  • The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 were not replaced in DR10.0 and thus do not contain indels. However, the indels from this aliquot can be found in the MAF files and are displayed in the Exploration section in the Data Portal. These VCF files will be replaced in a later release.
  • There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
  • There are two cases with identical submitter_id TARGET-10-PARUYU
  • TARGET-MDLS cases do not have disease_type or primary_site populated
  • Some TARGET cases are missing days_to_last_follow_up
  • Some TARGET cases are missing age_at_diagnosis
  • Some TARGET files are not connected to all related aliquots
  • miRNA alignments include QC failed reads.
  • Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow (e.g. TARGET-20-PAMYAS-04A-03R).
  • FM-AD clinical and biospecimen supplement files have incorrect data format. They are listed as XLSX, but are in fact TSV files.
  • Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
  • Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
  • The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
    • All microarray data and metadata
    • All sequencing analyzed data and metadata
    • 1180 of 12063 sequencing runs of raw data
  • Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
  • There are 11 cases in project TCGA-DLBC that are known to have incorrect WXS data in the GDC Data Portal. Impacted cases are listed below. This affects the BAMs and VCFs associated with these cases in the GDC Data Portal. Corrected BAMs can be found in the GDC Legacy Archive. Variants from affected aliquots appear in the protected MAFs with GDC_FILTER=ContEst to indicate a sample contamination problem, but are removed during the generation of the Somatic MAF file. In a later release we will supply corrected BAM, VCF, and MAF files for these cases. In the meantime, we advise you not to use any of the WXS files associated with these cases in the GDC Data Portal. A list of these files can be found here. Download list of affected files.
    • TCGA-FF-8062
    • TCGA-FM-8000
    • TCGA-G8-6324
    • TCGA-G8-6325
    • TCGA-G8-6326
    • TCGA-G8-6906
    • TCGA-G8-6907
    • TCGA-G8-6909
    • TCGA-G8-6914
    • TCGA-GR-7351
    • TCGA-GR-7353
  • Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
  • Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
  • BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
  • No data from TARGET-MDLS is available.
  • Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
  • SDF Files are not linked to Project or Case in the Legacy Archive
  • Two biotab files are not linked to Project or Case in the Legacy Archive
  • SDRF files are not linked to Project or Case in the Legacy Archive
  • Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
  • Tumor grade property is not populated
  • Progression_or_recurrence property is not populated
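
Because the FM-AD clinical and biospecimen supplements are listed as XLSX but are in fact TSV files, they can be read as tab-separated text rather than with a spreadsheet reader. The following is a minimal sketch using only the Python standard library. The function name `read_supplement` and the column names in the sample text are hypothetical and do not reflect the real supplement layout.

```python
import csv
import io


def read_supplement(handle):
    """Parse a supplement as tab-separated text, ignoring its XLSX label."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Hypothetical content standing in for a downloaded supplement file.
sample = io.StringIO("case_id\tgender\ncase-1\tfemale\ncase-2\tmale\n")
rows = read_supplement(sample)
print(rows[0]["gender"])
```

For a file on disk, a handle opened with `open(path, newline="", encoding="utf-8")` can be passed in place of the `io.StringIO` object; the encoding is an assumption.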

Data Release 10.1

  • GDC Product: Data
  • Release Date: February 15, 2018

New updates

  1. Updated FM-AD clinical data to conform with Data Dictionary release v1.11

Bugs Fixed Since Last Release

None

Known Issues and Workarounds

  • RNA-Seq files for TARGET-NBL are attached to the incorrect aliquot. The BAM files contain the correct information in their header but the connection in the GDC to read groups and aliquots is incorrect. The linked file contains a mapping between the aliquots where files are currently associated and the aliquots where they should instead be associated (mapping file).
  • The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 were not replaced in DR10.0 and thus do not contain indels. However, the indels from this aliquot can be found in the MAF files and are displayed in the Exploration section in the Data Portal. These VCF files will be replaced in a later release.
  • There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
  • There are two cases with identical submitter_id TARGET-10-PARUYU
  • TARGET-MDLS cases do not have disease_type or primary_site populated
  • Some TARGET cases are missing days_to_last_follow_up
  • Some TARGET cases are missing age_at_diagnosis
  • Some TARGET files are not connected to all related aliquots
  • miRNA alignments include QC failed reads.
  • Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow (e.g. TARGET-20-PAMYAS-04A-03R).
  • FM-AD clinical and biospecimen supplement files have incorrect data format. They are listed as XLSX, but are in fact TSV files.
  • Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
  • Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
  • The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
    • All microarray data and metadata
    • All sequencing analyzed data and metadata
    • 1180 of 12063 sequencing runs of raw data
  • Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
  • There are 11 cases in project TCGA-DLBC that are known to have incorrect WXS data in the GDC Data Portal. Impacted cases are listed below. This affects the BAMs and VCFs associated with these cases in the GDC Data Portal. Corrected BAMs can be found in the GDC Legacy Archive. Variants from affected aliquots appear in the protected MAFs with GDC_FILTER=ContEst to indicate a sample contamination problem, but are removed during the generation of the Somatic MAF file. In a later release we will supply corrected BAM, VCF, and MAF files for these cases. In the meantime, we advise you not to use any of the WXS files associated with these cases in the GDC Data Portal. A list of these files can be found here. Download list of affected files.
    • TCGA-FF-8062
    • TCGA-FM-8000
    • TCGA-G8-6324
    • TCGA-G8-6325
    • TCGA-G8-6326
    • TCGA-G8-6906
    • TCGA-G8-6907
    • TCGA-G8-6909
    • TCGA-G8-6914
    • TCGA-GR-7351
    • TCGA-GR-7353
  • Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
  • Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
  • BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
  • No data from TARGET-MDLS is available.
  • Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
  • SDF Files are not linked to Project or Case in the Legacy Archive
  • Two biotab files are not linked to Project or Case in the Legacy Archive
  • SDRF files are not linked to Project or Case in the Legacy Archive
  • Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
  • Tumor grade property is not populated
  • Progression_or_recurrence property is not populated
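
The apparent discrepancy in sample counts between public MAF files from different pipelines can be checked by comparing the sample barcodes each file contains. The following is a minimal sketch, assuming tab-separated MAF text with optional leading `#` comment lines and a `Tumor_Sample_Barcode` column. The function name `maf_samples`, the sample names, and the file contents are hypothetical.

```python
import csv
import io


def maf_samples(handle):
    """Return the set of Tumor_Sample_Barcode values found in a MAF file."""
    # Drop comment lines so the first remaining line is the column header.
    data_lines = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t")
    return {row["Tumor_Sample_Barcode"] for row in reader}


# Hypothetical MAF contents for two pipelines run on the same project.
maf_one = io.StringIO(
    "#version example\n"
    "Hugo_Symbol\tTumor_Sample_Barcode\n"
    "TP53\tSAMPLE-A\n"
    "KRAS\tSAMPLE-B\n"
)
maf_two = io.StringIO(
    "#version example\n"
    "Hugo_Symbol\tTumor_Sample_Barcode\n"
    "TP53\tSAMPLE-A\n"
)

# Samples present in the first file but absent from the second.
only_in_one = maf_samples(maf_one) - maf_samples(maf_two)
print(sorted(only_in_one))
```

A sample that appears in only one set had no PASS variants in the other pipeline's public MAF, which is the omission described in the known issue.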

Data Release 10.0

  • GDC Product: Data
  • Release Date: December 21, 2017

New updates

  1. New TARGET files for all projects
  2. TARGET updates for clinical and biospecimen data
  3. Replace corrupted .bai files
  4. Update TCGA and TARGET MAF files to include VarScan2 indels and more information in all_effects column
  5. Update VarScan VCF files

Updated files for this release are listed here. Complete lists of files for DR10.0 are available for the GDC Data Portal here and the GDC Legacy Archive here.

Bugs Fixed Since Last Release

None

Known Issues and Workarounds

  • The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 were not replaced in DR10.0 and thus do not contain indels. However, the indels from this aliquot can be found in the MAF files and are displayed in the Exploration section in the Data Portal. These VCF files will be replaced in a later release.
  • There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
  • There are two cases with identical submitter_id TARGET-10-PARUYU
  • TARGET-MDLS cases do not have disease_type or primary_site populated
  • Some TARGET cases are missing days_to_last_follow_up
  • Some TARGET cases are missing age_at_diagnosis
  • Some TARGET files are not connected to all related aliquots
  • miRNA alignments include QC failed reads.
  • Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow (e.g. TARGET-20-PAMYAS-04A-03R).
  • FM-AD clinical and biospecimen supplement files have incorrect data format. They are listed as XLSX, but are in fact TSV files.
  • Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
  • Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
  • The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
    • All microarray data and metadata
    • All sequencing analyzed data and metadata
    • 1180 of 12063 sequencing runs of raw data
  • Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
  • There are 11 cases in project TCGA-DLBC that are known to have incorrect WXS data in the GDC Data Portal. Impacted cases are listed below. This affects the BAMs and VCFs associated with these cases in the GDC Data Portal. Corrected BAMs can be found in the GDC Legacy Archive. Variants from affected aliquots appear in the protected MAFs with GDC_FILTER=ContEst to indicate a sample contamination problem, but are removed during the generation of the Somatic MAF file. In a later release we will supply corrected BAM, VCF, and MAF files for these cases. In the meantime, we advise you not to use any of the WXS files associated with these cases in the GDC Data Portal. A list of these files can be found here. Download list of affected files.
    • TCGA-FF-8062
    • TCGA-FM-8000
    • TCGA-G8-6324
    • TCGA-G8-6325
    • TCGA-G8-6326
    • TCGA-G8-6906
    • TCGA-G8-6907
    • TCGA-G8-6909
    • TCGA-G8-6914
    • TCGA-GR-7351
    • TCGA-GR-7353
  • Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
  • Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
  • BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
  • No data from TARGET-MDLS is available.
  • Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
  • SDF Files are not linked to Project or Case in the Legacy Archive
  • Two biotab files are not linked to Project or Case in the Legacy Archive
  • SDRF files are not linked to Project or Case in the Legacy Archive
  • Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
  • Tumor grade property is not populated
  • Progression_or_recurrence property is not populated
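
The advice not to use WXS files from the 11 affected TCGA-DLBC cases can be enforced when assembling an analysis set. The following is a minimal sketch, assuming that the case identifier is the first three dash-separated fields of a TCGA barcode. The function names `case_id` and `drop_affected` and the example barcodes are hypothetical; the case list is copied from the known issue above.

```python
# The 11 TCGA-DLBC cases listed in the known issue above.
AFFECTED_DLBC_CASES = frozenset({
    "TCGA-FF-8062", "TCGA-FM-8000", "TCGA-G8-6324", "TCGA-G8-6325",
    "TCGA-G8-6326", "TCGA-G8-6906", "TCGA-G8-6907", "TCGA-G8-6909",
    "TCGA-G8-6914", "TCGA-GR-7351", "TCGA-GR-7353",
})


def case_id(barcode):
    """Return the case portion (first three fields) of a TCGA barcode."""
    return "-".join(barcode.split("-")[:3])


def drop_affected(barcodes):
    """Keep only barcodes whose case is not in the affected list."""
    return [b for b in barcodes if case_id(b) not in AFFECTED_DLBC_CASES]


# Hypothetical barcodes: the first belongs to an affected case.
kept = drop_affected([
    "TCGA-G8-6324-01A-EXAMPLE",
    "TCGA-XX-0000-01A-EXAMPLE",
])
print(kept)
```

The filter applies only to WXS-derived files in the GDC Data Portal, since the known issue concerns WXS data for these cases.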

Data Release 9.0

  • GDC Product: Data
  • Release Date: October 24, 2017

New updates

  1. Foundation Medicine Data Release
    • This includes controlled-access VCF and MAF files as well as clinical and biospecimen supplements and metadata.
    • Original Foundation Medicine supplied data can be found on the Foundation Medicine Project Page.
  2. Updated RNA-Seq data for TARGET NBL
    • Includes new BAM and count files

Updated files for this release are listed here. A complete list of files for DR9.0 is listed here.

Bugs Fixed Since Last Release

None

Known Issues and Workarounds

  • miRNA alignments include QC failed reads.
  • Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow (e.g. TARGET-20-PAMYAS-04A-03R).
  • FM-AD clinical and biospecimen supplement files have incorrect data format. They are listed as XLSX, but are in fact TSV files.
  • Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
  • Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
  • The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
    • All microarray data and metadata
    • All sequencing analyzed data and metadata
    • 1180 of 12063 sequencing runs of raw data
  • Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
  • There are 11 cases in project TCGA-DLBC that are known to have incorrect WXS data in the GDC Data Portal. Impacted cases are listed below. This affects the BAMs and VCFs associated with these cases in the GDC Data Portal. Corrected BAMs can be found in the GDC Legacy Archive. Variants from affected aliquots appear in the protected MAFs with GDC_FILTER=ContEst to indicate a sample contamination problem, but are removed during the generation of the Somatic MAF file. In a later release we will supply corrected BAM, VCF, and MAF files for these cases. In the meantime, we advise you not to use any of the WXS files associated with these cases in the GDC Data Portal. A list of these files can be found here. Download list of affected files.
    • TCGA-FF-8062
    • TCGA-FM-8000
    • TCGA-G8-6324
    • TCGA-G8-6325
    • TCGA-G8-6326
    • TCGA-G8-6906
    • TCGA-G8-6907
    • TCGA-G8-6909
    • TCGA-G8-6914
    • TCGA-GR-7351
    • TCGA-GR-7353
  • Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
  • Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
  • BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
  • No data from TARGET-MDLS is available.
  • Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
  • SDF Files are not linked to Project or Case in the Legacy Archive
  • Two biotab files are not linked to Project or Case in the Legacy Archive
  • SDRF files are not linked to Project or Case in the Legacy Archive
  • Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
  • Tumor grade property is not populated
  • Progression_or_recurrence property is not populated
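
The affected TCGA-DLBC cases listed above can be screened out before analysis. The following is a minimal sketch, assuming that each record carries a TCGA barcode whose first 12 characters are the case submitter ID. The function name `is_affected_dlbc` is hypothetical and is not part of any GDC tool.

```python
# Case submitter IDs copied from the known issue above.
AFFECTED_DLBC_CASES = frozenset({
    "TCGA-FF-8062", "TCGA-FM-8000", "TCGA-G8-6324", "TCGA-G8-6325",
    "TCGA-G8-6326", "TCGA-G8-6906", "TCGA-G8-6907", "TCGA-G8-6909",
    "TCGA-G8-6914", "TCGA-GR-7351", "TCGA-GR-7353",
})


def is_affected_dlbc(barcode: str) -> bool:
    """Return True if a TCGA barcode belongs to one of the affected cases.

    Assumption: the first 12 characters of the barcode are the case ID,
    for example "TCGA-FF-8062".
    """
    return barcode.strip().upper()[:12] in AFFECTED_DLBC_CASES
```

A caller could apply this to the sample barcode of each record and drop the matching WXS-derived entries until corrected files are released.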

Data Release 8.0

  • GDC Product: Data
  • Release Date: August 22, 2017

New updates

  1. Released updated miRNA quantification files to address double counting of some normalized counts described in DR7.0 release notes.

Updated files for this release are listed here. A complete list of files for DR8.0 is listed here.

Bugs Fixed Since Last Release

None

Known Issues and Workarounds

  • TARGET-NBL RNA-Seq files were run as single-ended even though they are derived from paired-end data. These files will be rerun through the GDC RNA-Seq pipelines in a later release. Impacted files can be found here. Downstream count files are also affected. Users may access original FASTQ files in the GDC Legacy Archive, which are not impacted by this issue.
  • Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
  • Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
  • The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
    • All microarray data and metadata
    • All sequencing analyzed data and metadata
    • 1180 of 12063 sequencing runs of raw data
  • Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
  • There are 11 cases in project TCGA-DLBC that are known to have incorrect WXS data in the GDC Data Portal. Impacted cases are listed below. This affects the BAMs and VCFs associated with these cases in the GDC Data Portal. Corrected BAMs can be found in the GDC Legacy Archive. Variants from affected aliquots appear in the protected MAFs with GDC_FILTER=ContEst to indicate a sample contamination problem, but are removed during the generation of the Somatic MAF file. In a later release we will supply corrected BAM, VCF, and MAF files for these cases. In the meantime, we advise you not to use any of the WXS files associated with these cases in the GDC Data Portal. A list of these files can be found here. Download list of affected files.
    • TCGA-FF-8062
    • TCGA-FM-8000
    • TCGA-G8-6324
    • TCGA-G8-6325
    • TCGA-G8-6326
    • TCGA-G8-6906
    • TCGA-G8-6907
    • TCGA-G8-6909
    • TCGA-G8-6914
    • TCGA-GR-7351
    • TCGA-GR-7353
  • Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
  • Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
  • BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
  • No data from TARGET-MDLS is available.
  • Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
  • SDF Files are not linked to Project or Case in the Legacy Archive
  • Two biotab files are not linked to Project or Case in the Legacy Archive
  • SDRF files are not linked to Project or Case in the Legacy Archive
  • Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
  • Tumor grade property is not populated
  • Progression_or_recurrence property is not populated
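
The sample-count discrepancy between public MAF files from different pipelines can be inspected by comparing the sets of sample barcodes in each file. The sketch below is illustrative only: it assumes a tab-delimited MAF with a `Tumor_Sample_Barcode` column and optional `#` comment lines at the top, and the helper names `samples_in_maf` and `samples_missing_from` are hypothetical.

```python
import csv


def samples_in_maf(handle):
    """Collect the distinct Tumor_Sample_Barcode values from an open MAF.

    Lines starting with "#" are treated as header comments and skipped.
    """
    rows = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row["Tumor_Sample_Barcode"] for row in reader}


def samples_missing_from(reference, other):
    """Return the sorted samples present in `reference` but not in `other`."""
    return sorted(reference - other)
```

Samples reported by `samples_missing_from` for one pipeline are candidates for having had no PASS variants in that pipeline; the protected MAF would be needed to confirm this.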

Data Release 7.0

  • GDC Product: Data
  • Release Date: June 29, 2017

New updates

  1. Updated public Mutation Annotation Format (MAF) files are now available. Updates include filtering to remove variants impacted by OxoG artifacts and those impacted by strand bias.
  2. Protected MAF files are updated to include flags for OxoG and strand bias.
  3. Annotated VCFs are updated to include flags for OxoG artifacts and strand bias.

Updated files for this release are listed here. A complete list of files for DR7.0 is listed here.

Bugs Fixed Since Last Release

None

Known Issues and Workarounds

  • TARGET-NBL RNA-Seq files were run as single-ended even though they are derived from paired-end data. These files will be rerun through the GDC RNA-Seq pipelines in a later release. Impacted files can be found here. Downstream count files are also affected. Users may access original FASTQ files in the GDC Legacy Archive, which are not impacted by this issue.
  • Reads that are mapped to multiple genomic locations are double counted in some of the GDC miRNA results. The GDC will release updated files correcting the issue in an upcoming release. The specific impacts are described further below:
    • Isoform Expression Quantification files
      • Raw read counts are accurate
      • Normalized counts are proportionally skewed (r^2=1.0)
    • miRNA Expression Quantification files
      • A small proportion of miRNA counts are overestimated (mean r^2=0.9999)
      • Normalized counts are proportionally skewed (mean r^2=0.9999)
    • miRNA BAM files
      • no impact
  • Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
  • Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
  • The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
    • All microarray data and metadata
    • All sequencing analyzed data and metadata
    • 1180 of 12063 sequencing runs of raw data
  • Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
  • There are 11 cases in project TCGA-DLBC that are known to have incorrect WXS data in the GDC Data Portal. Impacted cases are listed below. This affects the BAMs and VCFs associated with these cases in the GDC Data Portal. Corrected BAMs can be found in the GDC Legacy Archive. Variants from affected aliquots appear in the protected MAFs with GDC_FILTER=ContEst to indicate a sample contamination problem, but are removed during the generation of the Somatic MAF file. In a later release we will supply corrected BAM, VCF, and MAF files for these cases. In the meantime, we advise you not to use any of the WXS files associated with these cases in the GDC Data Portal. A list of these files can be found here. Download list of affected files.
    • TCGA-FF-8062
    • TCGA-FM-8000
    • TCGA-G8-6324
    • TCGA-G8-6325
    • TCGA-G8-6326
    • TCGA-G8-6906
    • TCGA-G8-6907
    • TCGA-G8-6909
    • TCGA-G8-6914
    • TCGA-GR-7351
    • TCGA-GR-7353
  • Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
  • Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
  • BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
  • No data from TARGET-MLDS is available.
  • Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
  • SDF Files are not linked to Project or Case in the Legacy Archive
  • Two biotab files are not linked to Project or Case in the Legacy Archive
  • SDRF files are not linked to Project or Case in the Legacy Archive
  • Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
  • Tumor grade property is not populated
  • Progression_or_recurrence property is not populated

Data Release 6.0

  • GDC Product: Data
  • Release Date: May 9, 2017

New updates

  1. GDC updated public Mutation Annotation Format (MAF) files are now available. Updates include leveraging the MC3 variant filtering strategy, which results in more variants being recovered relative to the previous version. A detailed description of the new format can be found here.
  2. Protected MAFs are updated to include additional variant annotation information
  3. Some MuTect2 VCFs updated to include dbSNP and COSMIC annotations found in other VCFs

Updated files for this release are listed here.

Bugs Fixed Since Last Release

None

Known Issues and Workarounds

  • There are 11 cases in project TCGA-DLBC that are known to have incorrect WXS data in the GDC Data Portal. Impacted cases are listed below. This affects the BAMs and VCFs associated with these cases in the GDC Data Portal. Corrected BAMs can be found in the GDC Legacy Archive. Variants from affected aliquots appear in the protected MAFs with GDC_FILTER=ContEst to indicate a sample contamination problem, but are removed during the generation of the Somatic MAF file. In a later release we will supply corrected BAM, VCF, and MAF files for these cases. In the meantime, we advise you not to use any of the WXS files associated with these cases in the GDC Data Portal. A list of these files can be found here. Download list of affected files.
    • TCGA-FF-8062
    • TCGA-FM-8000
    • TCGA-G8-6324
    • TCGA-G8-6325
    • TCGA-G8-6326
    • TCGA-G8-6906
    • TCGA-G8-6907
    • TCGA-G8-6909
    • TCGA-G8-6914
    • TCGA-GR-7351
    • TCGA-GR-7353
  • Variants found in VCF and MAF files may contain OxoG artifacts, which are produced during library preparation and may result in apparent substitutions of C to A or G to T in certain sequence contexts. In the future we plan to label potential OxoG artifacts in the MAF files.
  • Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
  • Some validated somatic mutations may not be present in open-access MAF files. Please review the protected MAF files in the GDC Data Portal if you are unable to find your mutation in the open-access files.
  • Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
  • BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
  • No data from TARGET-MLDS is available.
  • Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
  • SDF Files are not linked to Project or Case in the Legacy Archive
  • Two biotab files are not linked to Project or Case in the Legacy Archive
  • SDRF files are not linked to Project or Case in the Legacy Archive
  • Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
  • Tumor grade property is not populated
  • Progression_or_recurrence property is not populated

Details are provided in Data Release Manifest

Data Release 5.0

  • GDC Product: Data
  • Release Date: March 16, 2017

New updates

  1. Additional annotations from TCGA DCC are available
    • Complete list of updated TCGA files is found here
  2. Clinical data added for TARGET ALL P1 and P2
  3. Pathology reports now have submitter IDs as assigned by the BCR
  4. TARGET Data refresh
    • Most recent biospecimen and clinical information from the TARGET DCC. Newly imported files are listed here
    • Updated indexed biospecimen and clinical metadata
    • Updated SRA XML files
    • Does not include updates to TARGET NBL

Bugs Fixed Since Last Release

  1. Missing cases from TCGA-LAML were added to Legacy Archive
  2. Biotab files are now linked to Projects and Cases in Legacy Archive

Known Issues and Workarounds

  • Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
  • Some validated somatic mutations may not be present in open-access MAF files. When creating open-access MAF files from the protected versions we are extremely conservative in removing potential germline variants. Our approach is to remove all mutations that are present in dbSNP. In a subsequent release we will provide updated open-access MAF files, which preserve variants found in MC3 or a TCGA validation study. Please review the protected MAF files in the GDC Data Portal if you are unable to find your mutation in the open-access files.
  • Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
  • BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
  • MAF Column #109 "FILTER" entries are separated by both commas and semi-colons.
  • TARGET-AML is undergoing reorganization. Pending reorganization, cases from this project may not contain many clinical, biospecimen, or genomic data files.
  • No data from TARGET-MLDS is available.
  • Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
  • SDF Files are not linked to Project or Case in the Legacy Archive
  • Two biotab files are not linked to Project or Case in the Legacy Archive
  • SDRF files are not linked to Project or Case in the Legacy Archive
  • Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
  • Tumor grade property is not populated
  • Progression_or_recurrence property is not populated
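
Because FILTER entries in MAF column #109 mix commas and semicolons, a parser needs to split on both. A minimal sketch follows; the function name `split_filter_field` is hypothetical, and the behaviour of dropping empty fragments is an assumption about what a consumer would want rather than anything the GDC specifies.

```python
import re
from typing import List


def split_filter_field(value: str) -> List[str]:
    """Split a MAF FILTER entry on both commas and semicolons.

    Surrounding whitespace is stripped and empty fragments are dropped.
    """
    parts = re.split(r"[;,]", value)
    return [part.strip() for part in parts if part.strip()]
```

For example, `split_filter_field("a;b,c")` yields the three separate flags `a`, `b`, and `c`, while a plain `PASS` entry comes back as a single-element list.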

Details are provided in Data Release Manifest

Data Release 4.0

  • GDC Product: Data
  • Release Date: October 31, 2016

New updates

  1. TARGET ALL P1 and P2 biospecimen and molecular data are now available in the Legacy Archive. Clinical data will be available in a later release.
  2. Methylation data from 27k/450k Arrays has been lifted over to hg38 and is now available in the GDC Data Portal
  3. Public MAF files are now available for VarScan2, MuSE, and SomaticSniper. MuTect2 MAFs were made available in a previous release.
  4. Updated VCFs and MAF files are available for the MuTect2 pipeline to compensate for WGA-related false positive indels. See additional information on that change here. A listing of replaced files is provided here.
  5. Added submitter_id for Pathology Reports in Legacy Archive

Bugs Fixed Since Last Release

  • None

Known Issues and Workarounds

  • Some validated somatic mutations may not be present in open-access MAF files. When creating open-access MAF files from the protected versions we are extremely conservative in removing potential germline variants. Our approach is to remove all mutations that are present in dbSNP. In a subsequent release we will provide updated open-access MAF files, which preserve variants found in COSMIC or a TCGA validation study. Please review the protected MAF files in the GDC Data Portal if you are unable to find your mutation in the open-access files.
  • Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
  • BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
  • MAF Column #109 "FILTER" entries are separated by both commas and semi-colons.
  • TARGET-AML is undergoing reorganization. Pending reorganization, cases from this project may not contain many clinical, biospecimen, or genomic data files.
  • No data from TARGET-MLDS is available.
  • Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
  • SDF Files are not linked to Project or Case in the Legacy Archive
  • There are 200 cases from TCGA-LAML that do not appear in the Legacy Archive
  • Biotab files are not linked to Project or Case in the Legacy Archive
  • SDRF files are not linked to Project or Case in the Legacy Archive
  • Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg

Details are provided in Data Release Manifest

Data Release 3.0

  • GDC Product: Data
  • Release Date: September 16, 2016

New updates

  1. CCLE data now available (in the Legacy Archive only)
  2. BMI calculation is corrected
  3. Slide is now categorized as a Biospecimen entity

Bugs Fixed Since Last Release

  • BMI calculation is corrected

Known Issues and Workarounds

  • Insertions called for tumor samples that underwent whole genome amplification may be of lower quality. Whether a sample underwent this process can be found in the analyte_type property within analyte and aliquot. The TCGA analyte type can also be identified from the 20th character of the TCGA barcode, where "W" corresponds to WGA.
  • BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
  • Public MAFs (those with germline variants removed) are only available for the MuTect2 pipeline. MAFs for other pipelines are forthcoming.
  • MAF Column #109 "FILTER" entries are separated by both commas and semi-colons.
  • TARGET-AML and TARGET-ALL projects are undergoing reorganization. Pending reorganization, cases from these projects may not contain many clinical, biospecimen, or genomic data files.
  • No data from TARGET-PPTP is available.
  • Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
  • SDF Files are not linked to Project or Case in the Legacy Archive
  • There are 200 cases from TCGA-LAML that do not appear in the Legacy Archive
  • Biotab files are not linked to Project or Case in the Legacy Archive
  • SDRF files are not linked to Project or Case in the Legacy Archive
  • Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
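
The barcode rule for whole genome amplification noted above (a "W" in the 20th character of the TCGA barcode) can be checked with a one-line test. This is a sketch under that stated rule only; the function name `is_wga_aliquot` is hypothetical, and the analyte_type property remains the authoritative source.

```python
def is_wga_aliquot(barcode: str) -> bool:
    """Return True if the 20th character of a TCGA barcode is "W".

    Barcodes shorter than 20 characters carry no analyte code at that
    position, so they return False.
    """
    return len(barcode) >= 20 and barcode[19].upper() == "W"
```

Variant calls from aliquots flagged this way could then be handled with extra caution when reviewing insertions.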

Details are provided in Data Release Manifest

Data Release 2.0

  • GDC Product: Data
  • Release Date: August 9, 2016

New updates

  1. Additional data, previously available via CGHub and the TCGA DCC, is now available in the GDC
  2. Better linking between files and their associated projects and cases in the Legacy Archive
  3. MAF files are now available in the GDC Data Portal

Known Issues and Workarounds

  • Insertions called for tumor samples that underwent whole genome amplification may be of lower quality. These are present in VCF and MAF files produced by the MuTect2 variant calling pipeline. Whether a sample underwent whole genome amplification can be found in the analyte_type property within analyte and aliquot. The TCGA analyte type can also be identified from the 20th character of the TCGA barcode, where "W" corresponds to WGA.
  • BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
  • Public MAFs (those with germline variants removed) are only available for the MuTect2 pipeline. MAFs for other pipelines are forthcoming.
  • MAF Column #109 "FILTER" entries are separated by both commas and semi-colons.
  • TARGET-AML and TARGET-ALL projects are undergoing reorganization. Pending reorganization, cases from these projects may not contain many clinical, biospecimen, or genomic data files.
  • No data from TARGET-PPTP is available.
  • Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
  • SDF Files are not linked to Project or Case in the Legacy Archive
  • There are 200 cases from TCGA-LAML that do not appear in the Legacy Archive
  • Biotab files are not linked to Project or Case in the Legacy Archive
  • SDRF files are not linked to Project or Case in the Legacy Archive
  • Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg

Details are provided in Data Release Manifest

Initial Data Release (1.0)

  • GDC Product: Data
  • Release Date: June 6, 2016

Available Program Data

  • The Cancer Genome Atlas (TCGA)
  • Therapeutically Applicable Research To Generate Effective Treatments (TARGET)

Available Harmonized Data

  • WXS
    • Co-cleaned BAM files aligned to GRCh38 using BWA
  • mRNA-Seq
    • BAM files aligned to GRCh38 using STAR 2-pass strategy
    • Expression quantification using HTSeq
  • miRNA-Seq
    • BAM files aligned to GRCh38 using BWA aln
    • Expression quantification using BCCA miRNA Profiling Pipeline*
  • Genotyping Array
    • CNV segmentation data

Known Issues and Workarounds

  • BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
  • All legacy files for TCGA are available in the GDC Legacy Archive, but not always linked back to cases depending on available metadata.
  • Public MAFs (those with germline variants removed) are only available for the MuTect2 pipeline. MAFs for other pipelines are forthcoming.
  • TARGET-AML and TARGET-ALL projects are undergoing reorganization. Pending reorganization, cases from these projects may not contain many clinical, biospecimen, or genomic data files.
  • No data from TARGET-PPTP is available.
  • Legacy data not available in harmonized form:
    • Annotated VCF files from TARGET, anticipated in future data release
    • TCGA data that failed harmonization or QC or have been newly updated in CGHub: ~1.0% of WXS aliquots, ~1.6% of RNA-Seq aliquots
    • TARGET data that failed harmonization or QC, have been newly updated in CGHub, or whose project names are undergoing reorganization: ~76% of WXS aliquots, ~49% of RNA-Seq aliquots, ~57% of miRNA-Seq.
  • MAF Column #109 "FILTER" entries are separated by both commas and semi-colons.
  • MAFs are not yet available for query or search in the GDC Data Portal or API. You may download these files using the following manifests, which can be passed directly to the Data Transfer Tool. Links for the open-access TCGA MAFs are provided below for downloading individual files.

Details are provided in Data Release Manifest

Download Open-access MAF files

TCGA.ACC.mutect.somatic.maf.gz
TCGA.BLCA.mutect.somatic.maf.gz
TCGA.BRCA.mutect.somatic.maf.gz
TCGA.CESC.mutect.somatic.maf.gz
TCGA.CHOL.mutect.somatic.maf.gz
TCGA.COAD.mutect.somatic.maf.gz
TCGA.DLBC.mutect.somatic.maf.gz
TCGA.ESCA.mutect.somatic.maf.gz
TCGA.GBM.mutect.somatic.maf.gz
TCGA.HNSC.mutect.somatic.maf.gz
TCGA.KICH.mutect.somatic.maf.gz
TCGA.KIRC.mutect.somatic.maf.gz
TCGA.KIRP.mutect.somatic.maf.gz
TCGA.LAML.mutect.somatic.maf.gz
TCGA.LGG.mutect.somatic.maf.gz
TCGA.LIHC.mutect.somatic.maf.gz
TCGA.LUAD.mutect.somatic.maf.gz
TCGA.LUSC.mutect.somatic.maf.gz
TCGA.MESO.mutect.somatic.maf.gz
TCGA.OV.mutect.somatic.maf.gz
TCGA.PAAD.mutect.somatic.maf.gz
TCGA.PCPG.mutect.somatic.maf.gz
TCGA.PRAD.mutect.somatic.maf.gz
TCGA.READ.mutect.somatic.maf.gz
TCGA.SARC.mutect.somatic.maf.gz
TCGA.SKCM.mutect.somatic.maf.gz
TCGA.STAD.mutect.somatic.maf.gz
TCGA.TGCT.mutect.somatic.maf.gz
TCGA.THCA.mutect.somatic.maf.gz
TCGA.THYM.mutect.somatic.maf.gz
TCGA.UCEC.mutect.somatic.maf.gz
TCGA.UCS.mutect.somatic.maf.gz
TCGA.UVM.mutect.somatic.maf.gz