Data Release Notes

Version	Date
v45.0	December 4, 2025
v44.0	October 29, 2025
v43.0	May 7, 2025
v42.0	January 30, 2025
v41.0	August 28, 2024
v40.0	March 29, 2024
v39.0	December 4, 2023
v38.0	August 31, 2023
v37.0	March 29, 2023
v36.0	December 12, 2022
v35.0	September 28, 2022
v34.0	July 27, 2022
v33.1	May 31, 2022
v33.0	May 3, 2022
v32.0	March 29, 2022
v31.0	October 29, 2021
v30.0	September 23, 2021
v29.0	March 31, 2021
v28.0	February 2, 2021
v27.0-fix	November 9, 2020
v27.0	October 29, 2020
v26.0	September 8, 2020
v25.0	July 22, 2020
v24.0	May 7, 2020
v23.0	April 7, 2020
v22.0	January 16, 2020
v21.0	December 10, 2019
v20.0	November 11, 2019
v19.1	November 6, 2019
v19.0	September 17, 2019
v18.0	July 8, 2019
v17.1	June 12, 2019
v17.0	June 5, 2019
v16.0	March 26, 2019
v15.0	February 20, 2019
v14.0	December 18, 2018
v13.0	September 27, 2018
v12.0	June 13, 2018
v11.0	May 21, 2018
v10.1	February 15, 2018
v10.0	December 21, 2017
v9.0	October 24, 2017
v8.0	August 22, 2017
v7.0	June 29, 2017
v6.0	May 9, 2017
v5.0	March 16, 2017
v4.0	October 31, 2016
v3.0	September 16, 2016
v2.0	August 9, 2016
v1.0	June 6, 2016

Data Release 45.0

GDC Product: Data
Release Date: December 4, 2025

New Updates

New Projects
- ALCHEMIST-ALCH: Adjuvant Lung Cancer Enrichment Marker Identification and Sequencing Trial (dbGaP phs001140)
  - Includes data from WXS, WGS, RNA-Seq, miRNA-Seq, and slide images
- CCG-CUPP: Center for Cancer Genomics Cancers of Unknown Primary Project (dbGaP phs001801)
  - Includes data from WXS, WGS, RNA-Seq, miRNA-Seq, and methylation array
- RC-PTCL: Refractory Cancers - Peripheral T-Cell Lymphoma (dbGaP phs002097)
  - Includes data from WXS, WGS, RNA-Seq, miRNA-Seq, and methylation array
New Data Sets
- All of the CGCI-BLGSP slide images that were in JPEG2000 format are now available in SVS format
- One case from HCMI-CMDC, HCM-BROD-1181-C73, was released along with its associated files
Data Updates
- The remainder of the tumor purity and tumor ploidy values have been migrated to the correct copy number segment file from the AscatNGS pipeline.

A complete list of files included in the GDC Data Portal can be found below:

Bugs Fixed Since Last Release

The subset of GATK4 MuTect2 raw VCF files that was not replaced in DR44 has been replaced in this release.
All RNA-Seq files from TCGA-OV are now marked according to their correct access level.
Two WGS alignments from APOLLO-OV that were previously unavailable are now available on the data portal.

Known Issues and Workarounds

None of the annotated WGS VCFs from GATK4 MuTect2 contain variants from chromosomes 10 and 20. However, their corresponding raw VCFs do contain variants from these two chromosomes. This will be addressed in a future data release.
146 cases from the CCDI-MCI project do not have diagnosis information associated with them. This data will be added in a future release.
Some VCF headers from SvABA list the names of the BAM files they originated from instead of "NORMAL" and "TUMOR", in that order.
The slide image viewer does not display for any non-TCGA slides. At this time, these slides will need to be downloaded and viewed locally. Additionally, the slide image viewer does not display properly for 14 TCGA slides, which are identified here.
397 alignments from the TCGA program were found to have contamination values over 0.04 (alignment list). The ensemble MAFs produced by these alignments were removed from the Data Portal.
One methylation aliquot from the TCGA-COAD project, TCGA-D5-6930-01A-11D-1926-05, was not added to the portal and will be added in a future release.
Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
No data from TARGET-MDLS is available.
TCGA Projects
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from the GDC Data Portal
- Some TCGA annotations are unavailable in the Data Portal. These annotations can be found here.
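
As a quick check for the chromosome 10 and 20 issue noted above, the following sketch reports which expected chromosomes have no records in a VCF. It assumes GRCh38-style contig names with a chr prefix (chr10, chr20). The function names are illustrative and are not part of any GDC tool.

```python
import gzip
from typing import Iterable, List, Set


def open_vcf(path: str):
    """Open a plain or gzip-compressed VCF as text (helper for local files)."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def contigs_with_variants(lines: Iterable[str]) -> Set[str]:
    """Collect the CHROM value of every data line (header lines are skipped)."""
    seen = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        seen.add(line.split("\t", 1)[0])
    return seen


def missing_contigs(lines: Iterable[str],
                    expected=("chr10", "chr20")) -> List[str]:
    """Return the expected contigs that have no variant records."""
    seen = contigs_with_variants(lines)
    return [contig for contig in expected if contig not in seen]
```

For a downloaded file, `missing_contigs(open_vcf(path))` returns an empty list when both chromosomes have at least one record. An absent chromosome is only a hint: a small call set could legitimately have no variants on it.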

Data Release 44.0

GDC Product: Data
Release Date: October 29, 2025

New Updates

New Projects
- APOLLO-OV: APOLLO2: Proteogenomic Characterization of Ovarian Serous Cystadenocarcinoma (dbGaP phs003488)
  - Includes RNA-Seq and WGS data
- CCDI-MCI: Childhood Cancer Data Initiative (CCDI) Molecular Characterization Initiative (MCI) (dbGaP phs002790)
  - Includes Methylation and WXS alignments
New Data Sets
- New WGS variants from GATK4 MuTect2 and VarScan2 are available
- 175 new cases from HCMI-CMDC have been released
- 338 new cases from CPTAC-3 have been released
- New RNA-Seq aliquots from:
  - MATCH-P (1 aliquot)
  - TCGA-LUAD (1 aliquot)
  - TCGA-OV (5 aliquots)
Data Updates
- Two partial slide images (from TCGA-BRCA and TCGA-KIRP) were restored as full images
- 250 slide images from TCGA-TGCT have been added

A complete list of files included in the GDC Data Portal can be found below:

Bugs Fixed Since Last Release

Most GATK4 MuTect2 raw VCF files have been replaced by new versions that do contain variants from chromosomes 10 and 20. A subset were not replaced and can be found in this manifest. They will be versioned in a future release.
The GATK4 MuTect2 annotated VCF files that appeared on the portal but were not available for download no longer appear on the portal.
410 TCGA-TGCT slide images have been added to restore the ones that were erroneously deleted.

Known Issues and Workarounds

One RNA-Seq file from TCGA-OV (Data Type: Gene Expression Quantification) is marked as controlled-access and is not downloadable from the portal without dbGaP access. Click here to download it as an open-access file.
146 cases from the CCDI-MCI project do not have diagnosis information associated with them. This data will be added in a future release.
Some VCF headers from SvABA list the names of the BAM files they originated from instead of "NORMAL" and "TUMOR", in that order.
The slide image viewer does not display for any non-TCGA slides. At this time, these slides will need to be downloaded and viewed locally. Additionally, the slide image viewer does not display properly for 14 TCGA slides, which are identified here.
397 alignments from the TCGA program were found to have contamination values over 0.04 (alignment list). The ensemble MAFs produced by these alignments were removed from the Data Portal.
One methylation aliquot from the TCGA-COAD project, TCGA-D5-6930-01A-11D-1926-05, was not added to the portal and will be added in a future release.
Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
No data from TARGET-MDLS is available.
TCGA Projects
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from the GDC Data Portal
- Some TCGA annotations are unavailable in the Data Portal. These annotations can be found here.
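
For the miRNA isoform quantification issue noted above, a minimal sketch of converting a reported exclusive end coordinate to an inclusive one by subtracting 1. It assumes the coordinate string has the form build:chromosome:start-end:strand (for example hg38:chr9:94175962-94175984:+). That layout and the function name are assumptions for illustration and should be checked against the actual files.

```python
import re

# Assumed layout of the coordinate string: build:chrom:start-end:strand
_COORD = re.compile(
    r"^(?P<build>[^:]+):(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+):(?P<strand>[+-])$"
)


def inclusive_coords(isoform_coords: str) -> str:
    """Return the same coordinate string with the end shifted from exclusive to inclusive."""
    match = _COORD.match(isoform_coords)
    if match is None:
        raise ValueError(f"unrecognized coordinate string: {isoform_coords!r}")
    end = int(match["end"]) - 1  # reported end is exclusive, i.e. one past the last base
    return f'{match["build"]}:{match["chrom"]}:{match["start"]}-{end}:{match["strand"]}'
```

Only the reported coordinate changes. Read counts and other quantification columns are unaffected, consistent with the note above.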

Data Release 43.0

GDC Product: Data
Release Date: May 7, 2025

New Updates

New Data Sets
- Release of new WGS variant calling workflows for existing WGS tumor normal pairs. This includes data from the new WGS VarScan2 pipeline.
  - Note: Variants were released for aliquot pairs on a per-pipeline basis, so an aliquot pair may have data from some pipelines but not others. The complete set will be made available in future releases.
- 351 new cases from HCMI have been released along with their associated data files
Data Updates
- The AscatNGS files that were produced using the Sanger WGS pipeline now have a Data Type value of Allele-specific Copy Number Segment
- Alignments that were produced before the BAM QC pipeline was implemented now have their BAM QC values populated
- Clinical data for the CPTAC-3 HNSCC cases has been updated and these changes are reflected in the Data Portal and API

A complete list of files included in the GDC Data Portal can be found below:

Bugs Fixed Since Last Release

Survival plots are generated from the diagnoses.days_to_last_follow_up field. For some TCGA projects, in Data Release 42, data was migrated to the follow_ups.days_to_follow_up field. This data has been migrated back to diagnoses.days_to_last_follow_up and should now appear in the survival plots.
The slide image TCGA-A8-A06U-01A-01-TS1.63824040-373f-4c6c-a74e-881c127567a6.svs only contained a partial slide. Though it is still present in the data portal, the replacement slide TCGA-A8-A06U-01A-01-TS1.DFD8B445-C7A5-4247-A39F-222591C6D7E2.svs has been added to the portal.

Known Issues and Workarounds

A subset of the GATK4 MuTect2 annotated VCF files that were missing variants from chromosomes 10 and 20 appear on the portal but are not available for download. Replacements for these files will be made available in a future release. See the attached list.
- Note that if you add one of these files to your cart, along with available files you do have access to, the download will not work and will display an error: "You don't have access to this resource: Requested file does not allow read access". Removing the unavailable files will resolve this issue.
VCF files that were produced by GATK4 MuTect2 are missing all variants from chromosomes 10 and 20. New VCF files with a complete set of chromosomes will be available in a future data release. This includes any file with the following workflow types: GATK4 MuTect2, GATK4 MuTect2 Annotation, GATK4 MuTect2 Tumor-Only, and GATK4 MuTect2 Tumor-Only Annotation.
Some VCF headers from SvABA list the names of the BAM files they originated from instead of "NORMAL" and "TUMOR", in that order.
The slide image viewer does not display for any non-TCGA slides. At this time, these slides will need to be downloaded and viewed locally. Additionally, the slide image viewer does not display properly for 14 TCGA slides, which are identified here.
397 alignments from the TCGA program were found to have contamination values over 0.04 (alignment list). The ensemble MAFs produced by these alignments were removed from the Data Portal.
One methylation aliquot from the TCGA-COAD project, TCGA-D5-6930-01A-11D-1926-05, was not added to the portal and will be added in a future release.
Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
No data from TARGET-MDLS is available.
TCGA Projects
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from the GDC Data Portal
- Some TCGA annotations are unavailable in the Data Portal. These annotations can be found here.
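
For the SvABA header issue noted above, one possible local workaround is to relabel the two sample columns of the #CHROM line. This sketch assumes the file has exactly two sample columns, in NORMAL then TUMOR order as the note describes. The function name is hypothetical, and the column order should be confirmed against the file's metadata before relying on it.

```python
from typing import Sequence

# The nine fixed VCF columns that precede the per-sample columns
# (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT).
FIXED_COLUMNS = 9


def relabel_samples(header_line: str,
                    labels: Sequence[str] = ("NORMAL", "TUMOR")) -> str:
    """Replace the sample names on a VCF #CHROM line with the given labels."""
    fields = header_line.rstrip("\n").split("\t")
    if fields[0] != "#CHROM":
        raise ValueError("expected the #CHROM column header line")
    samples = fields[FIXED_COLUMNS:]
    if len(samples) != len(labels):
        raise ValueError(f"expected {len(labels)} sample columns, found {len(samples)}")
    return "\t".join(fields[:FIXED_COLUMNS] + list(labels))
```

The returned line has no trailing newline, so callers writing a new file need to add one.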

Data Release 42.0

GDC Product: Data
Release Date: January 30, 2025

New Updates

New Data Sets
- Release of new WGS variant calling workflows for existing WGS tumor normal pairs. See the documentation on WGS variant calling for more details on the available files. This includes data from the following pipelines:
  - GATK4 MuTect2 - SNVs (raw and annotated VCFs)
  - SvABA Indel - SNVs (raw and annotated VCFs)
  - SvABA - Structural variants (VCF and BEDPE)
  - Manta - Structural variants (VCF and BEDPE)
  - GATK4 CNV - Copy number segments and auxiliary files
  - Note: Variants were released for aliquot pairs on a per-pipeline basis, so an aliquot pair may have data from some pipelines but not others. The complete set will be made available in future releases.
- TCGA WGS alignments
  - TCGA-BLCA - 9 alignments
  - TCGA-BRCA - 1 alignment
  - TCGA-COAD - 54 alignments
  - TCGA-HNSC - 5 alignments
  - TCGA-KIRC - 837 alignments
  - TCGA-KIRP - 6 alignments
  - TCGA-LGG - 2 alignments
  - TCGA-LUAD - 175 alignments
  - TCGA-LUSC - 12 alignments
  - TCGA-OV - 8 alignments
  - TCGA-SKCM - 624 alignments
- New clinical data for all TCGA projects: Fields that were previously only available in clinical supplement files are now queryable and downloadable through the GDC Data Portal and API
- TCGA-GBM miRNA-Seq - 265 aliquots
- TCGA-LUSC miRNA-Seq - 10 aliquots
- TCGA-OV miRNA-Seq - 76 aliquots
- TARGET-AML RNA-Seq - 46 aliquots
- TCGA-GBM RNA-Seq - 215 aliquots
- TCGA-LUSC RNA-Seq - 9 aliquots
- TARGET-NBL WGS - 146 raw CGI variants released
- New versions of 31 CTSP-DLBCL1 clinical supplements
Data Updates
- Tumor purity and tumor ploidy properties were migrated from the aligned reads node to the copy number segment node

A complete list of files included in the GDC Data Portal can be found below:

Bugs Fixed Since Last Release

Pathology reports now have associated case/biospecimen information in the portal.
Several bugs were fixed with the inclusion of TCGA clinical data in the API.

Known Issues and Workarounds

Some VCF headers from SvABA list the names of the BAM files they originated from instead of "NORMAL" and "TUMOR", in that order.
The slide image viewer does not display for any non-TCGA slides. At this time, these slides will need to be downloaded and viewed locally. Additionally, the slide image viewer does not display properly for 14 TCGA slides, which are identified here.
397 alignments from the TCGA program were found to have contamination values over 0.04 (alignment list). The ensemble MAFs produced by these alignments were removed from the Data Portal.
One methylation aliquot from the TCGA-COAD project, TCGA-D5-6930-01A-11D-1926-05, was not added to the portal and will be added in a future release.
Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
No data from TARGET-MDLS is available.
TCGA Projects
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from the GDC Data Portal
- Some TCGA annotations are unavailable in the Data Portal. These annotations can be found here.
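
For the duplicated variants in tumor-only annotated VCFs noted above, a minimal sketch that keeps the first occurrence of each record. It assumes a duplicate is a record with the same CHROM, POS, REF, and ALT as an earlier one. The notes do not define duplicates more precisely, so this key is an assumption.

```python
from typing import Iterable, Iterator


def drop_duplicate_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield VCF lines, skipping data records whose (CHROM, POS, REF, ALT) was already seen."""
    seen = set()
    for line in lines:
        fields = line.split("\t")
        # Header lines and anything too short to be a data record pass through unchanged.
        if line.startswith("#") or len(fields) < 5:
            yield line
            continue
        key = (fields[0], fields[1], fields[3], fields[4])
        if key in seen:
            continue
        seen.add(key)
        yield line
```

Record order is preserved, and annotations on the later copies are discarded rather than merged.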

DR42 Data Notes

Survival Plot:
- Survival plots are generated from the diagnoses.days_to_last_follow_up field. For some TCGA projects, in Data Release 42, data was migrated to the follow_ups.days_to_follow_up field. This resulted in an issue with missing cases for some TCGA projects in survival plots. The GDC is actively working on a fix. In the interim, users should create survival plots using the greatest value in the follow_ups.days_to_follow_up field.
GATK4 MuTect2 VCF Files:
- VCF files that were produced by GATK4 MuTect2 are missing all variants from chromosomes 10 and 20. New VCF files with a complete set of chromosomes will be available in a future data release. This includes any file with the following workflow types: GATK4 MuTect2, GATK4 MuTect2 Annotation, GATK4 MuTect2 Tumor-Only, and GATK4 MuTect2 Tumor-Only Annotation.
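
The interim survival plot workaround described above (use the greatest follow_ups.days_to_follow_up value for each case) can be sketched as follows. The input is assumed to be a case record with a follow_ups list of dictionaries. That shape and the function name are assumptions for illustration, not a documented response format.

```python
from typing import Optional


def greatest_days_to_follow_up(case: dict) -> Optional[int]:
    """Return the largest days_to_follow_up across a case's follow-ups, or None if none is set."""
    values = [
        follow_up.get("days_to_follow_up")
        for follow_up in case.get("follow_ups", [])
    ]
    values = [value for value in values if value is not None]
    return max(values) if values else None
```

Cases that return None have no usable follow-up time and would need to be excluded from the plot or handled separately.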

Data Release 41.0

GDC Product: Data
Release Date: August 28, 2024

New Updates

New Projects
- MATCH-C1
  - 11 cases
  - WXS, RNA-Seq
- MATCH-P
  - 28 cases
  - WXS, RNA-Seq
- MATCH-Z1B
  - 29 cases
  - WXS, RNA-Seq
New Cases from Existing Projects
- CPTAC-3 - 31 cases
New Data Sets
- TARGET-AML Tumor-Only Targeted Sequencing - 163 variant call sets
- TCGA U133 Submitted Expression Arrays
  - TCGA-GBM - 560 aliquots
  - TCGA-LAML - 183 aliquots
  - TCGA-LUSC - 135 aliquots
  - TCGA-OV - 548 aliquots
- TCGA-LUAD Methylation Data - 53 aliquots
- CDDP_EAGLE-1 Slide Images - 49 cases
- HCMI-CMDC
  - Tumor-Only WGS Data - 2 aliquot BAMs, 2 variant call sets
  - Tumor-Only WXS Data - 3 aliquot BAMs, 3 variant call sets
  - Updated clinical supplements
- BEATAML1.0-COHORT scRNA-Seq Data - 8 aliquots
Data Updates
- Indexing of ABSOLUTE Liftover copy number variation data
- Release of data for Other Clinical Attribute clinical entities
- The platform field is now populated for harmonized data files and can be used as a filter in the Repository

A complete list of files included in the GDC Data Portal can be found below:

Bugs Fixed Since Last Release

Fixed 4 TARGET-NBL gene expression sets that pointed to multiple cases/aliquots
Fixed multiple expression files per aliquot for several TARGET-AML RNA-Seq aliquots

Known Issues and Workarounds

The slide image viewer does not display for any non-TCGA slides. At this time, these slides will need to be downloaded and viewed locally. Additionally, the slide image viewer does not display properly for 14 TCGA slides, which are identified here.
Pathology reports do not have any associated case/biospecimen information in the portal. This information can be found in the reports themselves.
397 alignments from the TCGA program were found to have contamination values over 0.04 (alignment list). The ensemble MAFs produced by these alignments were removed from the Data Portal.
One methylation aliquot from the TCGA-COAD project, TCGA-D5-6930-01A-11D-1926-05, was not added to the portal and will be added in a future release.
Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
No data from TARGET-MDLS is available.
TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from the GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Data Portal. These annotations can be found here.
- Tumor_grade property is not populated
- Progression_or_recurrence property is not populated

Data Release 40.0

GDC Product: Data
Release Date: March 29, 2024

New Updates

New Projects
- MATCH-R - Genomic Characterization CS-MATCH-0007 Arm R - phs002029
  - 28 cases
  - WXS, RNA-Seq
- MATCH-S1 - Genomic Characterization CS-MATCH-0007 Arm S1 - phs002153
  - 41 cases
  - WXS, RNA-Seq
- MATCH-S2 - Genomic Characterization CS-MATCH-0007 Arm S2 - phs002178
  - 3 cases
  - WXS, RNA-Seq
- MATCH-Z1I - Genomic Characterization CS-MATCH-0007 Arm Z1I - phs002058
  - 26 cases
  - WXS, RNA-Seq
New Cases from Existing Projects
- CPTAC-3 - 79 cases
- REBC-THYR - 9 cases
New Data Sets
- Targeted Sequencing
  - TARGET-AML - 1,596 aliquot BAMs, 769 variant calls
  - TARGET-NBL - 998 aliquot BAMs, 476 variant calls
  - TARGET-OS - 233 aliquot BAMs, 65 variant calls
- TCGA WGS
  - 57 alignments
  - 486 variant call aliquot pairs
- REBC-THYR
  - WGS - 90 aliquot BAMs, 69 variant calls
  - miRNA-Seq - 177 aliquots
  - RNA-Seq - 78 aliquots
  - RNA-Seq - Addition of STAR-Fusion data to existing aliquots
- HCMI-CMDC
  - Slide images for released cases
  - Updated clinical supplements
- TCGA-GBM
  - miRNA-Seq - 8 aliquots
  - RNA-Seq - 1 aliquot

A complete list of files included in the GDC Data Portal can be found below:

Bugs Fixed Since Last Release

None

Known Issues and Workarounds

The slide image viewer does not display for any non-TCGA slides. At this time, these slides will need to be downloaded and viewed locally. Additionally, the slide image viewer does not display properly for 14 TCGA slides, which are identified here.
Pathology reports do not have any associated case/biospecimen information in the portal. This information can be found in the reports themselves.
397 alignments from the TCGA program were found to have contamination values over 0.04 (alignment list). The ensemble MAFs produced by these alignments were removed from the Data Portal.
One methylation aliquot from the TCGA-COAD project, TCGA-D5-6930-01A-11D-1926-05, was not added to the portal and will be added in a future release.
Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
No data from TARGET-MDLS is available.
TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from the GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Data Portal. These annotations can be found here.
- Tumor_grade property is not populated
- Progression_or_recurrence property is not populated

Data Release 39.0

GDC Product: Data
Release Date: December 4, 2023

New Updates

New Projects
- MATCH-H - Genomic Characterization CS-MATCH-0007 Arm H - phs001888
  - 21 cases
  - WXS, RNA-Seq
- MATCH-I - Genomic Characterization CS-MATCH-0007 Arm I - phs002181
  - 60 cases
  - WXS, RNA-Seq
- MATCH-U - Genomic Characterization CS-MATCH-0007 Arm U - phs002179
  - 23 cases
  - WXS, RNA-Seq
- MATCH-W - Genomic Characterization CS-MATCH-0007 Arm W - phs001948
  - 45 cases
  - WXS, RNA-Seq
- MATCH-Z1A - Genomic Characterization CS-MATCH-0007 Arm Z1A - phs001973
  - 45 cases
  - WXS, RNA-Seq
New Cases from Existing Projects
- HCMI-CMDC - 19 cases
New Data Sets
- 6,957 WGS alignments from the TCGA program
- 1,002 sets of WGS variants from TCGA
- MP2PRT-ALL: WXS and RNA-Seq data
- Tumor-only data produced with a new pipeline. This includes raw and annotated VCFs and MAFs for the following projects. Note that all tumor-only variants are controlled-access:
  - BEATAML1.0-COHORT
  - BEATAML1.0-CRENOLANIB
  - CGCI-BLGSP
  - CPTAC-3
  - HCMI-CMDC
  - MATCH-B
  - MATCH-H
  - MATCH-I
  - MATCH-N
  - MATCH-Q
  - MATCH-U
  - MATCH-W
  - MATCH-Y
  - MATCH-Z1A
  - MATCH-Z1D
  - OHSU-CNL
  - ORGANOID-PANCREATIC
  - TARGET-ALL-P3
  - TARGET-WT
  - VAREPOP-APOLLO
New Metadata
- Sample type refactoring:
  - Four fields (tissue_type, specimen_type, preservation_method, tumor_descriptor) have been populated to contain the information that was previously populated in the sample_type field
  - The new field, specimen_type, is now available in the API to accommodate information about the biological makeup of the sample
- The follow up data for CPTAC-3 has been updated
Other Updates
- CNV mutations are now available on the exploration page for projects that only had ASCAT CNV data from WGS files. This includes CNV mutations for the following projects:
  - APOLLO-LUAD
  - CDDP_EAGLE-1
  - CGCI-BLGSP
  - CGCI-HTMCP-CC
  - CGCI-HTMCP-DLBCL
  - CGCI-HTMCP-LC
  - CPTAC-3
  - HCMI-CMDC
  - MP2PRT-ALL
  - REBC-THYR
- The GENIE program was removed from the GDC Portal because it was not representative of the latest version of GENIE
  - GENIE data can be accessed from the AACR Repositories

A complete list of files included in the GDC Data Portal can be found below:

Bugs Fixed Since Last Release

None

Known Issues and Workarounds

The slide image viewer does not display for any non-TCGA slides. At this time, these slides will need to be downloaded and viewed locally. Additionally, the slide image viewer does not display properly for 14 TCGA slides, which are identified here.
Pathology reports do not have any associated case/biospecimen information in the portal. This information can be found in the reports themselves.
397 alignments from the TCGA program were found to have contamination values over 0.04 (alignment list). The ensemble MAFs produced by these alignments were removed from the Data Portal.
One methylation aliquot from the TCGA-COAD project, TCGA-D5-6930-01A-11D-1926-05, was not added to the portal and will be added in a future release.
The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from the GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
TARGET projects
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data Portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow. (e.g. TARGET-20-PAMYAS-04A-03R)
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the NCI's webpage on Using TARGET Data. Data that is not present or is not the most up to date includes:
  - All microarray data and metadata
  - All sequencing analyzed data and metadata
  - 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
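
The barcode workaround for mislabeled TARGET bone marrow samples, described in the TARGET issues above, can be sketched in Python. This is a minimal illustration and not GDC tooling: the regular expression is inferred from the single example barcode in the note (TARGET-20-PAMYAS-04A-03R), and the function names are hypothetical. Only the -04 code comes from the note itself.

```python
import re

# Assumed layout, inferred from the example TARGET-20-PAMYAS-04A-03R:
# TARGET-<2-digit code>-<case id>-<2-digit sample code><optional vial letter>-...
BARCODE_RE = re.compile(r"^TARGET-\d{2}-[A-Z0-9]+-(\d{2})[A-Z]?(?:-|$)")

# From the release note: -04 marks Recurrent Blood Derived Cancer - Bone Marrow.
BONE_MARROW_CODE = "04"


def sample_type_code(barcode):
    """Return the two-digit sample code, or None if the barcode does not match."""
    match = BARCODE_RE.match(barcode)
    return match.group(1) if match else None


def is_recurrent_bone_marrow(barcode):
    """True when the barcode carries the -04 sample code."""
    return sample_type_code(barcode) == BONE_MARROW_CODE
```

Checking the barcode rather than the sample_type label avoids relying on the mislabeled field.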

Data Release 38.0

GDC Product: Data
Release Date: August 31, 2023

New Updates

New Projects
- MP2PRT-ALL - Molecular Profiling to Predict Response to Treatment for Acute Lymphoblastic Leukemia - phs002005
  - 1,507 cases
  - WGS
- CGCI-HTMCP-DLBCL - HIV+ Tumor Molecular Characterization Project - Diffuse Large B-Cell Lymphoma - phs000235
  - 70 cases
  - WGS, RNA-Seq, miRNA-Seq, Tissue Slide Images
- MATCH-B - Genomic Characterization CS-MATCH-0007 Arm B - phs002028
  - 33 cases
  - WXS, RNA-Seq
- MATCH-N - Genomic Characterization CS-MATCH-0007 Arm N - phs002151
  - 21 cases
  - WXS, RNA-Seq
New Cases from Existing Projects
- CPTAC-3 - GBM and Kidney cohorts - 50 cases
- HCMI-CMDC - 31 cases
- CGCI-BLGSP - 204 cases
- TCGA-TGCT - 113 cases
New Data Sets
- 9,368 WGS alignments from the TCGA program
  - 4,676 Cases
  - 9,368 Aliquots
- All methylation files that were produced with the SeSAMe pipeline were replaced with a new version.
- TCGA SNP6 data processed with the ASCAT3 and ABSOLUTE pipelines
- 172 CEL and birdseed files from TCGA SNP6
- Release of remaining data for CGCI projects CGCI-BLGSP and CGCI-HTMCP-CC
New Metadata
- The wgs_coverage field is now populated for most BAMs and will allow for WGS BAMs to be queried by coverage range category.
- The QC metrics for applicable BAMs are now queryable through the GDC Data Portal and API.
- The msi_status and msi_score fields, which were produced using MSISensor2, are now queryable through the GDC Data Portal and API
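
As a rough illustration of querying the new metadata through the API, the sketch below builds a request for the GDC files endpoint using the API's JSON filter syntax. It makes several assumptions: that msi_status is queryable as a top-level field of a file record, that "MSI" is a valid value, and that the returned fields shown are the ones wanted. The helper names are hypothetical. Check the API field listing before relying on any of these.

```python
import json
from urllib.parse import urlencode

FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_files_params(field, values, fields=("file_id", "file_name"), size=10):
    """Build query parameters that select files where `field` is one of `values`."""
    filters = {"op": "in", "content": {"field": field, "value": list(values)}}
    return {
        "filters": json.dumps(filters),  # the API expects the filter as a JSON string
        "fields": ",".join(fields),
        "format": "JSON",
        "size": str(size),
    }


def to_url(params):
    """Render the parameters as a GET URL for the files endpoint."""
    return FILES_ENDPOINT + "?" + urlencode(params)


# Assumed field name and value, for illustration only.
params = build_files_params("msi_status", ["MSI"])
url = to_url(params)
```

The same helper could be pointed at wgs_coverage or msi_score by changing the field and values, again assuming those field paths.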

A complete list of files included in the GDC Data Portal can be found below:

Bugs Fixed Since Last Release

The files produced with the SeSAMe pipeline contained unfiltered methylation beta values that should have been set to N/A for quality reasons. These files were replaced.
A bug in which certain files were shown to be associated with more aliquots than usual has been fixed.

Known Issues and Workarounds

The slide image viewer does not display for any non-TCGA slides. At this time, these slides will need to be downloaded and viewed locally. Additionally, the slide image viewer does not display properly for 14 TCGA slides, which are identified here.
Pathology reports do not have any associated case/biospecimen information in the portal. This information can be found in the reports themselves.
397 alignments from the TCGA program were found to have contamination values over 0.04 (alignment list). The ensemble MAFs produced by these alignments were removed from the Data Portal.
One methylation aliquot from the TCGA-COAD project, TCGA-D5-6930-01A-11D-1926-05, was not added to the portal and will be added in a future release.
The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from the GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
TARGET projects
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data Portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow. (e.g. TARGET-20-PAMYAS-04A-03R)
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the NCI's webpage on Using TARGET Data. Data that is not present or is not the most up to date includes:
  - All microarray data and metadata
  - All sequencing analyzed data and metadata
  - 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
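
The miRNA isoform coordinate issue listed above (end coordinates that are exclusive, i.e. offset by 1) can be corrected after download if an inclusive end is needed. The sketch below assumes the coordinate column holds strings of the form genome:chromosome:start-end:strand, for example hg38:chr9:94175961-94175983:+. That layout and the function name are assumptions, not something stated in this note.

```python
def to_inclusive_end(isoform_coords):
    """Convert an exclusive end coordinate to an inclusive one by subtracting 1.

    Assumed input layout: '<genome>:<chrom>:<start>-<end>:<strand>'.
    """
    genome, chrom, span, strand = isoform_coords.split(":")
    start, end = span.split("-")
    return f"{genome}:{chrom}:{start}-{int(end) - 1}:{strand}"
```

Quantification values are unaffected by this issue, so only the coordinate column needs adjusting.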

Data Release 37.0

GDC Product: Data
Release Date: March 29, 2023

New Updates

New Projects
- APOLLO-LUAD - Proteogenomic characterization of lung adenocarcinoma - phs003011
  - 87 cases
  - WGS, RNA-Seq
- CGCI-HTMCP-LC - HIV+ Tumor Molecular Characterization Project - Lung Cancer - phs000530
  - 39 cases
  - WGS, RNA-Seq, miRNA-Seq, Slide Images
- MATCH-Q - Genomic Characterization CS-MATCH-0007 Arm Q - phs001926
  - 35 cases
  - WXS, RNA-Seq
- MATCH-Y - Genomic Characterization CS-MATCH-0007 Arm Y - phs001904
  - 31 cases
  - WXS, RNA-Seq
New Data from Existing Projects
- CPTAC-3 - 139 new cases and two new snRNA-Seq samples
- HCMI-CMDC - 118 new cases
- TCGA-THCA - 941 new WGS alignments
- TARGET-OS and TARGET-ALL-P2 - Masked Somatic Mutation MAFs are now open access and their mutations now appear in the exploration portal.
Data Migrated from the Legacy Archive to Active Portal
- Birdseed files that were generated from Affymetrix SNP6 arrays
- Additional WGS Alignments are now available for TCGA projects
- Additional samples from RNA-Seq and WXS are now available for TCGA projects

A complete list of files included in the GDC Data Portal can be found below:

Unavailable Files

56 CPTAC-3 snRNA-Seq files are currently unavailable for download. A list of the affected files can be found here. These files will be restored for download by the next data release.

Bugs Fixed Since Last Release

Outcome data for the CPTAC program has been updated.
The age_at_index field was incorrectly reported in days in the GENIE program. These values have been removed because they contained the same information as the days_to_birth field.

Known Issues and Workarounds

The current files produced with the SeSAMe pipeline have unfiltered methylation beta values that should be set as N/A for quality reasons. These files will be replaced in a future release.
Pathology reports do not have any associated case/biospecimen information in the portal. This information can be found in the reports themselves.
397 alignments from the TCGA program were found to have contamination values over 0.04 (alignment list). The ensemble MAFs produced by these alignments were removed from the Data Portal.
One methylation aliquot from the TCGA-COAD project, TCGA-D5-6930-01A-11D-1926-05, was not added to the portal and will be added in a future release.
The clinical supplement for TARGET-ALL-P1 is not currently available. It will be made available in a future release.
The slide image viewer does not display properly for 14 slides, which are identified here. The full slide image can be downloaded as an SVS file.
The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release.
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
The portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg.
TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from the GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
  - TARGET-20-PASJGZ-04A-02D
  - TARGET-30-PAPTLY-01A-01D
  - TARGET-20-PAEIKD-09A-01D
  - TARGET-20-PASMYS-14A-02D
  - TARGET-20-PAMYAS-14A-02D
  - TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data Portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow. (e.g. TARGET-20-PAMYAS-04A-03R)
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
  - All microarray data and metadata
  - All sequencing analyzed data and metadata
  - 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
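
For the known issue above in which some tumor-only annotated VCFs contain variants that appear twice, one possible local workaround is to drop repeated records after download. The sketch below is an assumption-laden illustration: it treats two records as duplicates when CHROM, POS, REF, and ALT all match, which this note does not define, and the function name is hypothetical.

```python
def drop_duplicate_variants(vcf_lines):
    """Yield VCF lines, skipping records whose (CHROM, POS, REF, ALT) was already seen.

    Header lines (starting with '#') are passed through unchanged.
    """
    seen = set()
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        if not line.strip():
            continue  # ignore blank lines
        cols = line.rstrip("\n").split("\t")
        key = (cols[0], cols[1], cols[3], cols[4])  # CHROM, POS, REF, ALT
        if key in seen:
            continue
        seen.add(key)
        yield line
```

For a compressed file, the lines can be read with gzip.open(path, "rt") from the Python standard library and passed to the function directly.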

Data Release 36.0

GDC Product: Data
Release Date: December 12, 2022

New Updates

New Projects
- MATCH-Z1D - Genomic Characterization CS-MATCH-0007 Arm Z1D - phs001859
  - 36 cases
  - WXS, RNA-Seq
- CDDP_EAGLE-1 - CDDP Integrative Analysis of Lung Adenocarcinoma (Phase 2) - phs001239
  - 50 cases
  - WXS, WGS, RNA-Seq
New Data from Existing Projects
- CMI-MPC - new RNA-Seq and WXS data
Data Migrated from the Legacy Archive to Active Portal
- WGS Alignments are now available for 25 TCGA Projects
- Pathology reports from TCGA
- Affymetrix SNP6 Genotyping Array CEL files
- A set of WXS and RNA-Seq samples from TCGA and TARGET that failed harmonization at launch have been rerun and are now available in the active portal.
- TCGA Bisulfite-Seq files can be downloaded using the following manifests:
  - TARGET-RT
  - TCGA-BLCA
  - TCGA-BRCA
  - TCGA-COAD
  - TCGA-GBM
  - TCGA-LUAD
  - TCGA-LUSC
  - TCGA-READ
  - TCGA-STAD
  - TCGA-UCEC

A complete list of files included in the GDC Data Portal can be found below:

gdc_manifest_20221212_data_release_36.0_active.tsv.gz

Unavailable Files

None

Bugs Fixed Since Last Release

The copy number variation data is now available on the GDC Exploration portal.
The mutations on GDC Exploration were re-built with the correct gene model.

Known Issues and Workarounds

Outcome data for the CPTAC program is not up-to-date. Please visit the Proteomic Data Commons for updated outcome data for CPTAC.
Pathology reports do not have any associated case/biospecimen information in the portal. This information can be found in the reports themselves.
397 alignments from the TCGA program were found to have contamination values over 0.04 (alignment list). The ensemble MAFs produced by these alignments were removed from the Data Portal.
One methylation aliquot from the TCGA-COAD project, TCGA-D5-6930-01A-11D-1926-05, was not added to the portal and will be added in a future release.
The clinical supplement for TARGET-ALL-P1 is not currently available. It will be made available in a future release.
The slide image viewer does not display properly for 14 slides, which are identified here. The full slide image can be downloaded as an SVS file.
The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release.
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
The portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg.
TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from the GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
  - TARGET-20-PASJGZ-04A-02D
  - TARGET-30-PAPTLY-01A-01D
  - TARGET-20-PAEIKD-09A-01D
  - TARGET-20-PASMYS-14A-02D
  - TARGET-20-PAMYAS-14A-02D
  - TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data Portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow. (e.g. TARGET-20-PAMYAS-04A-03R)
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
  - All microarray data and metadata
  - All sequencing analyzed data and metadata
  - 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated

Data Release 35.0

GDC Product: Data
Release Date: September 28, 2022

New Updates

The SomaticSniper variant calling pipeline was deprecated. To support this, the following changes were made:
- All SomaticSniper files no longer appear in the portal, but still can be downloaded using the Data Transfer Tool or API using the original UUID.
- The aggregated somatic mutation and masked somatic mutation files (multi-caller MAFs) have been replaced to reflect the absence of variants from the SomaticSniper pipeline.
- The mutations on the exploration portal reflect the above-mentioned masked somatic mutation files.
10 snRNA-Seq samples were released from the CPTAC-3 project.
RNA-Seq samples from 2,082 additional cases are now available for the TARGET-AML project.
Demographic data has been added for 94 cases in TARGET-ALL-P2 and TARGET-ALL-P3 projects. A list of the updated cases can be found here.
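
Since the deprecated SomaticSniper files remain downloadable through the API by their original UUID, a download URL can be formed from the UUID alone. The sketch below only builds and validates the URL for the GDC API data endpoint and does not perform the download. The function name is hypothetical and the UUID shown is a placeholder, not a real file.

```python
import uuid

DATA_ENDPOINT = "https://api.gdc.cancer.gov/data"


def data_url(file_uuid):
    """Return the data endpoint URL for a file UUID.

    uuid.UUID raises ValueError if the string is not a well-formed UUID.
    """
    canonical = str(uuid.UUID(file_uuid))
    return f"{DATA_ENDPOINT}/{canonical}"


# Placeholder UUID for illustration only.
example = data_url("00000000-0000-0000-0000-000000000000")
```

With the Data Transfer Tool, the equivalent is the gdc-client download command followed by the UUID.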

A complete list of files included in the GDC Data Portal can be found below:

gdc_manifest_20220928_data_release_35.0_active.tsv.gz

Unavailable Files

None

Bugs Fixed Since Last Release

Data from two HCMI-CMDC aliquots (HCM-BROD-0100-C15-85A-01D-A786-36 and HCM-BROD-0679-C43-85M-01D-A80U-36) were incorrectly selected for inclusion into the Exploration Page in Data Release 32 and have been replaced with the correct aliquots (HCM-BROD-0100-C15-01A-11D-A786-36 and HCM-BROD-0679-C43-06A-11D-A80U-36).

Known Issues and Workarounds

The mutations on GDC Exploration were built with an incorrect gene model.
- The mutations are still correct in terms of the gene affected, coordinates, DNA changes, amino acid changes, and impact.
- Mutations associated with genes that were present in GENCODE v36 and not GENCODE v22 are not displayed. This affects less than 1% of mutations.
- Files downloaded from the GDC Repository are not affected by this issue. This only affects mutations that are downloaded from GDC Exploration.
Pathology reports do not have any associated case/biospecimen information in the portal. This information can be found in the reports themselves.
397 alignments from the TCGA program were found to have contamination values over 0.04 (alignment list). The ensemble MAFs produced by these alignments were removed from the Data Portal.
One methylation aliquot from the TCGA-COAD project, TCGA-D5-6930-01A-11D-1926-05, was not added to the portal and will be added in a future release.
The clinical supplement for TARGET-ALL-P1 is not currently available. It will be made available in a future release.
Copy number variations currently do not appear in the Exploration page. This will be restored in a future release.
Mutations from SomaticSniper were erroneously labelled as LOH (loss of heterozygosity). This affects the VCF files, MAF files, and may cause SomaticSniper mutations to be absent from ensemble MAFs.
The slide image viewer does not display properly for 14 slides, which are identified here. The full slide image can be downloaded as an SVS file.
The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release.
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
The portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg.
TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from the GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
  - TARGET-20-PASJGZ-04A-02D
  - TARGET-30-PAPTLY-01A-01D
  - TARGET-20-PAEIKD-09A-01D
  - TARGET-20-PASMYS-14A-02D
  - TARGET-20-PAMYAS-14A-02D
  - TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data Portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow. (e.g. TARGET-20-PAMYAS-04A-03R)
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
  - All microarray data and metadata
  - All sequencing analyzed data and metadata
  - 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated

Data Release 34.0

GDC Product: Data
Release Date: July 27, 2022

New updates

251 cases from the CPTAC-3 project were added to the portal. This includes all files associated with these cases.
243 cases from the BEATAML1.0-COHORT project were added to the portal. This includes most of the files associated with these cases.
- The raw tumor-only VCFs from BEATAML1.0-COHORT are downloadable from the BEATAML1.0-COHORT (2022) publication page here and will be added to the Data Portal in a future release.
WXS mutations from the BEATAML1.0-COHORT project are now available in the Exploration portal.
Transcript fusion files are now available for the following projects:
- BEATAML1.0-COHORT
- CMI-ASC
- CMI-MBC
- CPTAC-2
- CTSP-DLBCL1
- MMRF-COMMPASS
- NCICCR-DLBCL
- OHSU-CNL
- ORGANOID-PANCREATIC
- WCDT-MCRPC

A complete list of files included in the GDC Data Portal can be found below:

gdc_manifest_20220727_data_release_34.0_active.tsv.gz

Unavailable Files

None

Bugs Fixed Since Last Release

Data from two HCMI-CMDC aliquots (HCM-BROD-0100-C15-85A-01D-A786-36 and HCM-BROD-0679-C43-85M-01D-A80U-36) were incorrectly selected for inclusion in the Exploration Page in Data Release 32 and have been replaced with data from the correct aliquots (HCM-BROD-0100-C15-01A-11D-A786-36 and HCM-BROD-0679-C43-06A-11D-A80U-36).

Known Issues and Workarounds

Pathology reports do not have any associated case/biospecimen information in the portal. This information can be found in the reports themselves.
397 alignments from the TCGA program were found to have contamination values over 0.04 (alignment list). The ensemble MAFs produced by these alignments were removed from the Data Portal.
One methylation aliquot from the TCGA-COAD project, TCGA-D5-6930-01A-11D-1926-05, was not added to the portal and will be added in a future release.
The clinical supplement for TARGET-ALL-P1 is not currently available. It will be made available in a future release.
Copy number variations currently do not appear in the Exploration page. This will be restored in a future release.
Mutations from SomaticSniper were erroneously labelled as LOH (loss of heterozygosity). This affects the VCF files, MAF files, and may cause SomaticSniper mutations to be absent from ensemble MAFs.
The slide image viewer does not display properly for 14 slides, which are identified here. The full slide image can be downloaded as an SVS file.
The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release.
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
The Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg.
TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
  - TARGET-20-PASJGZ-04A-02D
  - TARGET-30-PAPTLY-01A-01D
  - TARGET-20-PAEIKD-09A-01D
  - TARGET-20-PASMYS-14A-02D
  - TARGET-20-PAMYAS-14A-02D
  - TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow. (e.g. TARGET-20-PAMYAS-04A-03R)
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
  - All microarray data and metadata
  - All sequencing analyzed data and metadata
  - 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
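
For the miRNA coordinate issue listed above, downstream code can shift the reported end position by one. The following is a minimal sketch, not part of any GDC pipeline. It assumes the coordinate field is a string of the form build:chromosome:start-end:strand; that layout and the function name are assumptions made for illustration.

```python
def inclusive_isoform_coords(coords: str) -> str:
    """Return the coordinate string with an inclusive end position.

    Assumes (not confirmed by these notes) a value shaped like
    'build:chrom:start-end:strand', where the end is exclusive.
    """
    build, chrom, span, strand = coords.split(":")
    start, end = span.split("-")
    # The reported end is offset by 1, so subtract 1 to make it inclusive.
    return f"{build}:{chrom}:{start}-{int(end) - 1}:{strand}"


if __name__ == "__main__":
    # Illustrative value only, not taken from a real file.
    print(inclusive_isoform_coords("hg38:chr9:94175961-94175983:+"))
```

Only the end coordinate changes; read counts and other columns are unaffected, consistent with the note that quantification is not impacted.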

Data Release 33.1

GDC Product: Data
Release Date: May 31, 2022

New updates

None, see "Bugs Fixed Since Last Release" section below.

A complete list of files included in the GDC Data Portal can be found below:

gdc_manifest_20220531_data_release_33.1_active.tsv.gz

Unavailable Files

None

Bugs Fixed Since Last Release

32 cases from the EXCEPTIONAL_RESPONDERS-ER project were released as they were missing from the previous release.
All mutations from EXCEPTIONAL_RESPONDERS-ER in the exploration portal come from WXS data, whereas they were previously a mixture of WXS and Targeted Sequencing.

Known Issues and Workarounds

Pathology reports do not have any associated case/biospecimen information in the portal. This information can be found in the reports themselves.
397 alignments from the TCGA program were found to have contamination values over 0.04 (alignment list). The ensemble MAFs produced by these alignments were removed from the Data Portal.
One methylation aliquot from the TCGA-COAD project, TCGA-D5-6930-01A-11D-1926-05, was not added to the portal and will be added in a future release.
The clinical supplement for TARGET-ALL-P1 is not currently available. It will be made available in a future release.
Copy number variations currently do not appear in the Exploration page. This will be restored in a future release.
Mutations from SomaticSniper were erroneously labelled as LOH (loss of heterozygosity). This affects the VCF files, MAF files, and may cause SomaticSniper mutations to be absent from ensemble MAFs.
The slide image viewer does not display properly for 14 slides, which are identified here. The full slide image can be downloaded as an SVS file.
The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release.
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
The Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg.
TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
  - TARGET-20-PASJGZ-04A-02D
  - TARGET-30-PAPTLY-01A-01D
  - TARGET-20-PAEIKD-09A-01D
  - TARGET-20-PASMYS-14A-02D
  - TARGET-20-PAMYAS-14A-02D
  - TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow. (e.g. TARGET-20-PAMYAS-04A-03R)
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
  - All microarray data and metadata
  - All sequencing analyzed data and metadata
  - 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
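
The sample barcode workaround listed under TARGET projects above can be applied programmatically. The following is a minimal sketch; it assumes the tissue code is the first two characters of the fourth hyphen-separated field of the barcode, as in the example TARGET-20-PAMYAS-04A-03R. The function names are hypothetical.

```python
def tissue_code(barcode: str) -> str:
    """Return the two-digit tissue code from a TARGET sample barcode.

    Assumes the layout TARGET-<project>-<case>-<code + vial>-..., which
    matches the example barcode quoted in these notes.
    """
    parts = barcode.split("-")
    if len(parts) < 4 or parts[0] != "TARGET":
        raise ValueError(f"not a TARGET sample barcode: {barcode!r}")
    return parts[3][:2]


def is_recurrent_bone_marrow(barcode: str) -> bool:
    """True when the barcode carries code 04 (Recurrent Blood Derived Cancer - Bone Marrow)."""
    return tissue_code(barcode) == "04"


if __name__ == "__main__":
    print(is_recurrent_bone_marrow("TARGET-20-PAMYAS-04A-03R"))
```

Checking the barcode rather than the sample_type value avoids relying on the mislabeled field.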

Data Release 33.0

GDC Product: Data
Release Date: May 3, 2022

New updates

New Project: NCI Exceptional Responders Initiative (EXCEPTIONAL_RESPONDERS-ER, phs001145)
- RNA-Seq - 45 Cases
- WXS - 50 Cases
- Targeted Sequencing - 41 Cases
- Mutations from WXS and Targeted Sequencing are present in the exploration page.
New Project: Molecular Profiling to Predict Response to Treatment - Wilms Tumor (MP2PRT-WT, phs001965)
- WGS - 52 Cases
- RNA-Seq - 52 Cases
- miRNA-Seq - 52 Cases
Methylation files from the SeSAMe pipeline are now available for CGCI-HTMCP-CC and the TARGET projects.

Complete lists of files for this release for the GDC Data Portal and the GDC Legacy Archive are found below:

Unavailable Files

The Arriba pipeline failed for one aliquot from EXCEPTIONAL_RESPONDERS-ER; the aliquot is documented here.

Bugs Fixed Since Last Release

Gene-level copy number files from TCGA-THCA and TCGA-UCEC were set as controlled-access files. These have been corrected to be available as open-access files.
Due to a problem with the columns generated by the pipeline, all scRNA-Seq files have been replaced with a new version.

Known Issues and Workarounds

397 alignments from the TCGA program were found to have contamination values over 0.04 (alignment list). The ensemble MAFs produced by these alignments were removed from the Data Portal.
One methylation aliquot from the TCGA-COAD project, TCGA-D5-6930-01A-11D-1926-05, was not added to the portal and will be added in a future release.
The clinical supplement for TARGET-ALL-P1 is not currently available. It will be made available in a future release.
Copy number variations currently do not appear in the Exploration page. This will be restored in a future release.
Mutations from SomaticSniper were erroneously labelled as LOH (loss of heterozygosity). This affects the VCF files, MAF files, and may cause SomaticSniper mutations to be absent from ensemble MAFs.
The slide image viewer does not display properly for 14 slides, which are identified here. The full slide image can be downloaded as an SVS file.
The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release.
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
The Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg.
TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
  - TARGET-20-PASJGZ-04A-02D
  - TARGET-30-PAPTLY-01A-01D
  - TARGET-20-PAEIKD-09A-01D
  - TARGET-20-PASMYS-14A-02D
  - TARGET-20-PAMYAS-14A-02D
  - TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow. (e.g. TARGET-20-PAMYAS-04A-03R)
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
  - All microarray data and metadata
  - All sequencing analyzed data and metadata
  - 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
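
The repeated variants noted above for tumor-only annotated VCFs can be filtered after download. The following is a minimal sketch, assuming that a repeated variant is a record sharing CHROM, POS, REF, and ALT with an earlier record. These notes do not define the duplication precisely, so that key and the function name are assumptions.

```python
from typing import Iterable, Iterator


def drop_repeated_variants(lines: Iterable[str]) -> Iterator[str]:
    """Yield VCF lines, keeping only the first record for each variant.

    Header lines (starting with '#') pass through unchanged. Records are
    keyed on CHROM, POS, REF, ALT (VCF columns 1, 2, 4, 5); this key is an
    assumption about what "appear twice" means.
    """
    seen = set()
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\n").split("\t")
        key = (fields[0], fields[1], fields[3], fields[4])
        if key in seen:
            continue
        seen.add(key)
        yield line


if __name__ == "__main__":
    demo = [
        "##fileformat=VCFv4.2\n",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
        "chr1\t100\t.\tA\tT\t.\tPASS\t.\n",
        "chr1\t100\t.\tA\tT\t.\tPASS\t.\n",
        "chr1\t100\t.\tA\tG\t.\tPASS\t.\n",
    ]
    print("".join(drop_repeated_variants(demo)), end="")
```

The generator works on any iterable of text lines, so it can be fed from an open file handle or from `gzip.open(path, "rt")` for compressed VCFs.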

Data Release 32.0

GDC Product: Data - GENCODE v36 Release
Release Date: March 29, 2022

New updates

New data files

The following data types have been replaced with new GENCODE v36 versions
- RNA-Seq: all files, including alignments, gene expression files, and transcript fusion files.
- WXS and Targeted Sequencing: annotated VCFs, single-caller MAFs, Ensemble MAFs.
- WGS: BEDPE-format structural variants and gene-level copy number variants.
- GENIE Targeted Sequencing files.
- FM-AD Targeted Sequencing files.
  - The primary-site-level FM-AD MAF files have been replaced with aliquot-level MAF files.
RNA-Seq STAR-Counts files now contain additional normalized counts such as FPKM, FPKM-UQ, and TPM.
All WXS files for TCGA have been replaced with new versions. Alignments now contain QC metrics, and variants were produced using the same pipelines as all other GDC projects.
TCGA RNA-Seq has been changed to contain three alignments (genomic, transcriptome, and chimeric), STAR-counts files, and transcript fusion files for each aliquot.
The project-level MAFs in TCGA and FM-AD have been replaced with aliquot-level MAFs.
GENCODE v22 derived files (not BAM) that no longer appear in the portal will be downloadable as previous versions of v36 files.
Methylation data produced from the SeSAMe pipeline is now available for all TCGA projects.
Note that miRNA-Seq data remains unchanged. The miRNA-Seq pipeline uses the miRBase database, which is not affected by the GENCODE version change.
A set of project-level manifests was generated that maps each v22 file to its corresponding v36 file. These can be used to help users transition from v22 to v36 and can be downloaded here.
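
The v22-to-v36 mapping manifests can be used to translate an existing list of file UUIDs. The following is a minimal sketch, assuming a tab-separated manifest with one column for the v22 file ID and one for the v36 file ID. The column names `v22_file_id` and `v36_file_id` and both function names are hypothetical; check the header of the downloaded manifest before use.

```python
import csv
from typing import Dict, Iterable, List, TextIO, Tuple


def load_v22_to_v36(
    handle: TextIO,
    old_col: str = "v22_file_id",  # hypothetical column name
    new_col: str = "v36_file_id",  # hypothetical column name
) -> Dict[str, str]:
    """Read a tab-separated mapping manifest into a dict of old ID -> new ID."""
    reader = csv.DictReader(handle, delimiter="\t")
    # Rows without a v36 ID are skipped, since they have no replacement file.
    return {row[old_col]: row[new_col] for row in reader if row.get(new_col)}


def translate_ids(
    ids: Iterable[str], mapping: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Return (v36 IDs found, v22 IDs with no v36 counterpart)."""
    found, missing = [], []
    for file_id in ids:
        if file_id in mapping:
            found.append(mapping[file_id])
        else:
            missing.append(file_id)
    return found, missing
```

Keeping the unmatched IDs in a separate list makes it easy to see which v22 files have no v36 version.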

Removed data files and pipelines

Files from the HTSeq pipeline are no longer supported and will no longer appear in the portal. Normalized counts can now be found in the STAR-Counts files.
Files that originated from the methylation liftover pipeline are no longer supported and will no longer appear in the portal.
GENCODE v22 BAM files that no longer appear in the portal will be available for six months past this release. They may not be available after that.
New variant calling tumor-normal pairing was implemented in TCGA, which results in certain aliquots no longer being available as a v36 version (see the aliquots labeled "Unpaired Aliquots" here).
Some aliquots failed harmonization when the new v36 gene model was used, which results in some new versions no longer being available (see the aliquots labeled "Failed Harmonization" here).
Some aliquots were found to contain a cross-patient contamination level of over 0.04 as measured by GATK4 CalculateContamination (see the aliquots labeled "Contamination" here).
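
Users running their own contamination checks can apply the same 0.04 cutoff. The following is a minimal sketch, assuming a tab-separated table with `sample` and `contamination` columns; that layout is an assumption about the contamination table and should be checked against the actual tool output. The function name is hypothetical.

```python
import csv
from typing import List, TextIO

CONTAMINATION_CUTOFF = 0.04  # threshold stated in these release notes


def aliquots_over_cutoff(
    handle: TextIO, cutoff: float = CONTAMINATION_CUTOFF
) -> List[str]:
    """Return sample names whose contamination estimate is over the cutoff.

    Assumes tab-separated input with 'sample' and 'contamination' columns.
    A value exactly equal to the cutoff is not flagged, matching "over 0.04".
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row["sample"]
        for row in reader
        if float(row["contamination"]) > cutoff
    ]
```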

Data Portal Exploration Data

The Data Portal Exploration Page is now populated based on open-access mutations from analyses that used GENCODE v36.
Mutations from SomaticSniper will not appear on the Exploration page.
Due to the copy number variation pipeline transition from GISTIC to ASCAT, the CNV data was not included in the GDC Exploration page. This will be replaced in a future release once visualization of the new pipeline is fully assessed.
The TCGA program mutations have been processed using the same pipeline as all other projects, which resulted in a 26% reduction in the number of open-access mutations. Some points on this change are listed below with TCGA-BRCA as the benchmark project:
- 97% of the previously released open-access mutations are still discoverable in the new GDC controlled-access MAFs. This number increases to 99.95% when focusing only on mutations that were also called by MC3.
- Somatic mutations will now be removed from the Data Portal Exploration Page unless they are detected by more than one variant calling software. This accounts for 40% of the total reduction.
- Somatic mutations will now be removed from the Data Portal Exploration Page if they are detected outside of the target capture region, while previously out-of-target mutations detected from the TCGA Gene Annotation File (GAF) regions were allowed. This accounts for 36% of the total reduction.
- Some TCGA-specific variant-rescue steps have been removed in favor of a more robust and uniform filtering pipeline.
- Some other minor changes due to updates in the gene model or other databases (e.g., the ExAC germline variant database was replaced with gnomAD in DR32).
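
As a rough worked example, the shares above can be converted into fractions of the original mutation count. The sketch below assumes the 26% program-wide reduction applies to the benchmark shares (40% and 36% of the reduction), which these notes do not state explicitly, so the results are illustrative only. The function name is hypothetical.

```python
def share_of_original(total_reduction: float, share_of_reduction: float) -> float:
    """Fraction of the original mutations removed by one filtering change."""
    return total_reduction * share_of_reduction


total = 0.26  # overall reduction in open-access mutations
multi_caller = share_of_original(total, 0.40)  # multi-caller requirement
off_target = share_of_original(total, 0.36)    # outside the target capture region
other = total - multi_caller - off_target      # remaining changes combined

print(f"multi-caller: {multi_caller:.1%}")
print(f"off-target:   {off_target:.1%}")
print(f"other:        {other:.1%}")
```

Under that assumption, the multi-caller requirement alone would remove roughly a tenth of the previously released open-access mutations.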

Complete lists of files for this release for the GDC Data Portal and the GDC Legacy Archive are found below:

Bugs Fixed Since Last Release

None

Known Issues and Workarounds

397 alignments from the TCGA program were found to have contamination values over 0.04 (alignment list). The ensemble MAFs produced by these alignments were removed from the Data Portal.
One methylation aliquot from the TCGA-COAD project, TCGA-D5-6930-01A-11D-1926-05, was not added to the portal and will be added in a future release.
The clinical supplement for TARGET-ALL-P1 is not currently available. It will be made available in a future release.
Copy number variations currently do not appear in the Exploration page. This will be restored in a future release.
Mutations from SomaticSniper were erroneously labelled as LOH (loss of heterozygosity). This affects the VCF files, MAF files, and may cause SomaticSniper mutations to be absent from ensemble MAFs.
The slide image viewer does not display properly for 14 slides, which are identified here. The full slide image can be downloaded as an SVS file.
The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release.
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
The Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg.
TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
  - TARGET-20-PASJGZ-04A-02D
  - TARGET-30-PAPTLY-01A-01D
  - TARGET-20-PAEIKD-09A-01D
  - TARGET-20-PASMYS-14A-02D
  - TARGET-20-PAMYAS-14A-02D
  - TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow. (e.g. TARGET-20-PAMYAS-04A-03R)
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
  - All microarray data and metadata
  - All sequencing analyzed data and metadata
  - 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
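The barcode workaround listed above for mislabeled TARGET bone marrow samples can be sketched in a few lines of Python. This is only an illustration: it assumes barcodes follow the layout of the example TARGET-20-PAMYAS-04A-03R, where the fourth hyphen-separated field begins with the two-digit sample code, and the function name is hypothetical.

```python
def is_recurrent_bone_marrow(barcode: str) -> bool:
    """Return True when the barcode carries sample code 04.

    Assumption: the barcode looks like TARGET-20-PAMYAS-04A-03R, so the
    fourth hyphen-separated field starts with the two-digit sample code.
    """
    fields = barcode.split("-")
    if len(fields) < 4:
        # Too short to contain a sample code (e.g. a case-level ID).
        return False
    return fields[3][:2] == "04"


# Barcodes taken from the notes above.
print(is_recurrent_bone_marrow("TARGET-20-PAMYAS-04A-03R"))
print(is_recurrent_bone_marrow("TARGET-20-PASMYS-14A-02D"))
```

A check like this relies on the barcode rather than on the sample_type value, which is the field affected by the mislabeling.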

Data Release 31.0

GDC Product: Data
Release Date: October 29, 2021

New updates

TCGA Slide Images:
- All TCGA slide images that were removed earlier this year have been restored.
- Note that the UUIDs for most TCGA slide images have changed. Older manifest files may not work when downloading slide images.
CPTAC-3 clinical data has been refreshed and includes new follow up entities.
REBC-THYR
- The clinical and biospecimen XML files were removed as they were not intended for release in DR 30.
- The case REBC-ADL5 was added, which includes one WGS pair.

Complete lists of files for this release in the GDC Data Portal and the GDC Legacy Archive are found below:

Bugs Fixed Since Last Release

One file from a previous version of the methylation pipeline appeared in the data portal (bd2f864a-3f00-47b5-815d-bd01ca21ef61; CPTAC-3). This file should no longer appear in the data portal.

Known Issues and Workarounds

The slide image viewer does not display properly for 14 slides, which are identified here. The full slide image can be downloaded as an SVS file.
The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
BAM files produced by the GDC RNA-Seq Alignment workflow currently fail validation with the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows, including expression quantification.
The Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg.
TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
  - TARGET-20-PASJGZ-04A-02D
  - TARGET-30-PAPTLY-01A-01D
  - TARGET-20-PAEIKD-09A-01D
  - TARGET-20-PASMYS-14A-02D
  - TARGET-20-PAMYAS-14A-02D
  - TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data Portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow. (e.g. TARGET-20-PAMYAS-04A-03R)
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
  - All microarray data and metadata
  - All sequencing analyzed data and metadata
  - 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. The impacted patient is TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
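For the exclusive end coordinates in the miRNA isoform quantification files noted above, a reader who needs inclusive ends can subtract 1 from each end position. The sketch below assumes each coordinate value is a string of the form build:chromosome:start-end:strand; that layout, the function name, and the example value are assumptions for illustration and should be checked against the actual files.

```python
def to_inclusive_end(coords: str) -> str:
    """Rewrite 'build:chrom:start-end:strand' so the end is inclusive.

    Assumption: the end coordinate in the input is exclusive (one past
    the last aligned base), as described in the known issue.
    """
    build, chrom, span, strand = coords.split(":")
    start, end = span.split("-")
    return f"{build}:{chrom}:{start}-{int(end) - 1}:{strand}"


# Hypothetical coordinate string, not taken from a real file.
print(to_inclusive_end("hg38:chr1:100-123:+"))  # hg38:chr1:100-122:+
```

Only the reported coordinate changes; read counts and other columns are unaffected by this issue and need no adjustment.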

Data Release 30.0

GDC Product: Data
Release Date: September 23, 2021

New updates

New Projects:
- TRIO-CRU (phs001163) - Ukrainian National Research Center for Radiation Medicine Trio Study
  - WGS Alignments
- REBC-THYR (phs001134) - Comprehensive genomic characterization of radiation-related papillary thyroid cancer in the Ukraine
  - miRNA-Seq
  - RNA-Seq
  - WGS
CPTAC Program
- CPTAC-3 methylation data produced from the SeSAMe pipeline is now available.
- CPTAC-2 miRNA-Seq files have been replaced with better quality data.
HCMI-CMDC
- 31 new cases have been released to the GDC Data Portal.
- Methylation data produced from the SeSAMe pipeline is now available.
TCGA
- Protein expression data (RPPA) is now available for 32 projects.
- RNA-Seq data for TCGA-TGCT was replaced with files from an updated pipeline.
TARGET-AML - New RNA-Seq and miRNA-Seq aliquots have been released.

Complete lists of files for this release in the GDC Data Portal and the GDC Legacy Archive are found below:

Bugs Fixed Since Last Release

None

Known Issues and Workarounds

One file from a previous version of the methylation pipeline appears in the data portal (bd2f864a-3f00-47b5-815d-bd01ca21ef61; CPTAC-3). This file cannot be downloaded, but may cause bulk downloads to fail. Remove this file from any manifest or cart you plan on downloading.
The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
BAM files produced by the GDC RNA-Seq Alignment workflow currently fail validation with the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows, including expression quantification.
The Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg.
TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
  - TARGET-20-PASJGZ-04A-02D
  - TARGET-30-PAPTLY-01A-01D
  - TARGET-20-PAEIKD-09A-01D
  - TARGET-20-PASMYS-14A-02D
  - TARGET-20-PAMYAS-14A-02D
  - TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data Portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow. (e.g. TARGET-20-PAMYAS-04A-03R)
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
  - All microarray data and metadata
  - All sequencing analyzed data and metadata
  - 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. The impacted patient is TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
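The workaround above for the stale CPTAC-3 methylation file (removing it from a manifest before a bulk download) could be scripted roughly as follows. This sketch assumes the manifest is tab-separated text with a header row containing an "id" column; the function name and the sample rows are hypothetical.

```python
import csv
import io

# UUID of the file named in the known issue above.
STALE_ID = "bd2f864a-3f00-47b5-815d-bd01ca21ef61"


def drop_file(manifest_text: str, file_id: str = STALE_ID) -> str:
    """Return the manifest text without the row whose id equals file_id."""
    reader = csv.DictReader(io.StringIO(manifest_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        if row["id"] != file_id:
            writer.writerow(row)
    return out.getvalue()


# Hypothetical two-row manifest; only the first id is real.
example = (
    "id\tfilename\tmd5\tsize\tstate\n"
    f"{STALE_ID}\told.txt\tabc\t1\tvalidated\n"
    "00000000-0000-0000-0000-000000000000\tkeep.txt\tdef\t2\tvalidated\n"
)
print(drop_file(example))
```

The filtered text can then be written to a new manifest file and used for the download in place of the original.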

Data Release 29.0

GDC Product: Data
Release Date: March 31, 2021

New updates

Count Me In Program
- Aliquot-level MAFs are now available for projects CMI-ASC, CMI-MBC, and CMI-MPC.
- Somatic mutations are now explorable for projects CMI-ASC, CMI-MBC, and CMI-MPC.
CPTAC Program
- CPTAC-2 open-access somatic mutations are now browsable through the GDC Exploration Portal.
- MSI data is now browsable through the faceted search for CPTAC-2 and CPTAC-3.
HCMI-CMDC - Data files and explorable mutations for 18 new cases are now available.

Complete lists of files for this release in the GDC Data Portal and the GDC Legacy Archive are found below:

Bugs Fixed Since Last Release

The aggregated and masked MAF files that were missing for seven pancreatic cases in CPTAC-3 have been restored to the data portal.
The missing RNA-Seq data files for the seven normal pancreatic cases in CPTAC-3 have been restored to the data portal.

Known Issues and Workarounds

The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
BAM files produced by the GDC RNA-Seq Alignment workflow currently fail validation with the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows, including expression quantification.
The Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg.
TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
  - TARGET-20-PASJGZ-04A-02D
  - TARGET-30-PAPTLY-01A-01D
  - TARGET-20-PAEIKD-09A-01D
  - TARGET-20-PASMYS-14A-02D
  - TARGET-20-PAMYAS-14A-02D
  - TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data Portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow. (e.g. TARGET-20-PAMYAS-04A-03R)
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
  - All microarray data and metadata
  - All sequencing analyzed data and metadata
  - 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. The impacted patient is TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
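For the tumor-only annotated VCFs noted above, where a small proportion of variants may appear twice, one possible way to collapse repeats is to keep the first record per CHROM, POS, REF, and ALT. This is a sketch under assumptions: the notes do not say whether the repeated records are identical in every column, so keying on these four standard VCF columns is a choice made here for illustration, and the function name is hypothetical.

```python
def drop_duplicate_variants(vcf_lines):
    """Keep header lines and the first record for each (CHROM, POS, REF, ALT)."""
    seen = set()
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            # Meta-information and column header lines pass through unchanged.
            kept.append(line)
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            # Not a well-formed record; leave it for the caller to inspect.
            kept.append(line)
            continue
        key = (cols[0], cols[1], cols[3], cols[4])
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept


# Hypothetical records, not taken from a GDC file.
example = [
    "#CHROM\tPOS\tID\tREF\tALT\n",
    "chr1\t100\t.\tA\tT\n",
    "chr1\t100\t.\tA\tT\n",
    "chr1\t200\t.\tG\tC\n",
]
print(len(drop_duplicate_variants(example)))  # 3
```

Comparing the dropped records against the kept ones before discarding them is advisable, since annotation columns could differ between the two copies.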

Data Release 28.0

GDC Product: Data
Release Date: February 2, 2021

New updates

1. New Project: CMI-MPC - Count Me In - The Metastatic Prostate Cancer Project
  - WXS alignments and variant calls (VCFs) are available.
2. New Data Type: Single nuclei (snRNA-Seq) data is now available for 18 CPTAC-3 cases. See the RNA-Seq documentation for details.
3. CPTAC-3
  - Data files for 147 new cases from the pancreatic cohort are now available.
  - CPTAC-3 open-access somatic mutations are now browsable through the GDC Exploration Portal.
  - RNA-Seq transcript fusion files are now available.
  - Targeted Sequencing alignments and raw tumor-only variant calls (VCF) are now available.
4. HCMI-CMDC
  - Data files for 22 new cases are now available.
  - The HCMI-CMDC open-access somatic mutations have been refreshed on the GDC Exploration Portal to reflect all newly released cases.
Complete lists of files for this release in the GDC Data Portal and the GDC Legacy Archive are found below:
- gdc_manifest_20210202_data_release_28.0_active.tsv.gz
- gdc_manifest_20210202_data_release_28.0_legacy.tsv.gz
Bugs Fixed Since Last Release
- None
Known Issues and Workarounds
- The aggregated and masked MAF files for seven pancreatic cases in CPTAC-3 do not appear in the Data Portal. See below for download instructions.
  - This manifest can be used to download the files.
  - To download the raw aggregated MAF files, dbGaP access to CPTAC-3 (phs001287) is required. The masked MAF files are open-access.
  - The seven cases are as follows: C3L-04027, C3L-04080, C3N-02585, C3N-02768, C3N-02971, C3N-03754, and C3N-03839. The case that each file is associated with is denoted in the manifest.
- The RNA-Seq data files for the seven normal pancreatic cases in CPTAC-3 do not appear in the Data Portal. See below for download instructions.
  - This manifest can be used to download the files.
  - To download the alignments or splice-junction files, dbGaP access to CPTAC-3 (phs001287) is required. The other gene expression files are open-access.
  - The seven cases are as follows: C3L-03513, C3L-07032, C3L-07033, C3L-07034, C3L-07035, C3L-07036, and C3L-07037. The case that each file is associated with is denoted in the manifest.
- The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
- Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow currently fail validation with the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows, including expression quantification.
- The Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg.
- TCGA Projects
  - Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
  - 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
  - Two tissue slide images are unavailable for download from GDC Data Portal
  - The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
  - Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
  - Tumor grade property is not populated
  - Progression_or_recurrence property is not populated
- TARGET projects
  - TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
    - TARGET-20-PASJGZ-04A-02D
    - TARGET-30-PAPTLY-01A-01D
    - TARGET-20-PAEIKD-09A-01D
    - TARGET-20-PASMYS-14A-02D
    - TARGET-20-PAMYAS-14A-02D
    - TARGET-10-PAPZST-09A-01D
  - 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data Portal
  - There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
  - There are two cases with identical submitter_id TARGET-10-PARUYU
  - Some TARGET cases are missing days_to_last_follow_up
  - Some TARGET cases are missing age_at_diagnosis
  - Some TARGET files are not connected to all related aliquots
  - Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow. (e.g. TARGET-20-PAMYAS-04A-03R)
  - The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
    - All microarray data and metadata
    - All sequencing analyzed data and metadata
    - 1180 of 12063 sequencing runs of raw data
  - Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. The impacted patient is TARGET-50-PAJNUS.
  - No data from TARGET-MDLS is available.
- Issues in the Legacy Archive
  - The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
  - Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
  - SDF Files are not linked to Project or Case in the Legacy Archive
  - Two biotab files are not linked to Project or Case in the Legacy Archive
  - SDRF files are not linked to Project or Case in the Legacy Archive
  - TARGET-MDLS cases do not have disease_type or primary_site populated
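The sample-count discrepancy between public MAF files from different pipelines, noted above, can be inspected by comparing the sample barcodes present in each file. The sketch below assumes decompressed, tab-separated MAF text with optional comment lines starting with "#" and the standard Tumor_Sample_Barcode column; the function names and sample values are hypothetical.

```python
import csv


def maf_samples(maf_text: str) -> set:
    """Collect the Tumor_Sample_Barcode values found in MAF-formatted text."""
    # Drop blank lines and '#' comment lines so the first remaining line is the header.
    data_lines = [ln for ln in maf_text.splitlines() if ln and not ln.startswith("#")]
    reader = csv.DictReader(data_lines, delimiter="\t")
    return {row["Tumor_Sample_Barcode"] for row in reader}


def samples_only_in_first(maf_a: str, maf_b: str) -> set:
    """Samples with at least one row in maf_a but none in maf_b."""
    return maf_samples(maf_a) - maf_samples(maf_b)


# Hypothetical MAF fragments for two pipelines on the same project.
pipeline_a = "#version 2.4\nHugo_Symbol\tTumor_Sample_Barcode\nTP53\tSAMPLE-1\nKRAS\tSAMPLE-2\n"
pipeline_b = "Hugo_Symbol\tTumor_Sample_Barcode\nTP53\tSAMPLE-1\n"
print(samples_only_in_first(pipeline_a, pipeline_b))  # {'SAMPLE-2'}
```

A sample reported only for one pipeline is consistent with the explanation above: it had no PASS variants in the other pipeline and was therefore omitted from that public MAF.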

Data Release 27.0 Bug Fix

GDC Product: Data
Release Date: November 9, 2020

New updates

None, see bug fix section below.

Complete lists of files for this release in the GDC Data Portal and the GDC Legacy Archive are found below:

Bugs Fixed Since Last Release

Some files in projects CGCI-BLGSP, CGCI-HTMCP-CC, and HCMI-CMDC were marked on the portal as controlled-access, when they were supposed to be open-access. These are now downloadable as open-access files.

Known Issues and Workarounds

The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
BAM files produced by the GDC RNA-Seq Alignment workflow currently fail validation with the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows, including expression quantification.
The Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg.
TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
  - TARGET-20-PASJGZ-04A-02D
  - TARGET-30-PAPTLY-01A-01D
  - TARGET-20-PAEIKD-09A-01D
  - TARGET-20-PASMYS-14A-02D
  - TARGET-20-PAMYAS-14A-02D
  - TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data Portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow. (e.g. TARGET-20-PAMYAS-04A-03R)
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
  - All microarray data and metadata
  - All sequencing analyzed data and metadata
  - 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. The impacted patient is TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
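
The barcode workaround for mislabeled TARGET sample types noted above can be sketched in a few lines. This is a minimal illustration, not GDC code: the function names are hypothetical, and the barcode layout (sample code as the first two digits of the fourth dash-separated field) is an assumption inferred from the example TARGET-20-PAMYAS-04A-03R.

```python
# Sample code that the known issue associates with
# "Recurrent Blood Derived Cancer - Bone Marrow".
BONE_MARROW_RECURRENT_CODE = "04"


def sample_code(barcode: str) -> str:
    """Return the two-digit sample code from a TARGET aliquot barcode.

    Assumed layout: TARGET-<disease>-<case>-<sample code + vial>-<portion + analyte>,
    e.g. TARGET-20-PAMYAS-04A-03R, where the fourth field starts with the code.
    """
    fields = barcode.split("-")
    if len(fields) < 4 or not fields[3][:2].isdigit() or len(fields[3]) < 2:
        raise ValueError(f"unexpected barcode layout: {barcode!r}")
    return fields[3][:2]


def is_recurrent_bone_marrow(barcode: str) -> bool:
    """True when the barcode carries the -04 sample code."""
    return sample_code(barcode) == BONE_MARROW_RECURRENT_CODE
```

Checking the barcode rather than the sample_type field avoids relying on the mislabeled value.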

Data Release 27.0

GDC Product: Data
Release Date: October 29, 2020

New updates

Initial release for the WGS variant calling pipeline. See the documentation on WGS variant calling for more details on the available files. This includes data from the following projects:
- CGCI-BLGSP
- CGCI-HTMCP-CC
- HCMI-CMDC
RNA-Seq transcript fusion files are available for the following projects:
- CGCI-BLGSP
- CGCI-HTMCP-CC
- HCMI-CMDC
Aliquot level MAFs were released for CGCI-HTMCP-CC Targeted Sequencing variants. Open access MAFs are included.
17 new cases were released for the HCMI-CMDC project. This includes WGS, WXS, and RNA-Seq data.
WGS alignments were released for 99 TCGA-LUAD cases (196 files).
Therapeutic agents (treatment) and tumor stage (diagnosis) properties were migrated to remove deprecated values and better adhere to a standardized set of values.

Complete lists of files for DR27.0 in the GDC Data Portal and the GDC Legacy Archive are found below:

Bugs Fixed Since Last Release

None

Known Issues and Workarounds

Some files in projects CGCI-BLGSP, CGCI-HTMCP-CC, and HCMI-CMDC are marked on the portal as controlled-access. These files are publicly downloadable using the Data Transfer Tool or API. All files from the following data types should be open-access within the previously specified projects: Biospecimen Supplement, Clinical Supplement, Gene Expression Quantification, Masked Somatic Mutation
The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
  - TARGET-20-PASJGZ-04A-02D
  - TARGET-30-PAPTLY-01A-01D
  - TARGET-20-PAEIKD-09A-01D
  - TARGET-20-PASMYS-14A-02D
  - TARGET-20-PAMYAS-14A-02D
  - TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data Portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow. (e.g. TARGET-20-PAMYAS-04A-03R)
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
  - All microarray data and metadata
  - All sequencing analyzed data and metadata
  - 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
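
For the exclusive end coordinates in the isoform quantification files noted above, one possible correction is to subtract 1 from the reported end. The sketch below is illustrative only: the function name is hypothetical, and the coordinate string layout (genome:chromosome:start-end:strand) is an assumption about the isoform_coords column, not something stated in these notes.

```python
def inclusive_isoform_coords(coords: str) -> str:
    """Convert an exclusive end coordinate to an inclusive one.

    Assumed input layout: "<genome>:<chrom>:<start>-<end>:<strand>",
    where <end> is one past the last aligned base (exclusive).
    The start coordinate and all other fields are left unchanged.
    """
    genome, chrom, span, strand = coords.split(":")
    start, end = span.split("-")
    return f"{genome}:{chrom}:{start}-{int(end) - 1}:{strand}"
```

Because the issue affects only the reported coordinates, read counts in the same file need no adjustment.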

Data Release 26.0

GDC Product: Data
Release Date: September 8, 2020

New updates

New program released:
- Count Me In (CMI)
  - CMI-ASC - The Angiosarcoma Project
    - RNA-Seq
    - WXS
  - CMI-MBC - The Metastatic Breast Cancer Project
    - RNA-Seq
    - WXS
Somatic mutations are now available on the exploration portal for the following projects:
- MMRF-COMMPASS
- TARGET-ALL-P3
- TARGET-AML
- TARGET-NBL
- TARGET-WT
Primary sites and disease types were updated for multiple projects to correspond to GDC Dictionary updates.

Complete lists of files for DR26.0 in the GDC Data Portal and the GDC Legacy Archive are found below:

Bugs Fixed Since Last Release

The CPTAC-3 head and neck cohort can now be queried by choosing the head and neck anatomic site on the GDC home page.

Known Issues and Workarounds

The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
  - TARGET-20-PASJGZ-04A-02D
  - TARGET-30-PAPTLY-01A-01D
  - TARGET-20-PAEIKD-09A-01D
  - TARGET-20-PASMYS-14A-02D
  - TARGET-20-PAMYAS-14A-02D
  - TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data Portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow. (e.g. TARGET-20-PAMYAS-04A-03R)
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
  - All microarray data and metadata
  - All sequencing analyzed data and metadata
  - 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated

Data Release 25.0

GDC Product: Data
Release Date: July 22, 2020

New updates

1. New data types released:
  - RNA-Seq Transcript Fusion files were released for the following projects:
    - TARGET-ALL-P1
    - TARGET-ALL-P2
    - TARGET-ALL-P3
    - TARGET-CCSK
    - TARGET-NBL
    - TARGET-OS
    - TARGET-RT
    - TARGET-WT
  - The msi_status and msi_score properties can be queried on the GDC Portal for the CPTAC-3 project.
    - To query for these fields: go to the GDC Repository, click on "Add a File Filter" at the top left of the screen, type msi_score or msi_status in the field, and click on "msi_score" or "msi_status". This should bring up the corresponding filters to use on the portal.
2. 108 cases from the CPTAC-3 LSCC Cohort were released. Includes the following data types:
  - WXS
  - WGS
  - RNA-Seq
  - miRNA-Seq
3. Aliquot level MAFs were released for MMRF-COMMPASS WXS variants. Open access MAFs are included.
4. HCMI-CMDC open-access somatic mutations were released to the Exploration Portal.
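
The msi_status filter described in item 1 can also be expressed as a filter object in the nested op/content form used by the GDC API. The sketch below only builds that object; it makes no network call. The helper name is hypothetical, and the field name "msi_status" on file records and the example value passed in are assumptions that should be checked against the GDC Data Dictionary before use.

```python
import json


def msi_filter(project_id: str, msi_status: str) -> dict:
    """Build a GDC-style filter selecting files by project and MSI status.

    Assumption: file records expose a top-level "msi_status" field.
    """
    return {
        "op": "and",
        "content": [
            {
                "op": "in",
                "content": {
                    "field": "cases.project.project_id",
                    "value": [project_id],
                },
            },
            {
                "op": "in",
                "content": {"field": "msi_status", "value": [msi_status]},
            },
        ],
    }


# Serialize the filter so it can be passed as a request parameter.
# "MSI" is a placeholder value used for illustration only.
encoded = json.dumps(msi_filter("CPTAC-3", "MSI"))
```

The serialized string would typically be supplied as the filters parameter of a files query; swapping "msi_status" for "msi_score" with a range operator is a natural variation, again subject to the same assumption about field names.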
Complete lists of files for DR25.0 in the GDC Data Portal and the GDC Legacy Archive are found below:
- gdc_manifest_20200722_data_release_25.0_active.tsv.gz
- gdc_manifest_20200722_data_release_25.0_legacy.tsv.gz
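
As the .tsv.gz extension suggests, the manifests listed above are gzip-compressed, tab-separated tables, so they can be read with the Python standard library alone. This is a minimal sketch with a hypothetical function name; it takes column names from the file's own header row and assumes nothing about which columns are present.

```python
import csv
import gzip
from typing import Dict, Iterator


def read_manifest(path: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per manifest row, keyed by the header row's column names."""
    with gzip.open(path, "rt", newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")
```

For example, `sum(1 for _ in read_manifest("gdc_manifest_20200722_data_release_25.0_active.tsv.gz"))` would count the rows in a locally downloaded copy without loading the whole file into memory.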
Bugs Fixed Since Last Release
- A few supplements from CGCI-BLGSP are now associated with their correct versions.
Known Issues and Workarounds
- Currently the CPTAC-3 HNSCC cohort does not appear when the "Head and Neck" primary site is selected from the GDC home page. This cohort can be queried by clicking here
- The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
- Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
- Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
- TCGA Projects
  - Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
  - 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
  - Two tissue slide images are unavailable for download from GDC Data Portal
  - The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
  - Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
  - Tumor grade property is not populated
  - Progression_or_recurrence property is not populated
- TARGET projects
  - TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
    - TARGET-20-PASJGZ-04A-02D
    - TARGET-30-PAPTLY-01A-01D
    - TARGET-20-PAEIKD-09A-01D
    - TARGET-20-PASMYS-14A-02D
    - TARGET-20-PAMYAS-14A-02D
    - TARGET-10-PAPZST-09A-01D
  - 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data Portal
  - There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
  - There are two cases with identical submitter_id TARGET-10-PARUYU
  - Some TARGET cases are missing days_to_last_follow_up
  - Some TARGET cases are missing age_at_diagnosis
  - Some TARGET files are not connected to all related aliquots
  - Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow. (e.g. TARGET-20-PAMYAS-04A-03R)
  - The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
    - All microarray data and metadata
    - All sequencing analyzed data and metadata
    - 1180 of 12063 sequencing runs of raw data
  - Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
  - No data from TARGET-MDLS is available.
- Issues in the Legacy Archive
  - The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
  - Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
  - SDF Files are not linked to Project or Case in the Legacy Archive
  - Two biotab files are not linked to Project or Case in the Legacy Archive
  - SDRF files are not linked to Project or Case in the Legacy Archive
  - TARGET-MDLS cases do not have disease_type or primary_site populated

Data Release 24.0

GDC Product: Data
Release Date: May 7, 2020

New updates

New project released: CGCI-HTMCP-CC - HIV+ Tumor Molecular Characterization Project - Cervical Cancer
- RNA-Seq: Alignments and gene expression levels
- miRNA-Seq: Alignments and miRNA expression levels
- WGS: Alignments
- Targeted Sequencing: Alignments
110 new cases were released from the HNSCC cohort of CPTAC-3. This includes WXS, WGS, RNA-Seq and miRNA-Seq data.
Aliquot-level WXS MAFs are now available from the following projects:
- CPTAC-2
- CPTAC-3

Complete lists of files for DR24.0 in the GDC Data Portal and the GDC Legacy Archive are found below:

Bugs Fixed Since Last Release

None

Known Issues and Workarounds

Currently the CPTAC-3 HNSCC cohort does not appear when the "Head and Neck" primary site is selected from the GDC home page. This cohort can be queried by clicking here
The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
  - TARGET-20-PASJGZ-04A-02D
  - TARGET-30-PAPTLY-01A-01D
  - TARGET-20-PAEIKD-09A-01D
  - TARGET-20-PASMYS-14A-02D
  - TARGET-20-PAMYAS-14A-02D
  - TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data Portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow. (e.g. TARGET-20-PAMYAS-04A-03R)
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
  - All microarray data and metadata
  - All sequencing analyzed data and metadata
  - 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
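
For the tumor-only annotated VCFs in which some variants may appear twice, a simple filter can drop the repeats before downstream analysis. The sketch below is an illustration under a stated assumption: two records are treated as the same variant when they share CHROM, POS, REF, and ALT. These notes do not specify how the duplicated records relate to each other, so that key is a guess, and the function name is hypothetical.

```python
from typing import Iterable, Iterator, Set, Tuple


def drop_duplicate_variants(lines: Iterable[str]) -> Iterator[str]:
    """Yield VCF lines, keeping only the first record for each variant.

    Header lines (starting with "#") and malformed short lines pass through
    unchanged. Records are keyed on the standard VCF columns
    CHROM (0), POS (1), REF (3), and ALT (4).
    """
    seen: Set[Tuple[str, str, str, str]] = set()
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            yield line
            continue
        key = (fields[0], fields[1], fields[3], fields[4])
        if key in seen:
            continue  # later copy of a variant already emitted
        seen.add(key)
        yield line
```

The function works on any iterable of text lines, so it can wrap an open file handle for an uncompressed VCF or a `gzip.open(..., "rt")` handle for a compressed one.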

Data Release 23.0

GDC Product: Data
Release Date: April 7, 2020

New updates

New data types released:
- Aliquot-level MAFs: MAF Files with mutations derived from one tumor/normal pair
  - HCMI-CMDC
  - TARGET-ALL-P2
  - TARGET-ALL-P3
  - TARGET-AML
  - TARGET-NBL
  - TARGET-OS
  - TARGET-WT
  - Note: Previously released TARGET project level MAFs can be downloaded with the following manifest: TARGET_Project-Level-MAF_GDC-Manifest.txt
- Copy number segment and estimate files from SNP6 ASCAT
  - All TCGA Projects
  - TARGET-ALL-P2
  - TARGET-AML
To accommodate users who prefer to use project-level MAFs, a MAF aggregation tool was developed by the GDC:
- Github Release
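
To illustrate what aggregating aliquot-level MAFs into a project-level table involves, here is a minimal sketch. It is not the GDC aggregation tool and does not reproduce that tool's filtering or options; the function name is hypothetical, and it assumes each MAF consists of optional comment lines starting with "#", one column header line, and then data rows.

```python
from typing import Iterable, Iterator, Optional


def aggregate_mafs(mafs: Iterable[Iterable[str]]) -> Iterator[str]:
    """Concatenate several MAFs into one, emitting the column header once.

    Each element of `mafs` is an iterable of text lines for one file.
    Comment lines and blank lines are dropped; files whose column header
    differs from the first file's header are rejected.
    """
    header: Optional[str] = None
    for maf in mafs:
        rows = (ln for ln in maf if ln.strip() and not ln.startswith("#"))
        first = next(rows, None)
        if first is None:
            continue  # file had no header or data
        if header is None:
            header = first
            yield header
        elif first != header:
            raise ValueError("MAF column headers differ; refusing to merge")
        yield from rows
```

Requiring identical headers keeps the merged table well-formed; MAFs produced by different pipeline versions may need column reconciliation first.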
New RNA-Seq data was released from HCMI-CMDC for nine additional cases.
Clinical updates were performed for the following projects
- CGCI-BLGSP
- HCMI-CMDC
- WCDT-MCRPC

Complete lists of files for DR23.0 in the GDC Data Portal and the GDC Legacy Archive are found below:

Bugs Fixed Since Last Release

The 6 HCMI-CMDC cases without clinical data now have clinical data.
Most of the "associated_entities" fields in CGCI-BLGSP were not populated correctly; this has been resolved.

Known Issues and Workarounds

The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
  - TARGET-20-PASJGZ-04A-02D
  - TARGET-30-PAPTLY-01A-01D
  - TARGET-20-PAEIKD-09A-01D
  - TARGET-20-PASMYS-14A-02D
  - TARGET-20-PAMYAS-14A-02D
  - TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data Portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow. (e.g. TARGET-20-PAMYAS-04A-03R)
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
  - All microarray data and metadata
  - All sequencing analyzed data and metadata
  - 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated

Data Release 22.0

GDC Product: Data
Release Date: January 16, 2020

New updates

New projects released:
- WCDT-MCRPC - Genomic Characterization of Metastatic Castration Resistant Prostate Cancer (phs001648)
  - RNA-Seq; WGS Data
New data from HCMI-CMDC
- 16 New Cases
- Includes WXS, WGS, and RNA-Seq data
New data from CPTAC-3
- 108 New Cases
- Includes WXS, WGS, and RNA-Seq data
- miRNA-Seq data for currently released cases

Complete lists of files for DR22.0 in the GDC Data Portal and the GDC Legacy Archive are found below:

Bugs Fixed Since Last Release

None

Known Issues and Workarounds

The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
6 of the HCMI-CMDC cases are missing clinical nodes
- HCM-CSHL-0060-C18
- HCM-CSHL-0089-C25
- HCM-CSHL-0090-C25
- HCM-CSHL-0092-C25
- HCM-CSHL-0091-C25
- HCM-CSHL-0057-C18
Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
  - TARGET-20-PASJGZ-04A-02D
  - TARGET-30-PAPTLY-01A-01D
  - TARGET-20-PAEIKD-09A-01D
  - TARGET-20-PASMYS-14A-02D
  - TARGET-20-PAMYAS-14A-02D
  - TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data Portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow. (e.g. TARGET-20-PAMYAS-04A-03R)
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
  - All microarray data and metadata
  - All sequencing analyzed data and metadata
  - 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. The impacted patient is TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
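
The miRNA end-coordinate issue listed for this release can be compensated for when reading isoform quantification files. The sketch below assumes that each coordinate string has the form `build:chrom:start-end:strand` and that "exclusive" means the reported end is one greater than the last aligned base; both are assumptions to confirm against your own files, and the function name is hypothetical.

```python
def inclusive_isoform_coords(isoform_coords: str) -> str:
    """Rewrite a coordinate string so the end position is inclusive.

    Assumed input layout: "<build>:<chrom>:<start>-<end>:<strand>",
    where <end> is exclusive (one past the last aligned base).
    """
    build, chrom, span, strand = isoform_coords.split(":")
    start, end = span.split("-")
    return f"{build}:{chrom}:{start}-{int(end) - 1}:{strand}"
```

Only the end coordinate changes; read counts and other columns are untouched, consistent with the note that quantification itself is unaffected.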

Data Release 21.0

GDC Product: Data
Release Date: December 10, 2019

New updates

New projects released:
- GENIE - AACR Project Genomics Evidence Neoplasia Information Exchange (phs001337)
  - Includes Targeted Sequencing, Transcript Fusion, Copy Number Estimate from GENIE 5.0
- AACR Project GENIE is divided by sequencing center:
  - GENIE-MSK
  - GENIE-DFCI
  - GENIE-MDA
  - GENIE-JHU
  - GENIE-UHN
  - GENIE-VICC
  - GENIE-GRCC
  - GENIE-NKI

Complete lists of files for DR21.0 in the GDC Data Portal and the GDC Legacy Archive are found below:

Bugs Fixed Since Last Release

None

Known Issues and Workarounds

The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
  - TARGET-20-PASJGZ-04A-02D
  - TARGET-30-PAPTLY-01A-01D
  - TARGET-20-PAEIKD-09A-01D
  - TARGET-20-PASMYS-14A-02D
  - TARGET-20-PAMYAS-14A-02D
  - TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data Portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow. (e.g. TARGET-20-PAMYAS-04A-03R)
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
  - All microarray data and metadata
  - All sequencing analyzed data and metadata
  - 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. The impacted patient is TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
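
Because the GENIE Copy Number Estimate files are labeled TXT but are actually tab-separated, a reader that picks its parser from the label may mis-parse them. A minimal sketch, using only the Python standard library, is to read them explicitly as tab-delimited; the function name is hypothetical and no particular column layout is assumed.

```python
import csv


def read_tab_delimited(handle):
    """Parse an open text handle as tab-separated rows.

    Returns a list of rows, each a list of string fields.
    """
    return list(csv.reader(handle, delimiter="\t"))


# Usage sketch (the path is a placeholder):
# with open("copy_number_estimate.txt", newline="") as handle:
#     rows = read_tab_delimited(handle)
```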

Data Release 20.0

GDC Product: Data
Release Date: November 11, 2019

New updates

New projects released:
- CPTAC-2 - CPTAC Proteogenomic Confirmatory Study (phs000892)
  - Includes WXS, RNA-Seq, and miRNA-Seq
- OHSU-CNL - Genomic landscape of Neutrophilic Leukemias of Ambiguous Diagnosis (phs001799)
  - Includes WXS and RNA-Seq
  - No VCF files will be included at this time. They will follow in a later release.
New TARGET data released
- TARGET-OS: WGS, WXS
- TARGET-NBL: WGS
- TARGET-AML: miRNA
CGCI-BLGSP miRNA-Seq released

Complete lists of files for DR20.0 in the GDC Data Portal and the GDC Legacy Archive are found below:

Bugs Fixed Since Last Release

None

Known Issues and Workarounds

Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
  - TARGET-20-PASJGZ-04A-02D
  - TARGET-30-PAPTLY-01A-01D
  - TARGET-20-PAEIKD-09A-01D
  - TARGET-20-PASMYS-14A-02D
  - TARGET-20-PAMYAS-14A-02D
  - TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data Portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow. (e.g. TARGET-20-PAMYAS-04A-03R)
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
  - All microarray data and metadata
  - All sequencing analyzed data and metadata
  - 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. The impacted patient is TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
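
For the tumor-only annotated VCFs in which some variants may appear twice, one possible workaround is to drop repeated records while reading. The sketch below treats two records as duplicates when CHROM, POS, REF, and ALT all match. That definition is an assumption, since these notes do not say how the repeated records differ, and the function name is hypothetical.

```python
def drop_duplicate_variants(lines):
    """Yield VCF lines, skipping records whose variant was already seen.

    Header lines (starting with '#') pass through unchanged. A variant is
    keyed on the standard VCF columns CHROM, POS, REF, and ALT.
    """
    seen = set()
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        key = (fields[0], fields[1], fields[3], fields[4])
        if key in seen:
            continue  # repeated variant: keep only the first occurrence
        seen.add(key)
        yield line
```

Keeping the first occurrence is an arbitrary choice; if the repeated records carry different annotations, they may need to be merged instead.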

Data Release 19.1

GDC Product: Data
Release Date: November 6, 2019

New updates

The following cases are no longer available in the GDC Data Portal. They had no data files associated with them in DR 19 so there are no changes in file availability in this release.
- TARGET-00-NAAENF
- TARGET-00-NAAENG
- TARGET-00-NAAENH
- TARGET-00-NAAENI
- TARGET-00-NAAENJ
- TARGET-00-NAAENK
- TARGET-00-NAAENL
- TARGET-00-NAAENM
- TARGET-00-NAAENN
- TARGET-00-NAAENP
- TARGET-00-NAAENR
- TARGET-00-NAAEPE

Complete lists of files for DR19.1 in the GDC Data Portal and the GDC Legacy Archive are found below:

Bugs Fixed Since Last Release

None

Known Issues and Workarounds

Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
  - TARGET-20-PASJGZ-04A-02D
  - TARGET-30-PAPTLY-01A-01D
  - TARGET-20-PAEIKD-09A-01D
  - TARGET-20-PASMYS-14A-02D
  - TARGET-20-PAMYAS-14A-02D
  - TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data Portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow. (e.g. TARGET-20-PAMYAS-04A-03R)
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
  - All microarray data and metadata
  - All sequencing analyzed data and metadata
  - 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. The impacted patient is TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
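
The sample-count discrepancy between public MAF files from different variant calling pipelines can be inspected by comparing the sets of tumor sample barcodes in two files. The sketch below assumes tab-separated MAF text with optional `#` comment lines and the standard `Tumor_Sample_Barcode` column; the function names are hypothetical.

```python
import csv


def maf_samples(lines):
    """Return the set of Tumor_Sample_Barcode values in MAF text lines."""
    data_lines = (line for line in lines if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t")
    return {row["Tumor_Sample_Barcode"] for row in reader}


def samples_missing_from(maf_a_lines, maf_b_lines):
    """Samples present in MAF A but absent from MAF B."""
    return maf_samples(maf_a_lines) - maf_samples(maf_b_lines)
```

A sample reported as missing here was not necessarily left unanalyzed; per the note above, it may simply have had no PASS variants in that pipeline.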

Data Release 19.0

GDC Product: Data
Release Date: September 17, 2019

New updates

New projects released:
- BEATAML1.0-COHORT - Functional Genomic Landscape of Acute Myeloid Leukemia (phs001657)
  - Includes WXS and RNA-Seq
New TARGET data released
- TARGET-ALL-P1 RNA-Seq
- TARGET-ALL-P2 RNA-Seq, WXS, and miRNA-Seq
- TARGET-ALL-P3 miRNA-Seq
- TARGET-AML WXS, WGS, and miRNA-Seq
- TARGET-NBL WXS and RNA-Seq
- TARGET-RT WGS and RNA-Seq
- TARGET-WT WGS, WXS, and RNA-Seq
Additional CGCI-BLGSP WGS data released
Pindel VCFs released for TARGET-ALL-P2, TARGET-ALL-P3, TARGET-AML, TARGET-NBL, TARGET-WT, MMRF-COMMPASS, HCMI-CMDC, and CPTAC-3
Disease-specific staging properties for many projects were released

Complete lists of files for DR19.0 in the GDC Data Portal and the GDC Legacy Archive are found below:

Bugs Fixed Since Last Release

None

Known Issues and Workarounds

Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
  - TARGET-20-PASJGZ-04A-02D
  - TARGET-30-PAPTLY-01A-01D
  - TARGET-20-PAEIKD-09A-01D
  - TARGET-20-PASMYS-14A-02D
  - TARGET-20-PAMYAS-14A-02D
  - TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data Portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow. (e.g. TARGET-20-PAMYAS-04A-03R)
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
  - All microarray data and metadata
  - All sequencing analyzed data and metadata
  - 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. The impacted patient is TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
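
The workflow search for tumor-only annotated VCFs described under Known Issues might also be done programmatically. The sketch below only builds query parameters and makes no network call. It assumes that the GDC API files endpoint accepts a JSON `filters` parameter and that the workflow name is exposed as the field `analysis.workflow_type`; both assumptions, the endpoint URL, and the hypothetical helper names should be checked against the GDC API documentation.

```python
import json

# Assumed endpoint; verify against the GDC API documentation.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def tumor_only_annotated_vcf_filter():
    """Filter matching files produced by the workflow named in the note."""
    return {
        "op": "in",
        "content": {
            "field": "analysis.workflow_type",  # assumed field name
            "value": ["GATK4 MuTect2 Annotation"],
        },
    }


def build_query_params(size=10):
    """Assemble query-string parameters for a files search."""
    return {
        "filters": json.dumps(tumor_only_annotated_vcf_filter()),
        "fields": "file_id,file_name",
        "size": str(size),
    }
```

The resulting dictionary can be passed as query parameters to any HTTP client of your choice.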

Data Release 18.0

GDC Product: Data
Release Date: July 8, 2019

New updates

New Projects released
- MMRF-COMMPASS - Multiple Myeloma CoMMpass Study (phs000748)
  - Includes WGS, WXS, and RNA-Seq
- ORGANOID-PANCREATIC - Pancreas Cancer Organoid Profiling (phs001611)
  - Includes WGS, WXS, and RNA-Seq
- TARGET-ALL-P1 - Acute Lymphoblastic Leukemia - Phase I (phs000218)
  - Includes WGS
- TARGET-ALL-P2 - Acute Lymphoblastic Leukemia - Phase II (phs000218)
  - Includes WGS
- CGCI-BLGSP - Burkitt Lymphoma Genome Sequencing Project (phs000235)
  - Includes WGS and RNA-Seq
New versions of RNA-Seq data for TARGET-ALL-P3
New RNA-Seq data for TARGET-CCSK
New RNA-Seq data for TARGET-OS

Complete lists of files for DR18.0 in the GDC Data Portal and the GDC Legacy Archive are found below:

Bugs Fixed Since Last Release

New versions of RNA-Seq data for TARGET-ALL-P3 resolve issue with missing reads from BAM files.

Known Issues and Workarounds

Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
  - TARGET-20-PASJGZ-04A-02D
  - TARGET-30-PAPTLY-01A-01D
  - TARGET-20-PAEIKD-09A-01D
  - TARGET-20-PASMYS-14A-02D
  - TARGET-20-PAMYAS-14A-02D
  - TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow. (e.g. TARGET-20-PAMYAS-04A-03R)
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
  - All microarray data and metadata
  - All sequencing analyzed data and metadata
  - 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
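
For the duplicated variants noted above in some tumor-only annotated VCFs, one possible way to collapse the repeats is to key each record on CHROM, POS, REF, and ALT and keep only the first occurrence. The sketch below is an illustration under that assumption, not a GDC-provided tool: the function name and the choice of key are hypothetical, and it operates on VCF lines already read into memory.

```python
from typing import Iterable, List


def drop_duplicate_variants(vcf_lines: Iterable[str]) -> List[str]:
    """Keep header lines and the first record seen for each (CHROM, POS, REF, ALT)."""
    seen = set()
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            # Meta-information lines and the column header pass through unchanged.
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            # Not a complete variant record (e.g. a blank line); leave it alone.
            kept.append(line)
            continue
        # VCF column order: CHROM, POS, ID, REF, ALT, ...
        key = (fields[0], fields[1], fields[3], fields[4])
        if key in seen:
            continue
        seen.add(key)
        kept.append(line)
    return kept
```

Records that share a position but differ in REF or ALT are treated as distinct variants and are all kept.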

Data Release 17.1

GDC Product: Data
Release Date: June 12, 2019

New updates

Rebuilt indices for NCICCR-DLBCL and CTSP-DLBCL1. Fewer files viewable in GDC Data Portal or API.

Complete lists of files for DR17.1 in the GDC Data Portal and the GDC Legacy Archive are found below:

Bugs Fixed Since Last Release

None

Known Issues and Workarounds

The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
TARGET projects
- TARGET ALL-P3 RNA-Seq results from DR14 are missing ~18% of reads. Downsampling appears to be completely random and count files have a very high correlation (>99.99%) with complete data. New versions of these files will be created that include the entire set of reads.
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
  - TARGET-20-PASJGZ-04A-02D
  - TARGET-30-PAPTLY-01A-01D
  - TARGET-20-PAEIKD-09A-01D
  - TARGET-20-PASMYS-14A-02D
  - TARGET-20-PAMYAS-14A-02D
  - TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow. (e.g. TARGET-20-PAMYAS-04A-03R)
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
  - All microarray data and metadata
  - All sequencing analyzed data and metadata
  - 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
Issues in the Legacy Archive
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
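
For the miRNA isoform coordinate issue listed above, a reported end coordinate can be converted to an inclusive one by subtracting 1. The sketch below assumes the coordinate field has the form build:chromosome:start-end:strand (for example hg38:chr9:94175962-94175984:+). That layout and the function name are assumptions for illustration, so check them against an actual isoform quantification file before relying on this.

```python
def inclusive_isoform_coords(coords: str) -> str:
    """Rewrite 'build:chrom:start-end:strand' so that the end coordinate is inclusive.

    Assumes the reported end is exclusive, i.e. one past the last aligned base.
    """
    build, chrom, span, strand = coords.split(":")
    start, end = span.split("-")
    return f"{build}:{chrom}:{start}-{int(end) - 1}:{strand}"
```

Only the end coordinate changes; the start, strand, and read counts are left as reported.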

Data Release 17.0

GDC Product: Data
Release Date: June 5, 2019

New updates

New Projects released
- HCMI-CMDC - NCI Cancer Model Development for the Human Cancer Model Initiative (HCMI) (phs001486)
- BEATAML1.0-CRENOLANIB - Clinical Resistance to Crenolanib in Acute Myeloid Leukemia Due to Diverse Molecular Mechanisms (phs001628)
RNA-Seq data for NCICCR-DLBCL and CTSP-DLBCL1 are released
ATAC-Seq data for TCGA projects are released
CPTAC-3 RNA-Seq data are released
Clinical data updates for TCGA - to see parser code updates review API v1.20 release notes
Clinical data updates for other projects to accommodate migration of vital_status, days_to_birth, and days_to_death from the Diagnosis to the Demographic node

Complete lists of files for DR17.0 in the GDC Data Portal and the GDC Legacy Archive are found below:

Bugs Fixed Since Last Release

None

Known Issues and Workarounds

The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
TARGET projects
- TARGET ALL-P3 RNA-Seq results from DR14 are missing ~18% of reads. Downsampling appears to be completely random and count files have a very high correlation (>99.99%) with complete data. New versions of these files will be created that include the entire set of reads.
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
  - TARGET-20-PASJGZ-04A-02D
  - TARGET-30-PAPTLY-01A-01D
  - TARGET-20-PAEIKD-09A-01D
  - TARGET-20-PASMYS-14A-02D
  - TARGET-20-PAMYAS-14A-02D
  - TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow. (e.g. TARGET-20-PAMYAS-04A-03R)
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
  - All microarray data and metadata
  - All sequencing analyzed data and metadata
  - 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
Issues in the Legacy Archive
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
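
The barcode workaround for mislabeled TARGET bone marrow samples described above can be sketched as follows. This is a minimal illustration that assumes barcodes follow the hyphen-separated layout of the example TARGET-20-PAMYAS-04A-03R, with the two-digit sample type code at the start of the fourth field. The function names are hypothetical.

```python
def target_sample_type_code(barcode: str) -> str:
    """Return the two-digit sample type code from a TARGET barcode."""
    parts = barcode.split("-")
    # e.g. TARGET-20-PAMYAS-04A-03R -> ['TARGET', '20', 'PAMYAS', '04A', '03R']
    if len(parts) < 4 or parts[0] != "TARGET":
        raise ValueError(f"not a TARGET sample or aliquot barcode: {barcode!r}")
    return parts[3][:2]


def is_recurrent_bone_marrow(barcode: str) -> bool:
    """True when the barcode carries code 04 (Recurrent Blood Derived Cancer - Bone Marrow)."""
    return target_sample_type_code(barcode) == "04"
```

Applied to the example above, is_recurrent_bone_marrow("TARGET-20-PAMYAS-04A-03R") returns True regardless of the sample_type label stored on the sample.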

Data Release 16.0

GDC Product: Data
Release Date: March 26, 2019

New updates

The CPTAC-3 project (phs001287) is released with WXS and WGS data. RNA-Seq will be released at a later date. Additional project details can be found on the CPTAC Data Source page.
TARGET-ALL-P3 (phs000218) WGS BAM files are released.
VAREPOP-APOLLO (phs001374) VCF files are released.

Complete lists of files for DR16.0 in the GDC Data Portal and the GDC Legacy Archive are found below:

Bugs Fixed Since Last Release

None

Known Issues and Workarounds

TARGET ALL-P3 RNA-Seq results from DR14 are missing ~18% of reads. Downsampling appears to be completely random and count files have a very high correlation (>99.99%) with complete data. New versions of these files will be created that include the entire set of reads.
The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
- TARGET-20-PASJGZ-04A-02D
- TARGET-30-PAPTLY-01A-01D
- TARGET-20-PAEIKD-09A-01D
- TARGET-20-PASMYS-14A-02D
- TARGET-20-PAMYAS-14A-02D
- TARGET-10-PAPZST-09A-01D
Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data portal
Two tissue slide images are unavailable for download from GDC Data Portal
The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.

There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
There are two cases with identical submitter_id TARGET-10-PARUYU
TARGET-MDLS cases do not have disease_type or primary_site populated
Some TARGET cases are missing days_to_last_follow_up
Some TARGET cases are missing age_at_diagnosis
Some TARGET files are not connected to all related aliquots
Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow. (e.g. TARGET-20-PAMYAS-04A-03R)
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
No data from TARGET-MDLS is available.
Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
SDF Files are not linked to Project or Case in the Legacy Archive
Two biotab files are not linked to Project or Case in the Legacy Archive
SDRF files are not linked to Project or Case in the Legacy Archive
Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
Tumor grade property is not populated
Progression_or_recurrence property is not populated
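
To see which samples account for the differing sample counts between public MAF files noted above, the sample sets of two MAFs can be compared. The sketch below assumes a tab-delimited MAF with optional leading comment lines starting with "#" and a Tumor_Sample_Barcode column. The function name is hypothetical, and the input is any iterable of lines, such as an open file handle.

```python
import csv
from typing import Iterable, Set


def samples_in_maf(maf_lines: Iterable[str]) -> Set[str]:
    """Collect the distinct Tumor_Sample_Barcode values from MAF lines."""
    # Skip comment lines so the first remaining line is the column header.
    rows = (line for line in maf_lines if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t")
    return {row["Tumor_Sample_Barcode"] for row in reader}
```

Taking the set difference, for example samples_in_maf(maf_a) - samples_in_maf(maf_b), lists the samples present in one pipeline's MAF but absent from the other, which per the note above would be samples with no PASS variants in the second pipeline.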

Data Release 15.0

GDC Product: Data
Release Date: February 20, 2019

New updates

TARGET-ALL-P3 is now available and includes RNA-Seq and WXS data.
New RNA-Seq workflow is now being utilized for new projects. More details can be found in the RNA-Seq pipeline documentation.
New tumor only variant calling pipeline is now being utilized for new projects. More details can be found in the Tumor only pipeline documentation.

Complete lists of files for DR15.0 in the GDC Data Portal and the GDC Legacy Archive are found below:

Bugs Fixed Since Last Release

None

Known Issues and Workarounds

The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
- TARGET-20-PASJGZ-04A-02D
- TARGET-30-PAPTLY-01A-01D
- TARGET-20-PAEIKD-09A-01D
- TARGET-20-PASMYS-14A-02D
- TARGET-20-PAMYAS-14A-02D
- TARGET-10-PAPZST-09A-01D
Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data portal
Two tissue slide images are unavailable for download from GDC Data Portal
The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.

There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
There are two cases with identical submitter_id TARGET-10-PARUYU
TARGET-MDLS cases do not have disease_type or primary_site populated
Some TARGET cases are missing days_to_last_follow_up
Some TARGET cases are missing age_at_diagnosis
Some TARGET files are not connected to all related aliquots
Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow. (e.g. TARGET-20-PAMYAS-04A-03R)
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
No data from TARGET-MDLS is available.
Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
SDF Files are not linked to Project or Case in the Legacy Archive
Two biotab files are not linked to Project or Case in the Legacy Archive
SDRF files are not linked to Project or Case in the Legacy Archive
Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
Tumor grade property is not populated
Progression_or_recurrence property is not populated

Data Release 14.0

GDC Product: Data
Release Date: December 18, 2018

New updates

Copy Number Variation (CNV) data derived from GISTIC2 results are now available for download for TCGA projects
New miRNA data available for 181 aliquots for TARGET and TCGA
Released two SNP6 files (6cd4ef5e-324a-4ace-8779-7a33bd559c83, dfa89ee9-6ee5-460b-bd58-b5ca0e9cb7ac)
New versions of TCGA biospecimen supplements are available
Updated primary site for TCGA-AG-3881 to Unknown
8 New Harmonized WGS BAM files for TARGET-WT, TARGET-NBL, TARGET-AML added to the portal

Complete lists of files for DR14.0 in the GDC Data Portal and the GDC Legacy Archive are found below:

Bugs Fixed Since Last Release

FM-AD clinical and biospecimen supplements are now correctly labeled as TSV rather than XLSX

Known Issues and Workarounds

TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
- TARGET-20-PASJGZ-04A-02D
- TARGET-30-PAPTLY-01A-01D
- TARGET-20-PAEIKD-09A-01D
- TARGET-20-PASMYS-14A-02D
- TARGET-20-PAMYAS-14A-02D
- TARGET-10-PAPZST-09A-01D
Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data portal
Two tissue slide images are unavailable for download from GDC Data Portal
The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.

There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
There are two cases with identical submitter_id TARGET-10-PARUYU
TARGET-MDLS cases do not have disease_type or primary_site populated
Some TARGET cases are missing days_to_last_follow_up
Some TARGET cases are missing age_at_diagnosis
Some TARGET files are not connected to all related aliquots
Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow. (e.g. TARGET-20-PAMYAS-04A-03R)
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
No data from TARGET-MDLS is available.
Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
SDF Files are not linked to Project or Case in the Legacy Archive
Two biotab files are not linked to Project or Case in the Legacy Archive
SDRF files are not linked to Project or Case in the Legacy Archive
Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
Tumor grade property is not populated
Progression_or_recurrence property is not populated

Data Release 13.0

GDC Product: Data
Release Date: September 27, 2018

New updates

Three new projects are released to the GDC: VAREPOP-APOLLO (phs001374), CTSP-DLBCL1 (phs001184), and NCICCR-DLBCL (phs001444)
TARGET WGS alignments are released. VCFs will be provided in a later release
Clinical data was harmonized with ICD-O-3 terminology for TCGA properties case.primary_site, case.disease_type, diagnosis.primary_diagnosis, diagnosis.site_of_resection_or_biopsy, diagnosis.tissue_or_organ_of_origin
Redaction annotations applied to 11 aliquots in TCGA-DLBC
Redaction annotations applied to incorrectly trimmed miRNA file in the Legacy Archive

Complete lists of files for DR13.0 in the GDC Data Portal and the GDC Legacy Archive are found below:

Bugs Fixed Since Last Release

253 Copy Number Segment and Masked Copy Number Segment files were released. These were skipped in DR 12.0
36 Diagnostic TCGA slides were released. They were skipped in DR 12.0

Known Issues and Workarounds

506 Copy Number Segment and 36 Slide Image files are designated as controlled-access on the GDC Data Portal. These files are actually open-access and will be downloadable without a token using this manifest.
2 Copy Number Segment files from TCGA-TGCT do not appear on the GDC Portal. They can be downloaded with the Data Transfer Tool using the following UUIDs:
- 6cd4ef5e-324a-4ace-8779-7a33bd559c83 - RAMPS_p_TCGA_Batch_430_NSP_GenomeWideSNP_6_E07_1538238.nocnv_grch38.seg.v2.txt
- dfa89ee9-6ee5-460b-bd58-b5ca0e9cb7ac - RAMPS_p_TCGA_Batch_430_NSP_GenomeWideSNP_6_E07_1538238.grch38.seg.v2.txt
TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
- TARGET-20-PASJGZ-04A-02D
- TARGET-30-PAPTLY-01A-01D
- TARGET-20-PAEIKD-09A-01D
- TARGET-20-PASMYS-14A-02D
- TARGET-20-PAMYAS-14A-02D
- TARGET-10-PAPZST-09A-01D
Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data portal
Two tissue slide images are unavailable for download from GDC Data Portal
The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.

There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
There are two cases with identical submitter_id TARGET-10-PARUYU
TARGET-MDLS cases do not have disease_type or primary_site populated
Some TARGET cases are missing days_to_last_follow_up
Some TARGET cases are missing age_at_diagnosis
Some TARGET files are not connected to all related aliquots
Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow. (e.g. TARGET-20-PAMYAS-04A-03R)
FM-AD clinical and biospecimen supplement files have incorrect data format. They are listed as XLSX, but are in fact TSV files.
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
No data from TARGET-MDLS is available.
Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
SDF Files are not linked to Project or Case in the Legacy Archive
Two biotab files are not linked to Project or Case in the Legacy Archive
SDRF files are not linked to Project or Case in the Legacy Archive
Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
Tumor grade property is not populated
Progression_or_recurrence property is not populated
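
The barcode workaround for mislabeled TARGET Recurrent Blood Derived Cancer - Bone Marrow samples noted above could be applied programmatically. The following is a minimal Python sketch, assuming barcodes follow the hyphen-separated layout of the example TARGET-20-PAMYAS-04A-03R, with the two-digit sample code at the start of the fourth field. The function name is hypothetical and is not part of any GDC tool.

```python
def is_recurrent_bone_marrow(barcode: str) -> bool:
    """Return True if the barcode's sample code is 04.

    Assumes the layout PROGRAM-PROJECT-CASE-SAMPLE[-...], as in
    TARGET-20-PAMYAS-04A-03R, where the fourth field starts with
    the two-digit sample code.
    """
    fields = barcode.split("-")
    if len(fields) < 4:
        return False  # too short to carry a sample code
    return fields[3][:2] == "04"


# Example barcode taken from the known issue above.
print(is_recurrent_bone_marrow("TARGET-20-PAMYAS-04A-03R"))  # True
```

Checking the barcode rather than the sample_type value avoids relying on the mislabeled field.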

Data Release 12.0

GDC Product: Data
Release Date: June 13, 2018

New updates

Updated clinical and biospecimen XML files for TCGA cases are available in the GDC Data Portal. Equivalent Legacy Archive files may no longer be up to date.
All biospecimen and clinical supplement files for TCGA projects formerly only found in the Legacy Archive have been updated and transferred to the GDC Data Portal. Equivalent Legacy Archive files and metadata retrieved from the API may no longer be up to date.
Diagnostic slides from TCGA are now available in the GDC Data Portal and Slide Image Viewer. They were formerly only available in the Legacy Archive.
Updated Copy Number Segment and Masked Copy Number Segment files are now available. These were generated using an improved mapping of hg38 coordinates for the Affymetrix SNP6.0 probe set.
VCF files containing SNVs produced from TARGET WGS CGI data are available. The variant calls were initially produced by CGI and lifted over to hg38.

Updated files for this release are listed here. Complete lists of files for DR12.0 are available for the GDC Data Portal here and the GDC Legacy Archive here.

Bugs Fixed Since Last Release

TARGET NBL RNA-Seq data is now associated with the correct aliquot.

Known Issues and Workarounds

Some Copy Number Segment and Masked Copy Number Segment files were not replaced in DR12.0. 253 files remain to be replaced in a later release.
Some miRNA files containing QC-failed reads were not replaced in DR11.0. Files for 361 aliquots remain to be replaced in a later release.
74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
36 Diagnostic TCGA slides are not yet available in the active GDC Portal. They are still available in the GDC Legacy Archive.
11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data Portal
Two tissue slide images are unavailable for download from GDC Data Portal
The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.

There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
There are two cases with identical submitter_id TARGET-10-PARUYU
TARGET-MDLS cases do not have disease_type or primary_site populated
Some TARGET cases are missing days_to_last_follow_up
Some TARGET cases are missing age_at_diagnosis
Some TARGET files are not connected to all related aliquots
Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow. (e.g. TARGET-20-PAMYAS-04A-03R)
FM-AD clinical and biospecimen supplement files have incorrect data format. They are listed as XLSX, but are in fact TSV files.
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
There are 11 cases in project TCGA-DLBC that are known to have incorrect WXS data in the GDC Data Portal. Impacted cases are listed below. This affects the BAMs and VCFs associated with these cases in the GDC Data Portal. Corrected BAMs can be found in the GDC Legacy Archive. Variants from affected aliquots appear in the protected MAFs with GDC_FILTER=ContEst to indicate a sample contamination problem, but are removed during the generation of the Somatic MAF file. In a later release we will supply corrected BAM, VCF, and MAF files for these cases. In the meantime, we advise you not to use any of the WXS files associated with these cases in the GDC Data Portal. A list of these files can be found here.
- TCGA-FF-8062
- TCGA-FM-8000
- TCGA-G8-6324
- TCGA-G8-6325
- TCGA-G8-6326
- TCGA-G8-6906
- TCGA-G8-6907
- TCGA-G8-6909
- TCGA-G8-6914
- TCGA-GR-7351
- TCGA-GR-7353
Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
No data from TARGET-MDLS is available.
Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
SDF Files are not linked to Project or Case in the Legacy Archive
Two biotab files are not linked to Project or Case in the Legacy Archive
SDRF files are not linked to Project or Case in the Legacy Archive
Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
Tumor grade property is not populated
Progression_or_recurrence property is not populated
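
One way to follow the TCGA-DLBC advisory above is to drop records belonging to the 11 listed cases before analysis. The Python sketch below assumes that each record is identified by a TCGA-style barcode whose first three hyphen-separated fields form the case ID. The helper names and the example barcodes are hypothetical and do not correspond to real GDC file records.

```python
# Case IDs copied from the TCGA-DLBC known issue above.
AFFECTED_DLBC_CASES = {
    "TCGA-FF-8062", "TCGA-FM-8000", "TCGA-G8-6324", "TCGA-G8-6325",
    "TCGA-G8-6326", "TCGA-G8-6906", "TCGA-G8-6907", "TCGA-G8-6909",
    "TCGA-G8-6914", "TCGA-GR-7351", "TCGA-GR-7353",
}


def case_id_from_barcode(barcode: str) -> str:
    """Return the first three hyphen-separated fields (assumed case ID)."""
    return "-".join(barcode.split("-")[:3])


def drop_affected(barcodes):
    """Keep only barcodes whose case is not on the affected list."""
    return [b for b in barcodes
            if case_id_from_barcode(b) not in AFFECTED_DLBC_CASES]


# Illustrative barcodes only; the suffixes are made up.
example = ["TCGA-FF-8062-01A-XX", "TCGA-ZZ-0000-01A-XX"]
print(drop_affected(example))  # ['TCGA-ZZ-0000-01A-XX']
```

The downloadable list of affected files referenced in the advisory remains the authoritative source; this filter is only a convenience for case-level exclusion.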

Data Release 11.0

GDC Product: Data
Release Date: May 21, 2018

New updates

Updated miRNA files to remove QCFail reads. This included all BAM and downstream count files.
TCGA tissue slide images are now available in the GDC Data Portal. Previously these were found only in the Legacy Archive.

Updated files for this release are listed here. Complete lists of files for DR11.0 are available for the GDC Data Portal here and the GDC Legacy Archive here.

Bugs Fixed Since Last Release

N/A

Known Issues and Workarounds

Two tissue slide images are unavailable for download from GDC Data Portal
RNA-Seq files for TARGET-NBL are attached to the incorrect aliquot. The BAM files contain the correct information in their header but the connection in the GDC to read groups and aliquots is incorrect. The linked file below contains a mapping between the aliquots where files are currently associated and the aliquots where they should instead be associated (mapping file).
The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 were not replaced in DR10.0 and thus do not contain indels. However, the indels from this aliquot can be found in the MAF files and are displayed in the Exploration section in the Data Portal. These VCF files will be replaced in a later release.

There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
There are two cases with identical submitter_id TARGET-10-PARUYU
TARGET-MDLS cases do not have disease_type or primary_site populated
Some TARGET cases are missing days_to_last_follow_up
Some TARGET cases are missing age_at_diagnosis
Some TARGET files are not connected to all related aliquots
miRNA alignments include QC failed reads.
Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow. (e.g. TARGET-20-PAMYAS-04A-03R)
FM-AD clinical and biospecimen supplement files have incorrect data format. They are listed as XLSX, but are in fact TSV files.
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
There are 11 cases in project TCGA-DLBC that are known to have incorrect WXS data in the GDC Data Portal. Impacted cases are listed below. This affects the BAMs and VCFs associated with these cases in the GDC Data Portal. Corrected BAMs can be found in the GDC Legacy Archive. Variants from affected aliquots appear in the protected MAFs with GDC_FILTER=ContEst to indicate a sample contamination problem, but are removed during the generation of the Somatic MAF file. In a later release we will supply corrected BAM, VCF, and MAF files for these cases. In the meantime, we advise you not to use any of the WXS files associated with these cases in the GDC Data Portal. A list of these files can be found here.
- TCGA-FF-8062
- TCGA-FM-8000
- TCGA-G8-6324
- TCGA-G8-6325
- TCGA-G8-6326
- TCGA-G8-6906
- TCGA-G8-6907
- TCGA-G8-6909
- TCGA-G8-6914
- TCGA-GR-7351
- TCGA-GR-7353
Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
No data from TARGET-MDLS is available.
Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
SDF Files are not linked to Project or Case in the Legacy Archive
Two biotab files are not linked to Project or Case in the Legacy Archive
SDRF files are not linked to Project or Case in the Legacy Archive
Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
Tumor grade property is not populated
Progression_or_recurrence property is not populated
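
Because the FM-AD clinical and biospecimen supplement files noted above are listed as XLSX but are in fact TSV, they can be read as tab-delimited text rather than with a spreadsheet library. The Python sketch below uses only the standard csv module. The function names, the column names, and the sample content are made up for illustration and do not reflect the real supplement layout.

```python
import csv
import io


def read_tsv_rows(handle):
    """Parse tab-separated text from an open text handle into dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def read_supplement(path):
    """Read a supplement file as plain text, whatever format it is labeled as."""
    with open(path, newline="", encoding="utf-8") as handle:
        return read_tsv_rows(handle)


# Made-up two-column content standing in for a supplement file.
sample = io.StringIO("case_id\tvalue\nCASE-1\t42\n")
rows = read_tsv_rows(sample)
print(rows[0]["value"])  # 42
```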

Data Release 10.1

GDC Product: Data
Release Date: February 15, 2018

New updates

Updated FM-AD clinical data to conform with Data Dictionary release v1.11

Bugs Fixed Since Last Release

None

Known Issues and Workarounds

RNA-Seq files for TARGET-NBL are attached to the incorrect aliquot. The BAM files contain the correct information in their header but the connection in the GDC to read groups and aliquots is incorrect. The linked file below contains a mapping between the aliquots where files are currently associated and the aliquots where they should instead be associated (mapping file).
The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 were not replaced in DR10.0 and thus do not contain indels. However, the indels from this aliquot can be found in the MAF files and are displayed in the Exploration section in the Data Portal. These VCF files will be replaced in a later release.

There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
There are two cases with identical submitter_id TARGET-10-PARUYU
TARGET-MDLS cases do not have disease_type or primary_site populated
Some TARGET cases are missing days_to_last_follow_up
Some TARGET cases are missing age_at_diagnosis
Some TARGET files are not connected to all related aliquots
miRNA alignments include QC failed reads.
Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow. (e.g. TARGET-20-PAMYAS-04A-03R)
FM-AD clinical and biospecimen supplement files have incorrect data format. They are listed as XLSX, but are in fact TSV files.
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
There are 11 cases in project TCGA-DLBC that are known to have incorrect WXS data in the GDC Data Portal. Impacted cases are listed below. This affects the BAMs and VCFs associated with these cases in the GDC Data Portal. Corrected BAMs can be found in the GDC Legacy Archive. Variants from affected aliquots appear in the protected MAFs with GDC_FILTER=ContEst to indicate a sample contamination problem, but are removed during the generation of the Somatic MAF file. In a later release we will supply corrected BAM, VCF, and MAF files for these cases. In the meantime, we advise you not to use any of the WXS files associated with these cases in the GDC Data Portal. A list of these files can be found here.
- TCGA-FF-8062
- TCGA-FM-8000
- TCGA-G8-6324
- TCGA-G8-6325
- TCGA-G8-6326
- TCGA-G8-6906
- TCGA-G8-6907
- TCGA-G8-6909
- TCGA-G8-6914
- TCGA-GR-7351
- TCGA-GR-7353
Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
No data from TARGET-MDLS is available.
Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
SDF Files are not linked to Project or Case in the Legacy Archive
Two biotab files are not linked to Project or Case in the Legacy Archive
SDRF files are not linked to Project or Case in the Legacy Archive
Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
Tumor grade property is not populated
Progression_or_recurrence property is not populated
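
The sample-count discrepancy between public MAF files from different pipelines, described above, follows from samples being omitted when they have no PASS variants. The Python sketch below illustrates the effect on made-up rows. It assumes each variant row carries a sample identifier and a filter value under the column names Tumor_Sample_Barcode and FILTER; those names, the sample names, and the non-PASS filter value are assumptions for illustration.

```python
def samples_with_pass(rows):
    """Return the set of samples that have at least one PASS variant."""
    return {r["Tumor_Sample_Barcode"] for r in rows if r["FILTER"] == "PASS"}


def samples_omitted(rows):
    """Return samples that would be absent from a PASS-only file."""
    all_samples = {r["Tumor_Sample_Barcode"] for r in rows}
    return all_samples - samples_with_pass(rows)


# Made-up calls for the same two samples from two hypothetical pipelines.
pipeline_a = [
    {"Tumor_Sample_Barcode": "S1", "FILTER": "PASS"},
    {"Tumor_Sample_Barcode": "S2", "FILTER": "PASS"},
]
pipeline_b = [
    {"Tumor_Sample_Barcode": "S1", "FILTER": "PASS"},
    {"Tumor_Sample_Barcode": "S2", "FILTER": "not_pass"},
]

print(sorted(samples_with_pass(pipeline_a)))  # ['S1', 'S2']
print(sorted(samples_with_pass(pipeline_b)))  # ['S1']
print(sorted(samples_omitted(pipeline_b)))    # ['S2']
```

In this toy case both pipelines processed the same two samples, yet a PASS-only file from the second pipeline would contain only one of them.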

Data Release 10.0

GDC Product: Data
Release Date: December 21, 2017

New updates

New TARGET files for all projects
TARGET updates for clinical and biospecimen data
Replaced corrupted .bai files
Updated TCGA and TARGET MAF files to include VarScan2 indels and more information in the all_effects column
Updated VarScan VCF files

Updated files for this release are listed here. Complete lists of files for DR10.0 are available for the GDC Data Portal here and the GDC Legacy Archive here.

Bugs Fixed Since Last Release

None

Known Issues and Workarounds

The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 were not replaced in DR10.0 and thus do not contain indels. However, the indels from this aliquot can be found in the MAF files and are displayed in the Exploration section in the Data Portal. These VCF files will be replaced in a later release.

There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
There are two cases with identical submitter_id TARGET-10-PARUYU
TARGET-MDLS cases do not have disease_type or primary_site populated
Some TARGET cases are missing days_to_last_follow_up
Some TARGET cases are missing age_at_diagnosis
Some TARGET files are not connected to all related aliquots
miRNA alignments include QC failed reads.
Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow. (e.g. TARGET-20-PAMYAS-04A-03R)
FM-AD clinical and biospecimen supplement files have incorrect data format. They are listed as XLSX, but are in fact TSV files.
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
There are 11 cases in project TCGA-DLBC that are known to have incorrect WXS data in the GDC Data Portal. Impacted cases are listed below. This affects the BAMs and VCFs associated with these cases in the GDC Data Portal. Corrected BAMs can be found in the GDC Legacy Archive. Variants from affected aliquots appear in the protected MAFs with GDC_FILTER=ContEst to indicate a sample contamination problem, but are removed during the generation of the Somatic MAF file. In a later release we will supply corrected BAM, VCF, and MAF files for these cases. In the meantime, we advise you not to use any of the WXS files associated with these cases in the GDC Data Portal. A list of these files can be found here.
- TCGA-FF-8062
- TCGA-FM-8000
- TCGA-G8-6324
- TCGA-G8-6325
- TCGA-G8-6326
- TCGA-G8-6906
- TCGA-G8-6907
- TCGA-G8-6909
- TCGA-G8-6914
- TCGA-GR-7351
- TCGA-GR-7353
Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
No data from TARGET-MDLS is available.
Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
SDF Files are not linked to Project or Case in the Legacy Archive
Two biotab files are not linked to Project or Case in the Legacy Archive
SDRF files are not linked to Project or Case in the Legacy Archive
Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
Tumor grade property is not populated
Progression_or_recurrence property is not populated

Data Release 9.0

GDC Product: Data
Release Date: October 24, 2017

New updates

Foundation Medicine Data Release
- This includes controlled-access VCF and MAF files as well as clinical and biospecimen supplements and metadata.
- Original Foundation Medicine supplied data can be found on the Foundation Medicine Project Page.
Updated RNA-Seq data for TARGET NBL
- Includes new BAM and count files

Updated files for this release are listed here. A complete list of files for DR9.0 is available here.

Bugs Fixed Since Last Release

None

Known Issues and Workarounds

miRNA alignments include QC failed reads.
Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow. (e.g. TARGET-20-PAMYAS-04A-03R)
FM-AD clinical and biospecimen supplement files have incorrect data format. They are listed as XLSX, but are in fact TSV files.
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
There are 11 cases in project TCGA-DLBC that are known to have incorrect WXS data in the GDC Data Portal. Impacted cases are listed below. This affects the BAMs and VCFs associated with these cases in the GDC Data Portal. Corrected BAMs can be found in the GDC Legacy Archive. Variants from affected aliquots appear in the protected MAFs with GDC_FILTER=ContEst to indicate a sample contamination problem, but are removed during the generation of the Somatic MAF file. In a later release we will supply corrected BAM, VCF, and MAF files for these cases. In the meantime, we advise you not to use any of the WXS files associated with these cases in the GDC Data Portal. A list of these files can be found here.
- TCGA-FF-8062
- TCGA-FM-8000
- TCGA-G8-6324
- TCGA-G8-6325
- TCGA-G8-6326
- TCGA-G8-6906
- TCGA-G8-6907
- TCGA-G8-6909
- TCGA-G8-6914
- TCGA-GR-7351
- TCGA-GR-7353
Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
No data from TARGET-MDLS is available.
Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
SDF Files are not linked to Project or Case in the Legacy Archive
Two biotab files are not linked to Project or Case in the Legacy Archive
SDRF files are not linked to Project or Case in the Legacy Archive
Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
Tumor grade property is not populated
Progression_or_recurrence property is not populated

Data Release 8.0

GDC Product: Data
Release Date: August 22, 2017

New updates

Released updated miRNA quantification files to address double counting of some normalized counts described in DR7.0 release notes.

Updated files for this release are listed here. A complete list of files for DR8.0 is available here.

Bugs Fixed Since Last Release

None

Known Issues and Workarounds

TARGET-NBL RNA-Seq files were processed as single-end even though they are derived from paired-end data. These files will be rerun through the GDC RNA-Seq pipelines in a later release. Impacted files can be found here. Downstream count files are also affected. Users may access original FASTQ files in the GDC Legacy Archive, which are not impacted by this issue.
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
There are 11 cases in project TCGA-DLBC that are known to have incorrect WXS data in the GDC Data Portal. Impacted cases are listed below. This affects the BAMs and VCFs associated with these cases in the GDC Data Portal. Corrected BAMs can be found in the GDC Legacy Archive. Variants from affected aliquots appear in the protected MAFs with GDC_FILTER=ContEst to indicate a sample contamination problem, but are removed during the generation of the Somatic MAF file. In a later release we will supply corrected BAM, VCF, and MAF files for these cases. In the meantime, we advise you not to use any of the WXS files associated with these cases in the GDC Data Portal. A list of these files can be found here.
- TCGA-FF-8062
- TCGA-FM-8000
- TCGA-G8-6324
- TCGA-G8-6325
- TCGA-G8-6326
- TCGA-G8-6906
- TCGA-G8-6907
- TCGA-G8-6909
- TCGA-G8-6914
- TCGA-GR-7351
- TCGA-GR-7353
Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
No data from TARGET-MDLS is available.
Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
SDF Files are not linked to Project or Case in the Legacy Archive
Two biotab files are not linked to Project or Case in the Legacy Archive
SDRF files are not linked to Project or Case in the Legacy Archive
Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
Tumor grade property is not populated
Progression_or_recurrence property is not populated
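
The RNA-Seq BAM validation issue above concerns unmapped reads that are retained in the BAM files. To gauge how many records fall into that category, one could count reads whose SAM FLAG has the unmapped bit (0x4) set. The Python sketch below works on SAM-formatted text lines (for example, text produced by converting a BAM file to SAM). The function name and the two example records are made up, and the sketch does not reproduce Picard's validation logic.

```python
UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped


def count_unmapped(sam_lines):
    """Count alignment records whose FLAG has the unmapped bit set."""
    count = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        if int(fields[1]) & UNMAPPED:
            count += 1
    return count


# Made-up records: r1 is unmapped (flag 77), r2 is mapped (flag 99).
records = [
    "@HD\tVN:1.6",
    "r1\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tFFFF",
    "r2\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tFFFF",
]
print(count_unmapped(records))  # 1
```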

Data Release 7.0

GDC Product: Data
Release Date: June 29, 2017

New updates

Updated public Mutation Annotation Format (MAF) files are now available. Updates include filtering to remove variants impacted by OxoG artifacts and those impacted by strand bias.
Protected MAF files are updated to include flags for OxoG and strand bias.
Annotated VCFs are updated to include flags for OxoG artifacts and strand bias.

Updated files for this release are listed here. A complete list of files for DR7.0 is available here.

Bugs Fixed Since Last Release

None

Known Issues and Workarounds

TARGET-NBL RNA-Seq files were processed as single-end even though they are derived from paired-end data. These files will be rerun through the GDC RNA-Seq pipelines in a later release. Impacted files can be found here. Downstream count files are also affected. Users may access original FASTQ files in the GDC Legacy Archive, which are not impacted by this issue.
Reads that are mapped to multiple genomic locations are double counted in some of the GDC miRNA results. The GDC will release updated files correcting the issue in an upcoming release. The specific impacts are described further below:
- Isoform Expression Quantification files
  - Raw reads counts are accurate
  - Normalized counts are proportionally skewed (r^2=1.0)
- miRNA Expression Quantification files
  - A small proportion of miRNA counts are overestimated (mean r^2=0.9999)
  - Normalized counts are proportionally skewed (mean r^2=0.9999)
- miRNA BAM files
  - no impact
Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore, intronic mutations in MAF files are not representative of those called by mutation callers.
The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. The impacted patient is TARGET-50-PAJNUS.
There are 11 cases in project TCGA-DLBC that are known to have incorrect WXS data in the GDC Data Portal. Impacted cases are listed below. This affects the BAMs and VCFs associated with these cases in the GDC Data Portal. Corrected BAMs can be found in the GDC Legacy Archive. Variants from affected aliquots appear in the protected MAFs with GDC_FILTER=ContEst to indicate a sample contamination problem, but are removed during the generation of the Somatic MAF file. In a later release we will supply corrected BAM, VCF, and MAF files for these cases. In the meantime, we advise you not to use any of the WXS files associated with these cases in the GDC Data Portal. A list of these files can be found here. Download list of affected files.
- TCGA-FF-8062
- TCGA-FM-8000
- TCGA-G8-6324
- TCGA-G8-6325
- TCGA-G8-6326
- TCGA-G8-6906
- TCGA-G8-6907
- TCGA-G8-6909
- TCGA-G8-6914
- TCGA-GR-7351
- TCGA-GR-7353
Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
No data from TARGET-MLDS is available.
Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
SDF Files are not linked to Project or Case in the Legacy Archive
Two biotab files are not linked to Project or Case in the Legacy Archive
SDRF files are not linked to Project or Case in the Legacy Archive
Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
Tumor grade property is not populated
Progression_or_recurrence property is not populated
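
As a rough illustration of the TCGA-DLBC advice above, the following sketch drops barcodes that belong to the 11 affected cases before any WXS-derived analysis. The case IDs are copied from the list above; the function names and the sample barcodes in the usage note are hypothetical, and the sketch assumes that the case ID is the first three dash-separated fields of a TCGA barcode.

```python
# Case IDs copied from the list of affected TCGA-DLBC cases in this release note.
AFFECTED_DLBC_CASES = frozenset({
    "TCGA-FF-8062", "TCGA-FM-8000", "TCGA-G8-6324", "TCGA-G8-6325",
    "TCGA-G8-6326", "TCGA-G8-6906", "TCGA-G8-6907", "TCGA-G8-6909",
    "TCGA-G8-6914", "TCGA-GR-7351", "TCGA-GR-7353",
})


def case_id_from_barcode(barcode):
    """Return the case portion of a TCGA barcode.

    Assumption: the case ID is the first three dash-separated fields,
    e.g. "TCGA-FF-8062" for any sample or aliquot barcode of that case.
    """
    return "-".join(barcode.split("-")[:3])


def is_affected(barcode):
    """True if the barcode belongs to one of the affected TCGA-DLBC cases."""
    return case_id_from_barcode(barcode) in AFFECTED_DLBC_CASES


def drop_affected(barcodes):
    """Keep only barcodes whose case is not on the affected list (hypothetical helper)."""
    return [b for b in barcodes if not is_affected(b)]
```

For example, given the made-up aliquot barcodes "TCGA-FF-8062-01A-11D-2210-10" and "TCGA-FF-8041-01A-11D-2210-10", drop_affected would keep only the second, because the first belongs to an affected case.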

Data Release 6.0

GDC Product: Data
Release Date: May 9, 2017

New updates

GDC updated public Mutation Annotation Format (MAF) files are now available. Updates include leveraging the MC3 variant filtering strategy, which results in more variants being recovered relative to the previous version. A detailed description of the new format can be found here.
Protected MAFs are updated to include additional variant annotation information
Some MuTect2 VCFs are updated to include dbSNP and COSMIC annotations found in other VCFs

Updated files for this release are listed here.

Bugs Fixed Since Last Release

None

Known Issues and Workarounds

There are 11 cases in project TCGA-DLBC that are known to have incorrect WXS data in the GDC Data Portal. Impacted cases are listed below. This affects the BAMs and VCFs associated with these cases in the GDC Data Portal. Corrected BAMs can be found in the GDC Legacy Archive. Variants from affected aliquots appear in the protected MAFs with GDC_FILTER=ContEst to indicate a sample contamination problem, but are removed during the generation of the Somatic MAF file. In a later release we will supply corrected BAM, VCF, and MAF files for these cases. In the meantime, we advise you not to use any of the WXS files associated with these cases in the GDC Data Portal. A list of these files can be found here. Download list of affected files.
- TCGA-FF-8062
- TCGA-FM-8000
- TCGA-G8-6324
- TCGA-G8-6325
- TCGA-G8-6326
- TCGA-G8-6906
- TCGA-G8-6907
- TCGA-G8-6909
- TCGA-G8-6914
- TCGA-GR-7351
- TCGA-GR-7353
Variants found in VCF and MAF files may contain OxoG artifacts, which are produced during library preparation and may result in the apparent substitutions of C to A or G to T in certain sequence contexts. In a future release we plan to label potential OxoG artifacts in the MAF files.
Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
Some validated somatic mutations may not be present in open-access MAF files. Please review the protected MAF files in the GDC Data Portal if you are unable to find your mutation in the open-access files.
Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
No data from TARGET-MLDS is available.
Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
SDF Files are not linked to Project or Case in the Legacy Archive
Two biotab files are not linked to Project or Case in the Legacy Archive
SDRF files are not linked to Project or Case in the Legacy Archive
Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
Tumor grade property is not populated
Progression_or_recurrence property is not populated
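
The sample-count discrepancy between public MAF files described above can be checked with a short script. This is a minimal sketch, assuming each MAF row has been parsed into a dictionary with "Tumor_Sample_Barcode" and "FILTER" keys; the helper name and the example rows are hypothetical and are not taken from any GDC file.

```python
def samples_with_pass(rows):
    """Return the set of sample barcodes that have at least one PASS variant.

    Samples whose variants all carry a non-PASS FILTER value are absent from
    the result, mirroring how they are omitted from a public MAF.
    """
    return {row["Tumor_Sample_Barcode"] for row in rows if row["FILTER"] == "PASS"}


# Made-up rows for two variant calling pipelines run on the same two samples.
pipeline_a = [
    {"Tumor_Sample_Barcode": "SAMPLE-1", "FILTER": "PASS"},
    {"Tumor_Sample_Barcode": "SAMPLE-2", "FILTER": "PASS"},
]
pipeline_b = [
    {"Tumor_Sample_Barcode": "SAMPLE-1", "FILTER": "PASS"},
    {"Tumor_Sample_Barcode": "SAMPLE-2", "FILTER": "not_pass"},
]

# Samples present for pipeline A but missing for pipeline B.
only_in_a = samples_with_pass(pipeline_a) - samples_with_pass(pipeline_b)
```

In this example SAMPLE-2 has a PASS variant only in the first pipeline, so it would appear in one public MAF and not the other even though both pipelines processed the same samples.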

Details are provided in Data Release Manifest

Data Release 5.0

GDC Product: Data
Release Date: March 16, 2017

New updates

Additional annotations from TCGA DCC are available
- Complete list of updated TCGA files is found here
Clinical data added for TARGET ALL P1 and P2
Pathology reports now have submitter IDs as assigned by the BCR
TARGET Data refresh
- Most recent biospecimen and clinical information from the TARGET DCC. New imported files are listed here
- Updated indexed biospecimen and clinical metadata
- Updated SRA XML files
- Does not include updates to TARGET NBL

Bugs Fixed Since Last Release

Missing cases from TCGA-LAML were added to Legacy Archive
Biotab files are now linked to Projects and Cases in Legacy Archive

Known Issues and Workarounds

Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
Some validated somatic mutations may not be present in open-access MAF files. When creating open-access MAF files from the protected versions we are extremely conservative in removing potential germline variants. Our approach is to remove all mutations that are present in dbSNP. In a subsequent release we will provide updated open-access MAF files, which preserve variants found in MC3 or a TCGA validation study. Please review the protected MAF files in the GDC Data Portal if you are unable to find your mutation in the open-access files.
Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
MAF Column #109 "FILTER" entries are separated by both commas and semi-colons.
TARGET-AML is undergoing reorganization. Pending reorganization, cases from this project may not contain many clinical, biospecimen, or genomic data files.
No data from TARGET-MLDS is available.
Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
SDF Files are not linked to Project or Case in the Legacy Archive
Two biotab files are not linked to Project or Case in the Legacy Archive
SDRF files are not linked to Project or Case in the Legacy Archive
Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
Tumor grade property is not populated
Progression_or_recurrence property is not populated
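
Because the MAF "FILTER" column noted above mixes commas and semi-colons as separators, a parser that splits on only one of them will return merged flags. The sketch below splits on both; the function name and the flag names in the usage note are placeholders, not values taken from a GDC MAF.

```python
import re


def split_filter(value):
    """Split a MAF FILTER entry on commas and semi-colons.

    Surrounding whitespace is trimmed and empty tokens are dropped, so a
    trailing separator does not produce an empty flag.
    """
    tokens = re.split(r"[;,]", value)
    return [token.strip() for token in tokens if token.strip()]
```

For example, split_filter("flag_a;flag_b,flag_c") returns the three flags separately, and split_filter("PASS") returns a single-element list.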

Details are provided in Data Release Manifest

Data Release 4.0

GDC Product: Data
Release Date: October 31, 2016

New updates

TARGET ALL P1 and P2 biospecimen and molecular data are now available in the Legacy Archive. Clinical data will be available in a later release.
Methylation data from 27k/450k Arrays has been lifted over to hg38 and is now available in the GDC Data Portal
Public MAF files are now available for VarScan2, MuSE, and SomaticSniper. MuTect2 MAFs were made available in a previous release.
Updated VCFs and MAF files are available for MuTect2 pipeline to compensate for WGA-related false positive indels. See additional information on that change here. A listing of replaced files is provided here.
Added submitter_id for Pathology Reports in Legacy Archive

Bugs Fixed Since Last Release

None

Known Issues and Workarounds

Some validated somatic mutations may not be present in open-access MAF files. When creating open-access MAF files from the protected versions we are extremely conservative in removing potential germline variants. Our approach is to remove all mutations that are present in dbSNP. In a subsequent release we will provide updated open-access MAF files, which preserve variants found in COSMIC or a TCGA validation study. Please review the protected MAF files in the GDC Data Portal if you are unable to find your mutation in the open-access files.
Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
MAF Column #109 "FILTER" entries are separated by both commas and semi-colons.
TARGET-AML is undergoing reorganization. Pending reorganization, cases from this project may not contain many clinical, biospecimen, or genomic data files.
No data from TARGET-MLDS is available.
Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
SDF Files are not linked to Project or Case in the Legacy Archive
There are 200 cases from TCGA-LAML that do not appear in the Legacy Archive
Biotab files are not linked to Project or Case in the Legacy Archive
SDRF files are not linked to Project or Case in the Legacy Archive
Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg

Details are provided in Data Release Manifest

Data Release 3.0

GDC Product: Data
Release Date: September 16, 2016

New updates

CCLE data now available (in the Legacy Archive only)
BMI calculation is corrected
Slide is now categorized as a Biospecimen entity

Bugs Fixed Since Last Release

BMI calculation is corrected

Known Issues and Workarounds

Insertions called for tumor samples that underwent whole genome amplification may be of lower quality. Whether a sample underwent this process can be found in the analyte_type property within analyte and aliquot. The TCGA analyte type can also be identified from the 20th character of the TCGA barcode, where "W" corresponds to WGA.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
Public MAFs (those with germline variants removed) are only available for MuTect2 pipeline. MAFs for other pipelines are forthcoming.
MAF Column #109 "FILTER" entries are separated by both commas and semi-colons.
TARGET-AML and TARGET-ALL projects are undergoing reorganization. Pending reorganization, cases from these projects may not contain many clinical, biospecimen, or genomic data files.
No data from TARGET-PPTP is available.
Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
SDF Files are not linked to Project or Case in the Legacy Archive
There are 200 cases from TCGA-LAML that do not appear in the Legacy Archive
Biotab files are not linked to Project or Case in the Legacy Archive
SDRF files are not linked to Project or Case in the Legacy Archive
Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
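
The barcode check described in the whole genome amplification issue above can be sketched as follows. This is only an illustration: the function names are hypothetical, and the barcodes in the usage note are examples of the barcode layout rather than references to specific GDC aliquots.

```python
def analyte_code(barcode):
    """Return the 20th character of a TCGA barcode (index 19, zero-based).

    Raises ValueError when the barcode is too short to carry an analyte
    code, for example a case-level or sample-level barcode.
    """
    if len(barcode) < 20:
        raise ValueError("barcode has no analyte position: %r" % barcode)
    return barcode[19]


def is_wga(barcode):
    """True if the analyte code is "W", which corresponds to WGA."""
    return analyte_code(barcode) == "W"
```

For example, is_wga("TCGA-02-0001-01C-01W-0182-01") is true, while the same barcode with "D" in the 20th position is not flagged. Where it is available, the analyte_type property remains the more direct source for this information.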

Details are provided in Data Release Manifest

Data Release 2.0

GDC Product: Data
Release Date: August 9, 2016

New updates

Additional data, previously available via CGHub and the TCGA DCC, is now available in the GDC
Better linking between files and their associated projects and cases in the Legacy Archive
MAF files are now available in the GDC Data Portal

Known Issues and Workarounds

Insertions called for tumor samples that underwent whole genome amplification may be of lower quality. These are present in VCF and MAF files produced by the MuTect2 variant calling pipeline. Whether a sample underwent this process can be found in the analyte_type property within analyte and aliquot. The TCGA analyte type can also be identified from the 20th character of the TCGA barcode, where "W" corresponds to WGA.
BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
Public MAFs (those with germline variants removed) are only available for MuTect2 pipeline. MAFs for other pipelines are forthcoming.
MAF Column #109 "FILTER" entries are separated by both commas and semi-colons.
TARGET-AML and TARGET-ALL projects are undergoing reorganization. Pending reorganization, cases from these projects may not contain many clinical, biospecimen, or genomic data files.
No data from TARGET-PPTP is available.
Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
SDF Files are not linked to Project or Case in the Legacy Archive
There are 200 cases from TCGA-LAML that do not appear in the Legacy Archive
Biotab files are not linked to Project or Case in the Legacy Archive
SDRF files are not linked to Project or Case in the Legacy Archive
Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg

Details are provided in Data Release Manifest

Initial Data Release (1.0)

GDC Product: Data
Release Date: June 6, 2016

Available Program Data

The Cancer Genome Atlas (TCGA)
Therapeutically Applicable Research To Generate Effective Treatments (TARGET)

Available Harmonized Data

WXS
- Co-cleaned BAM files aligned to GRCh38 using BWA
mRNA-Seq
- BAM files aligned to GRCh38 using STAR 2-pass strategy
- Expression quantification using HTSeq
miRNA-Seq
- BAM files aligned to GRCh38 using BWA aln
- Expression quantification using BCCA miRNA Profiling Pipeline*
Genotyping Array
- CNV segmentation data

Known Issues and Workarounds

BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
All legacy files for TCGA are available in the GDC Legacy Archive, but not always linked back to cases depending on available metadata.
Public MAFs (those with germline variants removed) are only available for MuTect2 pipeline. MAFs for other pipelines are forthcoming.
TARGET-AML and TARGET-ALL projects are undergoing reorganization. Pending reorganization, cases from these projects may not contain many clinical, biospecimen, or genomic data files.
No data from TARGET-PPTP is available.
Legacy data not available in harmonized form:
- Annotated VCF files from TARGET, anticipated in future data release
- TCGA data that failed harmonization or QC or have been newly updated in CGHub: ~1.0% of WXS aliquots, ~1.6% of RNA-Seq aliquots
- TARGET data that failed harmonization or QC, have been newly updated in CGHub, or whose project names are undergoing reorganization: ~76% of WXS aliquots, ~49% of RNA-Seq aliquots, ~57% of miRNA-Seq.
MAF Column #109 "FILTER" entries are separated by both commas and semi-colons.
MAFs are not yet available for query or search in the GDC Data Portal or API. You may download these files using the following manifests, which can be passed directly to the Data Transfer Tool. Links for the open-access TCGA MAFs are provided below for downloading individual files.
- Open-access MAFs manifest
- Controlled-access MAFs manifest

Details are provided in Data Release Manifest

Download Open-access MAF files

Please note that these links no longer point to files and will be updated in the future.