Data Release Notes
Version | Date |
---|---|
v41.0 | August 28, 2024 |
v40.0 | March 29, 2024 |
v39.0 | December 4, 2023 |
v38.0 | August 31, 2023 |
v37.0 | March 29, 2023 |
v36.0 | December 12, 2022 |
v35.0 | September 28, 2022 |
v34.0 | July 27, 2022 |
v33.1 | May 31, 2022 |
v33.0 | May 3, 2022 |
v32.0 | March 29, 2022 |
v31.0 | October 29, 2021 |
v30.0 | September 23, 2021 |
v29.0 | March 31, 2021 |
v28.0 | February 2, 2021 |
v27.0-fix | November 9, 2020 |
v27.0 | October 29, 2020 |
v26.0 | September 8, 2020 |
v25.0 | July 22, 2020 |
v24.0 | May 7, 2020 |
v23.0 | April 7, 2020 |
v22.0 | January 16, 2020 |
v21.0 | December 10, 2019 |
v20.0 | November 11, 2019 |
v19.1 | November 6, 2019 |
v19.0 | September 17, 2019 |
v18.0 | July 8, 2019 |
v17.1 | June 12, 2019 |
v17.0 | June 5, 2019 |
v16.0 | March 26, 2019 |
v15.0 | February 20, 2019 |
v14.0 | December 18, 2018 |
v13.0 | September 27, 2018 |
v12.0 | June 13, 2018 |
v11.0 | May 21, 2018 |
v10.1 | February 15, 2018 |
v10.0 | December 21, 2017 |
v9.0 | October 24, 2017 |
v8.0 | August 22, 2017 |
v7.0 | June 29, 2017 |
v6.0 | May 9, 2017 |
v5.0 | March 16, 2017 |
v4.0 | October 31, 2016 |
v3.0 | September 16, 2016 |
v2.0 | August 9, 2016 |
v1.0 | June 6, 2016 |
Data Release 41.0
- GDC Product: Data
- Release Date: August 28, 2024
New Updates
New Projects
- MATCH-C1
- 11 cases
- WXS, RNA-Seq
- MATCH-P
- 28 cases
- WXS, RNA-Seq
- MATCH-Z1B
- 29 cases
- WXS, RNA-Seq
New Cases from Existing Projects
- CPTAC-3 - 31 cases
New Data Sets
- TARGET-AML Tumor-Only Targeted Sequencing - 163 variant call sets
- TCGA U133 Submitted Expression Arrays
- TCGA-GBM - 560 aliquots
- TCGA-LAML - 183 aliquots
- TCGA-LUSC - 135 aliquots
- TCGA-OV - 548 aliquots
- TCGA-LUAD Methylation Data - 53 aliquots
- CDDP_EAGLE-1 Slide Images - 49 cases
- HCMI-CMDC
- Tumor-Only WGS Data - 2 aliquot BAMs, 2 variant call sets
- Tumor-Only WXS Data - 3 aliquot BAMs, 3 variant call sets
- Updated clinical supplements
- BEATAML1.0-COHORT scRNA-Seq Data - 8 aliquots
Data Updates
- Indexing of ABSOLUTE Liftover copy number variation data
- Release of data for Other Clinical Attribute clinical entities
- The platform field is now populated for harmonized data files and can be used as a filter in the Repository
A complete list of files included in the GDC Data Portal can be found below:
- gdc_manifest_20240826_data_release_41.0_active.tsv.gz
- DR41 Project Level Manifests
- DR41 New Files Manifest
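As a rough illustration of working with a downloaded release manifest, the sketch below reads a gzipped, tab-separated file and totals the listed file sizes. The column names `filename` and `size` and the function name `summarize_manifest` are assumptions made for this example and are not stated in these notes; check the header row of the actual manifest before relying on them.

```python
import csv
import gzip
from collections import Counter


def summarize_manifest(path):
    """Return (file_count, total_bytes, count_by_extension) for a gzipped TSV manifest.

    Assumes a header row containing 'filename' and 'size' columns
    (an assumption; the release notes do not describe the columns).
    """
    file_count = 0
    total_bytes = 0
    by_extension = Counter()
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            file_count += 1
            total_bytes += int(row["size"])
            # Use the text after the last dot as a rough file-type label.
            name = row["filename"]
            ext = name.rsplit(".", 1)[-1] if "." in name else ""
            by_extension[ext] += 1
    return file_count, total_bytes, by_extension
```

Calling `summarize_manifest("gdc_manifest_20240826_data_release_41.0_active.tsv.gz")` on a local copy would give a quick sanity check of how many files and bytes the release lists.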
Bugs Fixed Since Last Release
- Fixed 4 TARGET-NBL gene expression sets that pointed to multiple cases/aliquots
- Fixed multiple expression files per aliquot for several TARGET-AML RNA-Seq aliquots
Known Issues and Workarounds
- The slide image viewer does not display for any non-TCGA slides. At this time, these slides will need to be downloaded and viewed locally. Additionally, the slide image viewer does not display properly for 14 TCGA slides, which are identified here.
- Pathology reports do not have any associated case/biospecimen information in the portal. This information can be found in the reports themselves.
- 397 alignments from the TCGA program were found to have contamination values over 0.04 (alignment list). The ensemble MAFs produced by these alignments were removed from the Data Portal.
- One methylation aliquot from the TCGA-COAD project, TCGA-D5-6930-01A-11D-1926-05, was not added to the portal and will be added in a future release.
- Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFiles tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
- No data from TARGET-MDLS is available.
- TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment.
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from the GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Data Portal. These annotations can be found here.
- Tumor_grade property is not populated
- Progression_or_recurrence property is not populated
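For the duplicated-variant issue in tumor-only annotated VCFs listed above, one possible downstream workaround is to keep only the first record for each site and allele pair. This is a minimal sketch, not a GDC tool: the function name is hypothetical, and keying on CHROM, POS, REF, and ALT is an assumption about what "appear twice" means, since the notes do not say how the repeated records differ.

```python
def drop_duplicate_variants(vcf_lines):
    """Yield VCF lines, keeping only the first record for each variant key.

    Header lines (starting with '#') and blank lines pass through unchanged.
    The key (CHROM, POS, REF, ALT) is an assumption for this sketch.
    """
    seen = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # VCF columns: CHROM, POS, ID, REF, ALT, ...
        key = (fields[0], fields[1], fields[3], fields[4])
        if key in seen:
            continue
        seen.add(key)
        yield line
```

The generator works on any iterable of lines, so it can wrap an open text file handle (for example one returned by `gzip.open(path, "rt")`).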
Data Release 40.0
- GDC Product: Data
- Release Date: March 29, 2024
New Updates
New Projects
- MATCH-R - Genomic Characterization CS-MATCH-0007 Arm R - phs002029
- 28 cases
- WXS, RNA-Seq
- MATCH-S1 - Genomic Characterization CS-MATCH-0007 Arm S1 - phs002153
- 41 cases
- WXS, RNA-Seq
- MATCH-S2 - Genomic Characterization CS-MATCH-0007 Arm S2 - phs002178
- 3 cases
- WXS, RNA-Seq
- MATCH-Z1I - Genomic Characterization CS-MATCH-0007 Arm Z1I - phs002058
- 26 cases
- WXS, RNA-Seq
New Cases from Existing Projects
- CPTAC-3 - 79 cases
- REBC-THYR - 9 cases
New Data Sets
- Targeted Sequencing
- TARGET-AML - 1,596 aliquot BAMs, 769 variant calls
- TARGET-NBL - 998 aliquot BAMs, 476 variant calls
- TARGET-OS - 233 aliquot BAMs, 65 variant calls
- TCGA WGS
- 57 alignments
- 486 variant call aliquot pairs
- REBC-THYR
- WGS - 90 aliquot BAMs, 69 variant calls
- miRNA-Seq - 177 aliquots
- RNA-Seq - 78 aliquots
- RNA-Seq - Addition of STAR-Fusion data to existing aliquots
- HCMI-CMDC
- Slide images for released cases
- Updated clinical supplements
- TCGA-GBM
- miRNA-Seq - 8 aliquots
- RNA-Seq - 1 aliquot
A complete list of files included in the GDC Data Portal can be found below:
- gdc_manifest_27Mar2024_data_release_40.0_active.tsv.gz
- DR40 Project Level Manifests
- DR40 New Files Manifest
Bugs Fixed Since Last Release
- None
Known Issues and Workarounds
- The slide image viewer does not display for any non-TCGA slides. At this time, these slides will need to be downloaded and viewed locally. Additionally, the slide image viewer does not display properly for 14 TCGA slides, which are identified here.
- Pathology reports do not have any associated case/biospecimen information in the portal. This information can be found in the reports themselves.
- 397 alignments from the TCGA program were found to have contamination values over 0.04 (alignment list). The ensemble MAFs produced by these alignments were removed from the Data Portal.
- One methylation aliquot from the TCGA-COAD project, TCGA-D5-6930-01A-11D-1926-05, was not added to the portal and will be added in a future release.
- Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFiles tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
- No data from TARGET-MDLS is available.
- TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment.
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from the GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Data Portal. These annotations can be found here.
- Tumor_grade property is not populated
- Progression_or_recurrence property is not populated
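The miRNA coordinate issue listed above can be compensated for downstream by shifting the reported end position. The sketch below assumes the coordinate string has the layout genome:chromosome:start-end:strand and reads "exclusive" as "one past the last aligned base". The layout, the example value in the usage note, and the function name are all assumptions for illustration, not details taken from these notes.

```python
def inclusive_isoform_coords(coords):
    """Rewrite 'genome:chrom:start-end:strand' so the end coordinate is inclusive.

    Assumes the reported end is exclusive (one past the last aligned base),
    so the inclusive end is end - 1.
    """
    genome, chrom, span, strand = coords.split(":")
    start, end = span.split("-")
    return f"{genome}:{chrom}:{start}-{int(end) - 1}:{strand}"
```

For example, a hypothetical value such as `hg38:chr9:94175961-94175983:+` would become `hg38:chr9:94175961-94175982:+`. Read counts and other columns are left untouched because the notes state that only the reported coordinates are affected.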
Data Release 39.0
- GDC Product: Data
- Release Date: December 4, 2023
New Updates
New Projects
- MATCH-H - Genomic Characterization CS-MATCH-0007 Arm H - phs001888
- 21 cases
- WXS, RNA-Seq
- MATCH-I - Genomic Characterization CS-MATCH-0007 Arm I - phs002181
- 60 cases
- WXS, RNA-Seq
- MATCH-U - Genomic Characterization CS-MATCH-0007 Arm U - phs002179
- 23 cases
- WXS, RNA-Seq
- MATCH-W - Genomic Characterization CS-MATCH-0007 Arm W - phs001948
- 45 cases
- WXS, RNA-Seq
- MATCH-Z1A - Genomic Characterization CS-MATCH-0007 Arm Z1A - phs001973
- 45 cases
- WXS, RNA-Seq
New Cases from Existing Projects
- HCMI-CMDC - 19 cases
New Data Sets
- 6,957 WGS alignments from the TCGA program
- 1,002 sets of WGS variants from TCGA
- MP2PRT-ALL: WXS and RNA-Seq data
- Tumor-only data produced with a new pipeline. This includes raw and annotated VCFs and MAFs for the following projects. Note that all tumor-only variants are controlled-access:
- BEATAML1.0-COHORT
- BEATAML1.0-CRENOLANIB
- CGCI-BLGSP
- CPTAC-3
- HCMI-CMDC
- MATCH-B
- MATCH-H
- MATCH-I
- MATCH-N
- MATCH-Q
- MATCH-U
- MATCH-W
- MATCH-Y
- MATCH-Z1A
- MATCH-Z1D
- OHSU-CNL
- ORGANOID-PANCREATIC
- TARGET-ALL-P3
- TARGET-WT
- VAREPOP-APOLLO
New Metadata
- Sample type refactoring:
- Four fields (tissue_type, specimen_type, preservation_method, tumor_descriptor) have been populated to contain the information that was previously populated in the sample_type field
- The new field, specimen_type, is now available in the API to accommodate information about the biological makeup of the sample
- The follow up data for CPTAC-3 has been updated
Other Updates
- CNV mutations are now available on the exploration page for projects that only had ASCAT CNV data from WGS files. This includes CNV mutations for the following projects:
- APOLLO-LUAD
- CDDP_EAGLE-1
- CGCI-BLGSP
- CGCI-HTMCP-CC
- CGCI-HTMCP-DLBCL
- CGCI-HTMCP-LC
- CPTAC-3
- HCMI-CMDC
- MP2PRT-ALL
- REBC-THYR
- The GENIE program was removed from the GDC Portal because it was not representative of the latest version of GENIE
- GENIE data can be accessed from the AACR Repositories
A complete list of files included in the GDC Data Portal can be found below:
Bugs Fixed Since Last Release
- None
Known Issues and Workarounds
- The slide image viewer does not display for any non-TCGA slides. At this time, these slides will need to be downloaded and viewed locally. Additionally, the slide image viewer does not display properly for 14 TCGA slides, which are identified here.
- Pathology reports do not have any associated case/biospecimen information in the portal. This information can be found in the reports themselves.
- 397 alignments from the TCGA program were found to have contamination values over 0.04 (alignment list). The ensemble MAFs produced by these alignments were removed from the Data Portal.
- One methylation aliquot from the TCGA-COAD project, TCGA-D5-6930-01A-11D-1926-05, was not added to the portal and will be added in a future release.
- The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
- Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFiles tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
- TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment.
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from the GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
- TARGET projects
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow (e.g. TARGET-20-PAMYAS-04A-03R).
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the NCI's webpage on Using TARGET Data. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
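The barcode workaround for mislabeled TARGET samples described above can be sketched as a small helper. It assumes the barcode is hyphen-separated with the two-digit sample code at the start of the fourth segment, a layout inferred only from the example TARGET-20-PAMYAS-04A-03R; the function names are hypothetical.

```python
def sample_type_code(barcode):
    """Return the two-digit sample code from a barcode like TARGET-20-PAMYAS-04A-03R.

    Assumes the code is the first two characters of the fourth
    hyphen-separated segment (inferred from the example in the notes).
    """
    parts = barcode.split("-")
    if len(parts) < 4 or not parts[3][:2].isdigit() or len(parts[3]) < 2:
        raise ValueError(f"unexpected barcode layout: {barcode!r}")
    return parts[3][:2]


def is_recurrent_bone_marrow(barcode):
    """True when the sample code is 04, the code the notes give for bone marrow."""
    return sample_type_code(barcode) == "04"
```

Applied to the example from the notes, `is_recurrent_bone_marrow("TARGET-20-PAMYAS-04A-03R")` evaluates to `True`, regardless of the sample_type label shown in the portal.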
Data Release 38.0
- GDC Product: Data
- Release Date: August 31, 2023
New Updates
New Projects
- MP2PRT-ALL - Molecular Profiling to Predict Response to Treatment for Acute Lymphoblastic Leukemia - phs002005
- 1,507 cases
- WGS
- CGCI-HTMCP-DLBCL - HIV+ Tumor Molecular Characterization Project - Diffuse Large B-Cell Lymphoma - phs000235
- 70 cases
- WGS, RNA-Seq, miRNA-Seq, Tissue Slide Images
- MATCH-B - Genomic Characterization CS-MATCH-0007 Arm B - phs002028
- 33 cases
- WXS, RNA-Seq
- MATCH-N - Genomic Characterization CS-MATCH-0007 Arm N - phs002151
- 21 cases
- WXS, RNA-Seq
New Cases from Existing Projects
- CPTAC-3 - GBM and Kidney cohorts - 50 cases
- HCMI-CMDC - 31 cases
- CGCI-BLGSP - 204 cases
- TCGA-TGCT - 113 cases
New Data Sets
- 9,368 WGS alignments from the TCGA program
- 4,676 Cases
- 9,368 Aliquots
- All methylation files that were produced with the SeSAMe pipeline were replaced with new versions.
- TCGA SNP6 data processed with the ASCAT3 and ABSOLUTE pipelines
- 172 CEL and birdseed files from TCGA SNP6
- Release of remaining data for CGCI projects CGCI-BLGSP and CGCI-HTMCP-CC
New Metadata
- The wgs_coverage field is now populated for most BAMs and will allow for WGS BAMs to be queried by coverage range category.
- The QC metrics for applicable BAMs are now queryable through the GDC Data Portal and API.
- The msi_status and msi_score fields, which were produced using MSISensor2, are now queryable through the GDC Data Portal and API.
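To give a sense of how the newly queryable fields might be used programmatically, the sketch below builds a request body in the nested `op`/`content` filter style used by the GDC API. It makes no network call. The exact field paths, the value "MSI", the choice of returned fields, and the helper name `build_file_filter` are assumptions for illustration; consult the API documentation for the accepted field names and values.

```python
import json


def build_file_filter(field, values, experimental_strategy=None):
    """Build a nested filter dict: field IN values, optionally AND a strategy."""
    clauses = [{"op": "in", "content": {"field": field, "value": list(values)}}]
    if experimental_strategy is not None:
        clauses.append({
            "op": "in",
            "content": {
                "field": "experimental_strategy",
                "value": [experimental_strategy],
            },
        })
    return {"op": "and", "content": clauses}


# Field names and the "MSI" value are placeholders, not confirmed by the notes.
payload = {
    "filters": build_file_filter("msi_status", ["MSI"], experimental_strategy="WGS"),
    "fields": "file_id,file_name,msi_status,msi_score",
    "size": 10,
}
print(json.dumps(payload, indent=2))
```

The same helper could be pointed at wgs_coverage with a coverage range category as the value, once the valid category labels are known.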
A complete list of files included in the GDC Data Portal can be found below:
Bugs Fixed Since Last Release
- The files produced with the SeSAMe pipeline had unfiltered methylation beta values that should be set as N/A for quality reasons. These files were replaced.
- A bug in which certain files were shown to be associated with more aliquots than usual has been fixed.
Known Issues and Workarounds
- The slide image viewer does not display for any non-TCGA slides. At this time, these slides will need to be downloaded and viewed locally. Additionally, the slide image viewer does not display properly for 14 TCGA slides, which are identified here.
- Pathology reports do not have any associated case/biospecimen information in the portal. This information can be found in the reports themselves.
- 397 alignments from the TCGA program were found to have contamination values over 0.04 (alignment list). The ensemble MAFs produced by these alignments were removed from the Data Portal.
- One methylation aliquot from the TCGA-COAD project, TCGA-D5-6930-01A-11D-1926-05, was not added to the portal and will be added in a future release.
- The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
- Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFiles tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
- TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment.
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from the GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
- TARGET projects
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow (e.g. TARGET-20-PAMYAS-04A-03R).
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the NCI's webpage on Using TARGET Data. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
Data Release 37.0
- GDC Product: Data
- Release Date: March 29, 2023
New Updates
New Projects
- APOLLO-LUAD - Proteogenomic characterization of lung adenocarcinoma - phs003011
- 87 cases
- WGS, RNA-Seq
- CGCI-HTMCP-LC - HIV+ Tumor Molecular Characterization Project - Lung Cancer - phs000530
- 39 cases
- WGS, RNA-Seq, miRNA-Seq, Slide Images
- MATCH-Q - Genomic Characterization CS-MATCH-0007 Arm Q - phs001926
- 35 cases
- WXS, RNA-Seq
- MATCH-Y - Genomic Characterization CS-MATCH-0007 Arm Y - phs001904
- 31 cases
- WXS, RNA-Seq
New Data from Existing Projects
- CPTAC-3 - 139 new cases and two new snRNA-Seq samples
- HCMI-CMDC - 118 new cases
- TCGA-THCA - 941 new WGS alignments
- TARGET-OS and TARGET-ALL-P2 - Masked Somatic Mutation MAFs are now open access and their mutations now appear in the exploration portal.
Data Migrated from the Legacy Archive to Active Portal
- Birdseed files that were generated from Affymetrix SNP6 arrays
- Additional WGS Alignments are now available for TCGA projects
- Additional samples from RNA-Seq and WXS are now available for TCGA projects
A complete list of files included in the GDC Data Portal can be found below:
Unavailable Files
- 56 CPTAC-3 snRNA-Seq files are currently unavailable for download. A list of the affected files can be found here. These files will be restored for download by the next data release.
Bugs Fixed Since Last Release
- Outcome data for the CPTAC program has been updated.
- The age_at_index field was incorrectly reported in days in the GENIE program. These values have been removed because they contained the same information as the days_to_birth field.
Known Issues and Workarounds
- The current files produced with the SeSAMe pipeline have unfiltered methylation beta values that should be set as N/A for quality reasons. These files will be replaced in a future release.
- Pathology reports do not have any associated case/biospecimen information in the portal. This information can be found in the reports themselves.
- 397 alignments from the TCGA program were found to have contamination values over 0.04 (alignment list). The ensemble MAFs produced by these alignments were removed from the Data Portal.
- One methylation aliquot from the TCGA-COAD project, TCGA-D5-6930-01A-11D-1926-05, was not added to the portal and will be added in a future release.
- The clinical supplement for TARGET-ALL-P1 is not currently available. It will be made available in a future release.
- The slide image viewer does not display properly for 14 slides, which are identified here. The full slide image can be downloaded as an SVS file.
- The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
- Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release.
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFiles tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
- Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
- TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment.
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from the GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
- TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
- TARGET-20-PASJGZ-04A-02D
- TARGET-30-PAPTLY-01A-01D
- TARGET-20-PAEIKD-09A-01D
- TARGET-20-PASMYS-14A-02D
- TARGET-20-PAMYAS-14A-02D
- TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow (e.g. TARGET-20-PAMYAS-04A-03R).
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
- Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
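One of the known issues above notes that public MAF files from different variant calling pipelines can list different numbers of samples for the same project, because samples with no PASS variants are omitted. A minimal sketch for checking which samples are absent from one MAF relative to another is shown here. It assumes tab-delimited input with optional `#` comment lines, a header row, and the standard MAF column Tumor_Sample_Barcode; the function names are hypothetical.

```python
import csv


def samples_in_maf(lines):
    """Return the set of Tumor_Sample_Barcode values found in MAF text lines."""
    # Skip comment lines so the first remaining line is the column header.
    data = (line for line in lines if not line.startswith("#"))
    reader = csv.DictReader(data, delimiter="\t")
    return {row["Tumor_Sample_Barcode"] for row in reader}


def samples_missing_from(maf_a_lines, maf_b_lines):
    """Samples that appear in MAF A but not in MAF B."""
    return samples_in_maf(maf_a_lines) - samples_in_maf(maf_b_lines)
```

A non-empty result does not by itself indicate a problem: per the notes, it is the expected outcome when a sample has no PASS variants in one pipeline.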
Data Release 36.0
- GDC Product: Data
- Release Date: December 12, 2022
New Updates
New Projects
- MATCH-Z1D - Genomic Characterization CS-MATCH-0007 Arm Z1D - phs001859
- 36 cases
- WXS, RNA-Seq
- CDDP_EAGLE-1 - CDDP Integrative Analysis of Lung Adenocarcinoma (Phase 2) - phs001239
- 50 cases
- WXS, WGS, RNA-Seq
New Data from Existing Projects
- CMI-MPC - new RNA-Seq and WXS data
Data Migrated from the Legacy Archive to Active Portal
- WGS Alignments are now available for 25 TCGA Projects
- Pathology reports from TCGA
- Affymetrix SNP6 Genotyping Array CEL files
- A set of WXS and RNA-Seq samples from TCGA and TARGET that failed harmonization at launch have been rerun and are now available in the active portal.
- TCGA Bisulfite-Seq files can be downloaded using the following manifests:
A complete list of files included in the GDC Data Portal can be found below:
Unavailable Files
- None
Bugs Fixed Since Last Release
- The copy number variation data is now available on the GDC Exploration portal.
- The mutations on GDC Exploration were re-built with the correct gene model.
Known Issues and Workarounds
- Outcome data for the CPTAC program is not up-to-date. Please visit the Proteomic Data Commons for updated outcome data for CPTAC.
- Pathology reports do not have any associated case/biospecimen information in the portal. This information can be found in the reports themselves.
- 397 alignments from the TCGA program were found to have contamination values over 0.04 (alignment list). The ensemble MAFs produced by these alignments were removed from the Data Portal.
- One methylation aliquot from the TCGA-COAD project, TCGA-D5-6930-01A-11D-1926-05, was not added to the portal and will be added in a future release.
- The clinical supplement for TARGET-ALL-P1 is not currently available. It will be made available in a future release.
- The slide image viewer does not display properly for 14 slides, which are identified here. The full slide image can be downloaded as an SVS file.
- The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
- Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release.
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
- The Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
- TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
- TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
- TARGET-20-PASJGZ-04A-02D
- TARGET-30-PAPTLY-01A-01D
- TARGET-20-PAEIKD-09A-01D
- TARGET-20-PASMYS-14A-02D
- TARGET-20-PAMYAS-14A-02D
- TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow (e.g. TARGET-20-PAMYAS-04A-03R).
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. The impacted patient is TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
- Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
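The barcode workaround for mislabeled recurrent TARGET samples, listed in the known issues above, can be sketched as follows. This assumes barcodes follow the pattern of the example in that note (TARGET-20-PAMYAS-04A-03R), with the two-digit sample code at the start of the fourth hyphen-separated field. Only the -04 mapping comes from the note; the function names are hypothetical.

```python
BONE_MARROW_CODE = "04"  # per the workaround note: -04 means bone marrow


def sample_code(barcode: str) -> str:
    """Return the two-digit sample code from a TARGET barcode.

    Assumes the code is the first two characters of the fourth
    hyphen-separated field (an assumption based on the example barcode).
    """
    fields = barcode.split("-")
    if len(fields) < 4 or len(fields[3]) < 2:
        raise ValueError(f"unexpected barcode layout: {barcode!r}")
    return fields[3][:2]


def is_recurrent_bone_marrow(barcode: str) -> bool:
    """True when the barcode carries the bone marrow sample code."""
    return sample_code(barcode) == BONE_MARROW_CODE


print(is_recurrent_bone_marrow("TARGET-20-PAMYAS-04A-03R"))
```

Checking the barcode rather than the sample_type label avoids relying on the mislabeled field.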
Data Release 35.0
- GDC Product: Data
- Release Date: September 28, 2022
New Updates
- The SomaticSniper variant calling pipeline was deprecated. To support this, the following changes were made:
- All SomaticSniper files no longer appear in the portal, but still can be downloaded using the Data Transfer Tool or API using the original UUID.
- The aggregated somatic mutation and masked somatic mutation files (multi-caller MAFs) have been replaced to reflect the absence of variants from the SomaticSniper pipeline.
- The mutations on the exploration portal reflect the above-mentioned masked somatic mutation files.
- 10 snRNA-Seq samples were released from the CPTAC-3 project.
- RNA-Seq samples from 2,082 additional cases are now available for the TARGET-AML project.
- Demographic data has been added for 94 cases in TARGET-ALL-P2 and TARGET-ALL-P3 projects. A list of the updated cases can be found here.
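Files that no longer appear in the portal, such as the deprecated SomaticSniper files, can still be requested by UUID. The sketch below only builds the download URL for the GDC API data endpoint (https://api.gdc.cancer.gov/data/ followed by the UUID). The UUID shown is a placeholder, not a real file, and the function name is hypothetical. Controlled-access files also require an authentication token, which is not shown here.

```python
import uuid

GDC_DATA_ENDPOINT = "https://api.gdc.cancer.gov/data"


def download_url(file_uuid: str) -> str:
    """Build the data-endpoint URL for a single file UUID.

    uuid.UUID raises ValueError for malformed input, so typos are
    caught before any request is made.
    """
    parsed = uuid.UUID(file_uuid)
    return f"{GDC_DATA_ENDPOINT}/{parsed}"


# Placeholder UUID for illustration only.
print(download_url("00000000-0000-0000-0000-000000000000"))
```

The resulting URL can be passed to any HTTP client; the Data Transfer Tool accepts the same UUID directly.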
A complete list of files included in the GDC Data Portal can be found below:
Unavailable Files
- None
Bugs Fixed Since Last Release
- Data from two HCMI-CMDC aliquots (HCM-BROD-0100-C15-85A-01D-A786-36 and HCM-BROD-0679-C43-85M-01D-A80U-36) were incorrectly selected for inclusion in the Exploration Page in Data Release 32 and have been replaced with data from the correct aliquots (HCM-BROD-0100-C15-01A-11D-A786-36 and HCM-BROD-0679-C43-06A-11D-A80U-36).
Known Issues and Workarounds
- The mutations on GDC Exploration were built with an incorrect gene model.
- The mutations are still correct in terms of the gene affected, coordinates, DNA changes, amino acid changes, and impact.
- Mutations associated with genes that were present in GENCODE v36 and not GENCODE v22 are not displayed. This affects less than 1% of mutations.
- Files downloaded from the GDC Repository are not affected by this issue. This only affects mutations that are downloaded from GDC Exploration.
- Pathology reports do not have any associated case/biospecimen information in the portal. This information can be found in the reports themselves.
- 397 alignments from the TCGA program were found to have contamination values over 0.04 (alignment list). The ensemble MAFs produced by these alignments were removed from the Data Portal.
- One methylation aliquot from the TCGA-COAD project, TCGA-D5-6930-01A-11D-1926-05, was not added to the portal and will be added in a future release.
- The clinical supplement for TARGET-ALL-P1 is not currently available. It will be made available in a future release.
- Copy number variations currently do not appear in the Exploration page. This will be restored in a future release.
- Mutations from SomaticSniper were erroneously labelled as LOH (loss of heterozygosity). This affects the VCF files, MAF files, and may cause SomaticSniper mutations to be absent from ensemble MAFs.
- The slide image viewer does not display properly for 14 slides, which are identified here. The full slide image can be downloaded as an SVS file.
- The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
- Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release.
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
- The Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
- TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
- TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
- TARGET-20-PASJGZ-04A-02D
- TARGET-30-PAPTLY-01A-01D
- TARGET-20-PAEIKD-09A-01D
- TARGET-20-PASMYS-14A-02D
- TARGET-20-PAMYAS-14A-02D
- TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow (e.g. TARGET-20-PAMYAS-04A-03R).
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. The impacted patient is TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
- Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
Data Release 34.0
- GDC Product: Data
- Release Date: July 27, 2022
New Updates
- 251 cases from the CPTAC-3 project were added to the portal. This includes all files associated with these cases.
- 243 cases from the BEATAML1.0-COHORT project were added to the portal. This includes most of the files associated with these cases.
- The raw tumor-only VCFs from BEATAML1.0-COHORT are downloadable from the BEATAML1.0-COHORT (2022) publication page here and will be added to the Data Portal in a future release.
- WXS mutations from the BEATAML1.0-COHORT project are now available in the Exploration portal.
- Transcript fusion files are now available for the following projects:
- BEATAML1.0-COHORT
- CMI-ASC
- CMI-MBC
- CPTAC-2
- CTSP-DLBCL1
- MMRF-COMMPASS
- NCICCR-DLBCL
- OHSU-CNL
- ORGANOID-PANCREATIC
- WCDT-MCRPC
A complete list of files included in the GDC Data Portal can be found below:
Unavailable Files
- None
Bugs Fixed Since Last Release
- Data from two HCMI-CMDC aliquots (HCM-BROD-0100-C15-85A-01D-A786-36 and HCM-BROD-0679-C43-85M-01D-A80U-36) were incorrectly selected for inclusion in the Exploration Page in Data Release 32 and have been replaced with data from the correct aliquots (HCM-BROD-0100-C15-01A-11D-A786-36 and HCM-BROD-0679-C43-06A-11D-A80U-36).
Known Issues and Workarounds
- Pathology reports do not have any associated case/biospecimen information in the portal. This information can be found in the reports themselves.
- 397 alignments from the TCGA program were found to have contamination values over 0.04 (alignment list). The ensemble MAFs produced by these alignments were removed from the Data Portal.
- One methylation aliquot from the TCGA-COAD project, TCGA-D5-6930-01A-11D-1926-05, was not added to the portal and will be added in a future release.
- The clinical supplement for TARGET-ALL-P1 is not currently available. It will be made available in a future release.
- Copy number variations currently do not appear in the Exploration page. This will be restored in a future release.
- Mutations from SomaticSniper were erroneously labelled as LOH (loss of heterozygosity). This affects the VCF files, MAF files, and may cause SomaticSniper mutations to be absent from ensemble MAFs.
- The slide image viewer does not display properly for 14 slides, which are identified here. The full slide image can be downloaded as an SVS file.
- The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
- Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release.
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
- The Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
- TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
- TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
- TARGET-20-PASJGZ-04A-02D
- TARGET-30-PAPTLY-01A-01D
- TARGET-20-PAEIKD-09A-01D
- TARGET-20-PASMYS-14A-02D
- TARGET-20-PAMYAS-14A-02D
- TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow (e.g. TARGET-20-PAMYAS-04A-03R).
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. The impacted patient is TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
- Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
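The duplicated variants noted above for tumor-only annotated VCFs can be dropped while reading the file. The known-issue note does not define how the duplicated records differ, so this minimal sketch assumes a duplicate is any record that repeats the CHROM, POS, REF, and ALT of an earlier record, and it keeps the first occurrence. The function name and the example records are hypothetical.

```python
from typing import Iterable, Iterator


def drop_duplicate_variants(lines: Iterable[str]) -> Iterator[str]:
    """Yield VCF lines, skipping records that repeat an earlier variant.

    Header lines (starting with '#') pass through unchanged. A variant
    is identified by (CHROM, POS, REF, ALT); this key is an assumption.
    """
    seen = set()
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        key = (fields[0], fields[1], fields[3], fields[4])
        if key in seen:
            continue
        seen.add(key)
        yield line


# Invented records for illustration only.
records = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tT\t.\tPASS\t.\n",
    "chr1\t100\t.\tA\tT\t.\tPASS\t.\n",
    "chr1\t200\t.\tG\tC\t.\tPASS\t.\n",
]
unique = list(drop_duplicate_variants(records))
print(len(unique))
```

Keeping the first occurrence preserves the original record order, which matters for tools that expect coordinate-sorted input.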
Data Release 33.1
- GDC Product: Data
- Release Date: May 31, 2022
New Updates
- None, see "Bugs Fixed Since Last Release" section below.
A complete list of files included in the GDC Data Portal can be found below:
Unavailable Files
- None
Bugs Fixed Since Last Release
- 32 cases from the EXCEPTIONAL_RESPONDERS-ER project were released as they were missing from the previous release.
- All mutations from EXCEPTIONAL_RESPONDERS-ER in the exploration portal come from WXS data, whereas they were previously a mixture of WXS and Targeted Sequencing.
Known Issues and Workarounds
- Pathology reports do not have any associated case/biospecimen information in the portal. This information can be found in the reports themselves.
- 397 alignments from the TCGA program were found to have contamination values over 0.04 (alignment list). The ensemble MAFs produced by these alignments were removed from the Data Portal.
- One methylation aliquot from the TCGA-COAD project, TCGA-D5-6930-01A-11D-1926-05, was not added to the portal and will be added in a future release.
- The clinical supplement for TARGET-ALL-P1 is not currently available. It will be made available in a future release.
- Copy number variations currently do not appear in the Exploration page. This will be restored in a future release.
- Mutations from SomaticSniper were erroneously labelled as LOH (loss of heterozygosity). This affects the VCF files, MAF files, and may cause SomaticSniper mutations to be absent from ensemble MAFs.
- The slide image viewer does not display properly for 14 slides, which are identified here. The full slide image can be downloaded as an SVS file.
- The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
- Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release.
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
- The Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
- TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
- TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
- TARGET-20-PASJGZ-04A-02D
- TARGET-30-PAPTLY-01A-01D
- TARGET-20-PAEIKD-09A-01D
- TARGET-20-PASMYS-14A-02D
- TARGET-20-PAMYAS-14A-02D
- TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow (e.g. TARGET-20-PAMYAS-04A-03R).
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. The impacted patient is TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
- Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
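The sample-count discrepancy between public MAF files from different pipelines, noted in the known issues above, follows directly from the PASS rule. The toy sketch below uses invented (sample, filter) pairs to show how a sample with no PASS variants drops out of one pipeline's public MAF while remaining in another. The sample names, the non-PASS filter label, and the function name are all made up for illustration.

```python
from typing import Iterable, Set, Tuple


def samples_in_public_maf(records: Iterable[Tuple[str, str]]) -> Set[str]:
    """Return the samples that have at least one PASS variant.

    Each record is a (sample, filter_value) pair; samples with only
    non-PASS variants are omitted, mirroring the rule in the note.
    """
    return {sample for sample, filter_value in records if filter_value == "PASS"}


# Invented calls for the same two samples from two hypothetical pipelines.
pipeline_a = [("S1", "PASS"), ("S2", "PASS")]
pipeline_b = [("S1", "PASS"), ("S2", "filtered")]

print(sorted(samples_in_public_maf(pipeline_a)))
print(sorted(samples_in_public_maf(pipeline_b)))
```

Comparing the two sets shows which samples account for a difference in counts between pipelines.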
Data Release 33.0
- GDC Product: Data
- Release Date: May 3, 2022
New Updates
- New Project: NCI Exceptional Responders Initiative (EXCEPTIONAL_RESPONDERS-ER, phs001145)
- RNA-Seq - 45 Cases
- WXS - 50 Cases
- Targeted Sequencing - 41 Cases
- Mutations from WXS and Targeted Sequencing are present in the exploration page.
- New Project: Molecular Profiling to Predict Response to Treatment - Wilms Tumor (MP2PRT-WT, phs001965)
- WGS - 52 Cases
- RNA-Seq - 52 Cases
- miRNA-Seq - 52 Cases
- Methylation files from the SeSAMe pipeline are now available for CGCI-HTMCP-CC and the TARGET projects.
Complete lists of files for this release, for the GDC Data Portal and the GDC Legacy Archive, are found below:
- gdc_manifest_20220503_data_release_33.0_active.tsv.gz
- gdc_manifest_20220503_data_release_33.0_legacy.tsv.gz
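The release manifests listed above are gzip-compressed, tab-separated files, as their .tsv.gz extension indicates. The sketch below shows one way to load such a file with the Python standard library, assuming it has a header row. The column names in the demo (id, filename) and the demo file itself are invented for illustration; the real manifest columns are not stated in these notes.

```python
import csv
import gzip
import os
import tempfile


def read_manifest(path: str) -> list:
    """Read a gzip-compressed TSV with a header row into a list of dicts."""
    with gzip.open(path, "rt", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# Demo on a tiny invented manifest so the example runs without a download.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_manifest.tsv.gz")
    with gzip.open(demo_path, "wt", newline="") as handle:
        handle.write("id\tfilename\nuuid-1\ta.bam\nuuid-2\tb.vcf\n")
    rows = read_manifest(demo_path)

print(len(rows), rows[0]["filename"])
```

For a full release manifest, iterating over the reader instead of building a list would keep memory use low.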
Unavailable Files
- The Arriba pipeline failed for one aliquot from EXCEPTIONAL_RESPONDERS-ER and is documented here.
Bugs Fixed Since Last Release
- Gene-level copy number files from TCGA-THCA and TCGA-UCEC were set as controlled-access files. These have been corrected to be available as open-access files.
- Due to a problem with the columns generated by the pipeline, all scRNA-Seq files have been replaced with a new version.
Known Issues and Workarounds
- 397 alignments from the TCGA program were found to have contamination values over 0.04 (alignment list). The ensemble MAFs produced by these alignments were removed from the Data Portal.
- One methylation aliquot from the TCGA-COAD project, TCGA-D5-6930-01A-11D-1926-05, was not added to the portal and will be added in a future release.
- The clinical supplement for TARGET-ALL-P1 is not currently available. It will be made available in a future release.
- Copy number variations currently do not appear in the Exploration page. This will be restored in a future release.
- Mutations from SomaticSniper were erroneously labelled as LOH (loss of heterozygosity). This affects the VCF files, MAF files, and may cause SomaticSniper mutations to be absent from ensemble MAFs.
- The slide image viewer does not display properly for 14 slides, which are identified here. The full slide image can be downloaded as an SVS file.
- The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
- Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release.
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFiles tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
- The Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg.
- TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
- TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
- TARGET-20-PASJGZ-04A-02D
- TARGET-30-PAPTLY-01A-01D
- TARGET-20-PAEIKD-09A-01D
- TARGET-20-PASMYS-14A-02D
- TARGET-20-PAMYAS-14A-02D
- TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow (e.g. TARGET-20-PAMYAS-04A-03R).
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
- Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
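The miRNA end-coordinate issue listed among the known issues can be handled when parsing the quantification files. The following is a minimal sketch, not a GDC-provided tool. It assumes the coordinate field is a string shaped like `build:chromosome:start-end:strand`; the function name `to_inclusive_end` and that field layout are assumptions for illustration and should be checked against an actual file.

```python
def to_inclusive_end(isoform_coords: str) -> str:
    """Turn an exclusive end coordinate into an inclusive one.

    Assumes a value shaped like 'hg38:chr9:94175962-94175984:+'
    (build:chromosome:start-end:strand). This layout is an assumption.
    """
    build, chrom, span, strand = isoform_coords.split(":")
    start, end = span.split("-")
    # An exclusive end points one position past the last aligned base,
    # so subtracting 1 gives the inclusive end.
    return f"{build}:{chrom}:{start}-{int(end) - 1}:{strand}"


if __name__ == "__main__":
    print(to_inclusive_end("hg38:chr9:94175962-94175984:+"))
```

Only the reported coordinates change; read counts and other columns are left untouched, consistent with the note that quantification itself is unaffected.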
Data Release 32.0
- GDC Product: Data - GENCODE v36 Release
- Release Date: March 29, 2022
New updates
New data files
- The following data types have been replaced with new GENCODE v36 versions
- RNA-Seq: all files, including alignments, gene expression files, and transcript fusion files.
- WXS and Targeted Sequencing: annotated VCFs, single-caller MAFs, Ensemble MAFs.
- WGS: BEDPE-format structural variants and gene-level copy number variants.
- GENIE Targeted Sequencing files.
- FM-AD Targeted Sequencing files.
- The primary-site-level FM-AD MAF files have been replaced with aliquot-level MAF files.
- RNA-Seq STAR-Counts files now contain additional normalized counts such as FPKM, FPKM-UQ, and TPM.
- All WXS files for TCGA have been replaced with new versions. Alignments will contain QC metrics and variants were produced using the same pipelines as all other GDC projects.
- TCGA RNA-Seq has been changed to contain three alignments (genomic, transcriptome, and chimeric), STAR-counts files, and transcript fusion files for each aliquot.
- The project-level MAFs in TCGA and FM-AD have been replaced with aliquot-level MAFs.
- GENCODE v22 derived files (not BAM) that no longer appear in the portal will be downloadable as previous versions of v36 files.
- Methylation data produced from the SeSAMe pipeline is now available for all TCGA projects.
- Note that miRNA-Seq data remains unchanged. The miRNA-Seq pipeline uses the miRBase database, which is not affected by the GENCODE version change.
- A set of project-level manifests was generated that maps each v22 file to its corresponding v36 file. These can help users transition from v22 to v36 and can be downloaded here.
Removed data files and pipelines
- Files from the HTSeq pipeline are no longer supported and will no longer appear in the portal. Normalized counts can now be found in the STAR-Counts files.
- Files that originated from the methylation liftover pipeline are no longer supported and will no longer appear in the portal.
- GENCODE v22 BAM files that no longer appear in the portal will be available for six months past this release. They may not be available after that.
- New variant calling tumor-normal pairing was implemented in TCGA, which results in certain aliquots no longer being available as a v36 version (see the aliquots labeled "Unpaired Aliquots" here).
- Some aliquots failed harmonization when the new v36 gene model was used, which results in some new versions no longer being available (see the aliquots labeled "Failed Harmonization" here).
- Some aliquots were found to contain a cross-patient contamination level of over 0.04 as measured by GATK4 CalculateContamination (see the aliquots labeled "Contamination" here).
Data Portal Exploration Data
- The Data Portal Exploration Page is now populated based on open-access mutations from analyses that used GENCODE v36.
- Mutations from SomaticSniper will not appear on the Exploration page.
- Due to the copy number variation pipeline transition from GISTIC to ASCAT, the CNV data was not included in the GDC Exploration page. This will be replaced in a future release once visualization of the new pipeline is fully assessed.
- The TCGA program mutations have been processed using the same pipeline as all other projects, which resulted in a 26% reduction in the number of open-access mutations. Some points on this change are listed below with TCGA-BRCA as the benchmark project:
- 97% of the previously released open-access mutations are still discoverable in the new GDC controlled-access MAFs. This number increases to 99.95% when focusing only on mutations that were also called by MC3.
- Somatic mutations will now be removed from the Data Portal Exploration Page unless they are detected by more than one variant calling software. This accounts for 40% of the total reduction.
- Somatic mutations will now be removed from the Data Portal Exploration Page if they are detected outside of the target capture region, while previously out-of-target mutations detected from the TCGA Gene Annotation File (GAF) regions were allowed. This accounts for 36% of the total reduction.
- Some TCGA-specific variant-rescue steps have been removed in favor of a more robust and uniform filtering pipeline.
- Some other minor changes due to updates in the gene model or other databases (e.g., the ExAC germline variant database was replaced with gnomAD in DR32).
Complete lists of files for this release in the GDC Data Portal and the GDC Legacy Archive are found below:
- gdc_manifest_20220316_data_release_32.0_active.tsv.gz
- gdc_manifest_20220316_data_release_32.0_legacy.tsv.gz
Bugs Fixed Since Last Release
- None
Known Issues and Workarounds
- 397 alignments from the TCGA program were found to have contamination values over 0.04 (alignment list). The ensemble MAFs produced by these alignments were removed from the Data Portal.
- One methylation aliquot from the TCGA-COAD project, TCGA-D5-6930-01A-11D-1926-05, was not added to the portal and will be added in a future release.
- The clinical supplement for TARGET-ALL-P1 is not currently available. It will be made available in a future release.
- Copy number variations currently do not appear in the Exploration page. This will be restored in a future release.
- Mutations from SomaticSniper were erroneously labelled as LOH (loss of heterozygosity). This affects the VCF files, MAF files, and may cause SomaticSniper mutations to be absent from ensemble MAFs.
- The slide image viewer does not display properly for 14 slides, which are identified here. The full slide image can be downloaded as an SVS file.
- The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
- Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release.
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFiles tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
- The Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg.
- TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
- TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
- TARGET-20-PASJGZ-04A-02D
- TARGET-30-PAPTLY-01A-01D
- TARGET-20-PAEIKD-09A-01D
- TARGET-20-PASMYS-14A-02D
- TARGET-20-PAMYAS-14A-02D
- TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow (e.g. TARGET-20-PAMYAS-04A-03R).
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
- Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
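The barcode workaround for mislabeled TARGET bone marrow samples, described in the known issues above, can be sketched as follows. This is an illustrative helper, not part of any GDC tool. The function names are hypothetical, and the sketch assumes the sample type code is the first two characters of the fourth hyphen-separated field, as in the example barcode TARGET-20-PAMYAS-04A-03R.

```python
def sample_type_code(barcode: str) -> str:
    """Return the two-character sample type code from a TARGET barcode.

    Assumes the layout TARGET-<xx>-<case>-<code><vial>-..., e.g.
    'TARGET-20-PAMYAS-04A-03R' gives '04'.
    """
    parts = barcode.split("-")
    if len(parts) < 4 or parts[0] != "TARGET":
        raise ValueError(f"not a TARGET sample or aliquot barcode: {barcode!r}")
    return parts[3][:2]


def is_recurrent_bone_marrow(barcode: str) -> bool:
    """True when the barcode carries the -04 code named in the workaround."""
    return sample_type_code(barcode) == "04"


if __name__ == "__main__":
    for bc in ("TARGET-20-PAMYAS-04A-03R", "TARGET-20-PAMYAS-14A-02D"):
        print(bc, is_recurrent_bone_marrow(bc))
```

Relying on the barcode rather than the sample_type field avoids the mislabeling while it remains uncorrected.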
Data Release 31.0
- GDC Product: Data
- Release Date: October 29, 2021
New updates
- TCGA Slide Images:
- All TCGA slide images that were removed earlier this year have been restored.
- Note that the UUIDs for most TCGA slide images have changed. Older manifest files may not work when downloading slide images.
- CPTAC-3 clinical data has been refreshed and includes new follow up entities.
- REBC-THYR
- The clinical and biospecimen XML files were removed as they were not intended for release in DR 30.
- The case REBC-ADL5 was added, which includes one WGS pair.
Complete lists of files for this release in the GDC Data Portal and the GDC Legacy Archive are found below:
- gdc_manifest_20211029_data_release_31.0_active.tsv.gz
- gdc_manifest_20211029_data_release_31.0_legacy.tsv.gz
Bugs Fixed Since Last Release
- One file from a previous version of the methylation pipeline appeared in the data portal (bd2f864a-3f00-47b5-815d-bd01ca21ef61; CPTAC-3). This file should no longer appear in the data portal.
Known Issues and Workarounds
- The slide image viewer does not display properly for 14 slides, which are identified here. The full slide image can be downloaded as an SVS file.
- The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
- Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFiles tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
- The Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg.
- TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
- TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
- TARGET-20-PASJGZ-04A-02D
- TARGET-30-PAPTLY-01A-01D
- TARGET-20-PAEIKD-09A-01D
- TARGET-20-PASMYS-14A-02D
- TARGET-20-PAMYAS-14A-02D
- TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow (e.g. TARGET-20-PAMYAS-04A-03R).
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
- Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
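For the known issue above about variants appearing twice in some tumor-only annotated VCFs, one possible way to drop repeats from an uncompressed VCF is sketched below. It treats two records as duplicates when CHROM, POS, REF, and ALT all match; that definition is an assumption, since the release notes do not specify how the repeated variants differ. The function name is hypothetical.

```python
from typing import Iterable, Iterator


def drop_duplicate_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield VCF lines, skipping repeated CHROM/POS/REF/ALT records.

    Header lines (starting with '#') pass through unchanged. The first
    occurrence of each record is kept; later repeats are dropped.
    """
    seen = set()
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            # Not a well-formed record; pass it through untouched.
            yield line
            continue
        key = (fields[0], fields[1], fields[3], fields[4])
        if key in seen:
            continue
        seen.add(key)
        yield line


if __name__ == "__main__":
    example = [
        "##fileformat=VCFv4.2\n",
        "#CHROM\tPOS\tID\tREF\tALT\n",
        "chr1\t100\t.\tA\tT\n",
        "chr1\t100\t.\tA\tT\n",
    ]
    print("".join(drop_duplicate_records(example)), end="")
```

Because the first occurrence is kept, any annotation differences on the dropped copy are lost; inspect a few duplicated records before applying this to real data.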
Data Release 30.0
- GDC Product: Data
- Release Date: September 23, 2021
New updates
- New Projects:
- TRIO-CRU (phs001163) - Ukrainian National Research Center for Radiation Medicine Trio Study
- WGS Alignments
- REBC-THYR (phs001134) - Comprehensive genomic characterization of radiation-related papillary thyroid cancer in the Ukraine
- miRNA-Seq
- RNA-Seq
- WGS
- CPTAC Program
- CPTAC-3 methylation data produced from the SeSAMe pipeline is now available.
- CPTAC-2 miRNA-Seq files have been replaced with better quality data.
- HCMI-CMDC
- 31 new cases have been released to the GDC Data Portal.
- Methylation data produced from the SeSAMe pipeline is now available.
- TCGA
- Protein expression data (RPPA) is now available for 32 projects.
- RNA-Seq data for TCGA-TGCT was replaced with files from an updated pipeline.
- TARGET-AML - New RNA-Seq and miRNA-Seq aliquots have been released.
Complete lists of files for this release in the GDC Data Portal and the GDC Legacy Archive are found below:
- gdc_manifest_20210923_data_release_30.0_active.tsv.gz
- gdc_manifest_20210923_data_release_30.0_legacy.tsv.gz
Bugs Fixed Since Last Release
- None
Known Issues and Workarounds
- One file from a previous version of the methylation pipeline appears in the data portal (bd2f864a-3f00-47b5-815d-bd01ca21ef61; CPTAC-3). This file cannot be downloaded, but may cause bulk downloads to fail. Remove this file from any manifest or cart you plan on downloading.
- The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
- Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFiles tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
- The Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg.
- TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
- TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
- TARGET-20-PASJGZ-04A-02D
- TARGET-30-PAPTLY-01A-01D
- TARGET-20-PAEIKD-09A-01D
- TARGET-20-PASMYS-14A-02D
- TARGET-20-PAMYAS-14A-02D
- TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow (e.g. TARGET-20-PAMYAS-04A-03R).
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
- Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.* Slide barcodes (
submitter_id
values for Slide entities in the Legacy Archive) are not available - SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
Data Release 29.0
- GDC Product: Data
- Release Date: March 31, 2021
New updates
- Count Me In Program
- Aliquot-level MAFs are now available for projects CMI-ASC, CMI-MBC, and CMI-MPC.
- Somatic mutations are now explorable for projects CMI-ASC, CMI-MBC, and CMI-MPC
- CPTAC Program
- CPTAC-2 open-access somatic mutations are now browsable through the GDC Exploration Portal.
- MSI data is now browsable through the faceted search for CPTAC-2 and CPTAC-3.
- HCMI-CMDC - Data files and explorable mutations for 18 new cases are now available.
Complete lists of files for this release, for the GDC Data Portal and the GDC Legacy Archive, are found below:
- gdc_manifest_20210331_data_release_29.0_active.tsv.gz
- gdc_manifest_20210331_data_release_29.0_legacy.tsv.gz
Bugs Fixed Since Last Release
- The aggregated and masked MAF files that were missing for seven pancreatic cases in CPTAC-3 have been restored to the data portal.
- The missing RNA-Seq data files for the seven normal pancreatic cases in CPTAC-3 have been restored to the data portal.
Known Issues and Workarounds
- The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
- Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
- The Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
- TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
- TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
- TARGET-20-PASJGZ-04A-02D
- TARGET-30-PAPTLY-01A-01D
- TARGET-20-PAEIKD-09A-01D
- TARGET-20-PASMYS-14A-02D
- TARGET-20-PAMYAS-14A-02D
- TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data Portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type "Recurrent Blood Derived Cancer - Bone Marrow" are mislabeled as "Recurrent Blood Derived Cancer - Peripheral Blood". A workaround is to look at the sample barcode, which is -04 for "Recurrent Blood Derived Cancer - Bone Marrow" (e.g. TARGET-20-PAMYAS-04A-03R).
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
- Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
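The sample-barcode workaround for mislabeled TARGET samples described in the known issues above can be sketched in Python. This is a minimal illustration, not a GDC tool: the function names are hypothetical, and it assumes the sample-type code is the first two characters of the fourth hyphen-separated field, as in the example barcode TARGET-20-PAMYAS-04A-03R.

```python
def target_sample_type_code(barcode: str) -> str:
    """Return the two-character sample-type code from a TARGET barcode.

    Assumes the layout TARGET-<project>-<case>-<sample>[-<aliquot>], where
    the sample field starts with the sample-type code (e.g. "04A" -> "04").
    """
    parts = barcode.split("-")
    if len(parts) < 4 or parts[0] != "TARGET":
        raise ValueError(f"not a TARGET sample barcode: {barcode!r}")
    return parts[3][:2]


def is_recurrent_bone_marrow(barcode: str) -> bool:
    """True if the barcode carries code 04, which the release notes give for
    'Recurrent Blood Derived Cancer - Bone Marrow'."""
    return target_sample_type_code(barcode) == "04"


if __name__ == "__main__":
    print(is_recurrent_bone_marrow("TARGET-20-PAMYAS-04A-03R"))
```

Checking the barcode rather than the sample_type field sidesteps the mislabeled metadata, since the barcode itself is not affected by this issue.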
Data Release 28.0
- GDC Product: Data
- Release Date: February 2, 2021
New updates
- New Project: CMI-MPC - Count Me In - The Metastatic Prostate Cancer Project
- WXS alignments and variant calls (VCFs) are available.
- New Data Type: Single nuclei (snRNA-Seq) data is now available for 18 CPTAC-3 cases. See the RNA-Seq documentation for details.
- CPTAC-3
- Data files for 147 new cases from the pancreatic cohort are now available.
- CPTAC-3 open-access somatic mutations are now browsable through the GDC Exploration Portal.
- RNA-Seq transcript fusion files are now available.
- Targeted Sequencing alignments and raw tumor-only variant calls (VCF) are now available.
- HCMI-CMDC
- Data files for 22 new cases are now available.
- The HCMI-CMDC open-access somatic mutations have been refreshed on the GDC Exploration Portal to reflect all newly released cases.
Complete lists of files for this release, for the GDC Data Portal and the GDC Legacy Archive, are found below:
- gdc_manifest_20210202_data_release_28.0_active.tsv.gz
- gdc_manifest_20210202_data_release_28.0_legacy.tsv.gz
Bugs Fixed Since Last Release
- None
Known Issues and Workarounds
- The aggregated and masked MAF files for seven pancreatic cases in CPTAC-3 do not appear in the Data Portal. See below for download instructions.
- This manifest can be used to download the files.
- To download the raw aggregated MAF files, dbGaP access to CPTAC-3 (phs001287) is required. The masked MAF files are open-access.
- The seven cases are as follows: C3L-04027, C3L-04080, C3N-02585, C3N-02768, C3N-02971, C3N-03754, and C3N-03839. The case that each file is associated with is denoted in the manifest.
- The RNA-Seq data files for the seven normal pancreatic cases in CPTAC-3 do not appear in the Data Portal. See below for download instructions.
- This manifest can be used to download the files.
- To download the alignments or splice-junction files, dbGaP access to CPTAC-3 (phs001287) is required. The other gene expression files are open-access.
- The seven cases are as follows: C3L-03513, C3L-07032, C3L-07033, C3L-07034, C3L-07035, C3L-07036, and C3L-07037. The case that each file is associated with is denoted in the manifest.
- The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
- Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
- The Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
- TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
- TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
- TARGET-20-PASJGZ-04A-02D
- TARGET-30-PAPTLY-01A-01D
- TARGET-20-PAEIKD-09A-01D
- TARGET-20-PASMYS-14A-02D
- TARGET-20-PAMYAS-14A-02D
- TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data Portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type "Recurrent Blood Derived Cancer - Bone Marrow" are mislabeled as "Recurrent Blood Derived Cancer - Peripheral Blood". A workaround is to look at the sample barcode, which is -04 for "Recurrent Blood Derived Cancer - Bone Marrow" (e.g. TARGET-20-PAMYAS-04A-03R).
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
- Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
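For users who need inclusive end coordinates from the miRNA isoform quantification files mentioned in the known issues above, the adjustment might look like the following Python sketch. It rests on two assumptions not confirmed by these notes: that each coordinate string has the form genome:chromosome:start-end:strand (for example hg38:chr9:94175961-94175984:+), and that "exclusive" means the reported end is one past the last aligned base. The function name is hypothetical.

```python
def inclusive_isoform_coords(coords: str) -> str:
    """Convert an exclusive end coordinate to an inclusive one.

    Assumed input form: "<genome>:<chrom>:<start>-<end>:<strand>".
    Only the end value is changed (end - 1); everything else is preserved.
    """
    genome, chrom, span, strand = coords.split(":")
    start, end = span.split("-")
    return f"{genome}:{chrom}:{start}-{int(end) - 1}:{strand}"


if __name__ == "__main__":
    print(inclusive_isoform_coords("hg38:chr9:94175961-94175984:+"))
```

Because the issue affects only the reported coordinates, read counts and other quantification columns need no adjustment.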
Data Release 27.0 Bug Fix
- GDC Product: Data
- Release Date: November 9, 2020
New updates
- None, see bug fix section below.
Complete lists of files for this release, for the GDC Data Portal and the GDC Legacy Archive, are found below:
- gdc_manifest_20201109_data_release_27.0_active.tsv.gz
- gdc_manifest_20201109_data_release_27.0_legacy.tsv.gz
Bugs Fixed Since Last Release
- Some files in projects CGCI-BLGSP, CGCI-HTMCP-CC, and HCMI-CMDC were marked on the portal as controlled-access, when they were supposed to be open-access. These are now downloadable as open-access files.
Known Issues and Workarounds
- The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
- Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
- The Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
- TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
- TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
- TARGET-20-PASJGZ-04A-02D
- TARGET-30-PAPTLY-01A-01D
- TARGET-20-PAEIKD-09A-01D
- TARGET-20-PASMYS-14A-02D
- TARGET-20-PAMYAS-14A-02D
- TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data Portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type "Recurrent Blood Derived Cancer - Bone Marrow" are mislabeled as "Recurrent Blood Derived Cancer - Peripheral Blood". A workaround is to look at the sample barcode, which is -04 for "Recurrent Blood Derived Cancer - Bone Marrow" (e.g. TARGET-20-PAMYAS-04A-03R).
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
- Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
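The duplicated variants noted above for tumor-only annotated VCFs can be filtered out before analysis. The Python sketch below is one possible approach, not a GDC-provided method: it assumes the duplicates share identical CHROM, POS, REF, and ALT values (the notes do not say how the repeated records differ, if at all) and keeps the first occurrence. The function name is hypothetical.

```python
from typing import Iterable, Iterator


def drop_duplicate_variants(lines: Iterable[str]) -> Iterator[str]:
    """Yield VCF lines, skipping records whose (CHROM, POS, REF, ALT)
    has already been seen. Header lines (starting with '#') pass through."""
    seen = set()
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            # Not a complete VCF record (e.g. a blank line); leave untouched.
            yield line
            continue
        # Fixed VCF columns: CHROM, POS, ID, REF, ALT, ...
        key = (fields[0], fields[1], fields[3], fields[4])
        if key in seen:
            continue
        seen.add(key)
        yield line


if __name__ == "__main__":
    example = [
        "#CHROM\tPOS\tID\tREF\tALT\n",
        "chr1\t100\t.\tA\tT\n",
        "chr1\t100\t.\tA\tT\n",
    ]
    for kept in drop_duplicate_variants(example):
        print(kept, end="")
```

The function works on any iterable of text lines, so it can be fed an open file handle (for example one opened with gzip.open in text mode for compressed VCFs).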
Data Release 27.0
- GDC Product: Data
- Release Date: October 29, 2020
New updates
- Initial release for the WGS variant calling pipeline. See the documentation on WGS variant calling for more details on the available files. This includes data from the following projects:
- CGCI-BLGSP
- CGCI-HTMCP-CC
- HCMI-CMDC
- RNA-Seq transcript fusion files are available for the following projects:
- CGCI-BLGSP
- CGCI-HTMCP-CC
- HCMI-CMDC
- Aliquot level MAFs were released for CGCI-HTMCP-CC Targeted Sequencing variants. Open access MAFs are included.
- 17 new cases were released for the HCMI-CMDC project. This includes WGS, WXS, and RNA-Seq data.
- WGS alignments were released for 99 TCGA-LUAD cases (196 files).
- Therapeutic agents (treatment) and tumor stage (diagnosis) properties were migrated to remove deprecated values and better adhere to a standardized set of values.
Complete lists of files for DR27.0, for the GDC Data Portal and the GDC Legacy Archive, are found below:
- gdc_manifest_20201029_data_release_27.0_active.tsv.gz
- gdc_manifest_20201029_data_release_27.0_legacy.tsv.gz
Bugs Fixed Since Last Release
- None
Known Issues and Workarounds
- Some files in projects CGCI-BLGSP, CGCI-HTMCP-CC, and HCMI-CMDC are marked on the portal as controlled-access. These files are publicly downloadable using the Data Transfer Tool or API. All files from the following data types should be open-access within the previously specified projects: Biospecimen Supplement, Clinical Supplement, Gene Expression Quantification, Masked Somatic Mutation
- The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
- Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
- The Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
- TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
- TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
- TARGET-20-PASJGZ-04A-02D
- TARGET-30-PAPTLY-01A-01D
- TARGET-20-PAEIKD-09A-01D
- TARGET-20-PASMYS-14A-02D
- TARGET-20-PAMYAS-14A-02D
- TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data Portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type "Recurrent Blood Derived Cancer - Bone Marrow" are mislabeled as "Recurrent Blood Derived Cancer - Peripheral Blood". A workaround is to look at the sample barcode, which is -04 for "Recurrent Blood Derived Cancer - Bone Marrow" (e.g. TARGET-20-PAMYAS-04A-03R).
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
- Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
Data Release 26.0
- GDC Product: Data
- Release Date: September 8, 2020
New updates
- New program released:
- Count Me In (CMI)
- CMI-ASC - The Angiosarcoma Project
- RNA-Seq
- WXS
- CMI-MBC - The Metastatic Breast Cancer Project
- RNA-Seq
- WXS
- Somatic mutations are now available on the exploration portal for the following projects:
- MMRF-COMMPASS
- TARGET-ALL-P3
- TARGET-AML
- TARGET-NBL
- TARGET-WT
- Primary sites and disease types were updated for multiple projects to correspond to GDC Dictionary updates.
Complete lists of files for DR26.0, for the GDC Data Portal and the GDC Legacy Archive, are found below:
- gdc_manifest_20200908_data_release_26.0_active.tsv.gz
- gdc_manifest_20200908_data_release_26.0_legacy.tsv.gz
Bugs Fixed Since Last Release
- The CPTAC-3 head and neck cohort can now be queried by choosing the head and neck anatomic site on the GDC home page.
Known Issues and Workarounds
- The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
- Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
- The Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
- TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
- TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
- TARGET-20-PASJGZ-04A-02D
- TARGET-30-PAPTLY-01A-01D
- TARGET-20-PAEIKD-09A-01D
- TARGET-20-PASMYS-14A-02D
- TARGET-20-PAMYAS-14A-02D
- TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data Portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow (e.g. TARGET-20-PAMYAS-04A-03R).
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
- Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
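The known issue above about tumor-only annotated VCFs containing a small proportion of variants twice can be handled downstream by dropping repeated records. The following is a minimal sketch, not a GDC tool: it assumes a repeated variant can be recognized by an identical CHROM, POS, REF, and ALT, which the release notes do not state explicitly. The function name `dedupe_vcf_records` and the sample lines are made up for illustration.

```python
def dedupe_vcf_records(lines):
    """Yield VCF lines, keeping only the first record for each (CHROM, POS, REF, ALT)."""
    seen = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            # Header, column, and blank lines pass through unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # Assumption: the same CHROM, POS, REF, ALT means the same variant.
        key = (fields[0], fields[1], fields[3], fields[4])
        if key not in seen:
            seen.add(key)
            yield line


# Made-up records for illustration only.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tT\t.\tPASS\t.\n",
    "chr1\t100\t.\tA\tT\t.\tPASS\t.\n",
    "chr1\t250\t.\tG\tC\t.\tPASS\t.\n",
]
deduped = list(dedupe_vcf_records(example))
```

Keeping the first occurrence preserves the original record order, which matters if the file is indexed or compared against the raw VCF afterwards.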
Data Release 25.0
- GDC Product: Data
- Release Date: July 22, 2020
New updates
- New data types released:
- RNA-Seq Transcript Fusion files were released for the following projects:
- TARGET-ALL-P1
- TARGET-ALL-P2
- TARGET-ALL-P3
- TARGET-CCSK
- TARGET-NBL
- TARGET-OS
- TARGET-RT
- TARGET-WT
- The msi_status and msi_score properties can be queried on the GDC Portal for the CPTAC-3 project.
- To query for these fields: go to the GDC Repository, click on "Add a File Filter" at the top left of the screen, type msi_score or msi_status in the field, and click on "msi_score" or "msi_status". This should bring up the corresponding filters to use on the portal.
- 108 cases from the CPTAC-3 LSCC Cohort were released. Includes the following data types:
- WXS
- WGS
- RNA-Seq
- miRNA-Seq
- Aliquot level MAFs were released for MMRF-COMMPASS WXS variants. Open access MAFs are included.
- HCMI-CMDC open-access somatic mutations were released to the Exploration Portal.
Complete lists of files for DR25.0 in the GDC Data Portal and the GDC Legacy Archive are found below:
- gdc_manifest_20200722_data_release_25.0_active.tsv.gz
- gdc_manifest_20200722_data_release_25.0_legacy.tsv.gz
Bugs Fixed Since Last Release
- A few supplements from CGCI-BLGSP are now associated with their correct versions.
Known Issues and Workarounds
- Currently the CPTAC-3 HNSCC cohort does not appear when the "Head and Neck" primary site is selected from the GDC home page. This cohort can be queried by clicking here
- The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
- Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
- Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
- TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from the GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
- TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
- TARGET-20-PASJGZ-04A-02D
- TARGET-30-PAPTLY-01A-01D
- TARGET-20-PAEIKD-09A-01D
- TARGET-20-PASMYS-14A-02D
- TARGET-20-PAMYAS-14A-02D
- TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data Portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow (e.g. TARGET-20-PAMYAS-04A-03R).
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
- Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
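For the miRNA known issue above, where the reported alignment end coordinate is exclusive (offset by 1), a reader who needs inclusive ends can subtract 1 from the end position. The sketch below assumes the coordinate string has the form `build:chromosome:start-end:strand`; that layout and the function name `inclusive_isoform_coords` are assumptions for illustration and are not taken from these release notes.

```python
def inclusive_isoform_coords(coords):
    """Convert an exclusive end coordinate to an inclusive one.

    Assumes `coords` looks like 'hg38:chr9:94175961-94175984:+'.
    """
    build, chrom, span, strand = coords.split(":")
    start, end = span.split("-")
    # The reported end is one past the last aligned base, so step back by 1.
    return f"{build}:{chrom}:{start}-{int(end) - 1}:{strand}"


# Made-up coordinate string for illustration only.
fixed = inclusive_isoform_coords("hg38:chr9:94175961-94175984:+")
```

Only the end coordinate changes; read counts and other columns in the quantification file are unaffected by this issue.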
Data Release 24.0
- GDC Product: Data
- Release Date: May 7, 2020
New updates
- New project released: CGCI-HTMCP-CC - HIV+ Tumor Molecular Characterization Project - Cervical Cancer
- RNA-Seq: Alignments and gene expression levels
- miRNA-Seq: Alignments and miRNA expression levels
- WGS: Alignments
- Targeted Sequencing: Alignments
- 110 new cases were released from the HNSCC cohort of CPTAC-3. This includes WXS, WGS, RNA-Seq and miRNA-Seq data.
- Aliquot-level WXS MAFs are now available from the following projects:
- CPTAC-2
- CPTAC-3
Complete lists of files for DR24.0 in the GDC Data Portal and the GDC Legacy Archive are found below:
- gdc_manifest_20200507_data_release_24.0_active.tsv.gz
- gdc_manifest_20200507_data_release_24.0_legacy.tsv.gz
Bugs Fixed Since Last Release
- None
Known Issues and Workarounds
- Currently the CPTAC-3 HNSCC cohort does not appear when the "Head and Neck" primary site is selected from the GDC home page. This cohort can be queried by clicking here
- The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
- Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
- Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
- TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from the GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
- TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
- TARGET-20-PASJGZ-04A-02D
- TARGET-30-PAPTLY-01A-01D
- TARGET-20-PAEIKD-09A-01D
- TARGET-20-PASMYS-14A-02D
- TARGET-20-PAMYAS-14A-02D
- TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data Portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow (e.g. TARGET-20-PAMYAS-04A-03R).
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
- Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
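The TARGET sample_type workaround listed above (checking the barcode for -04) can be applied programmatically. This is a minimal sketch that assumes the sample code is the first two characters of the fourth dash-separated field, as in the example barcode TARGET-20-PAMYAS-04A-03R; the helper names are hypothetical.

```python
def sample_type_code(barcode):
    """Return the two-character sample code from a TARGET barcode."""
    parts = barcode.split("-")
    if len(parts) < 4 or len(parts[3]) < 2:
        raise ValueError(f"barcode has no sample field: {barcode!r}")
    # Fourth field looks like '04A': sample code followed by a vial letter.
    return parts[3][:2]


def is_recurrent_bone_marrow(barcode):
    """True when the barcode carries the -04 code named in the workaround."""
    return sample_type_code(barcode) == "04"


flag = is_recurrent_bone_marrow("TARGET-20-PAMYAS-04A-03R")
```

Relying on the barcode rather than the sample_type property sidesteps the mislabeling until the metadata is corrected.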
Data Release 23.0
- GDC Product: Data
- Release Date: April 7, 2020
New updates
- New data types released:
- Aliquot-level MAFs: MAF Files with mutations derived from one tumor/normal pair
- HCMI-CMDC
- TARGET-ALL-P2
- TARGET-ALL-P3
- TARGET-AML
- TARGET-NBL
- TARGET-OS
- TARGET-WT
- Note: Previously released TARGET project level MAFs can be downloaded with the following manifest: TARGET_Project-Level-MAF_GDC-Manifest.txt
- Copy number segment and estimate files from SNP6 ASCAT
- All TCGA Projects
- TARGET-ALL-P2
- TARGET-AML
- To accommodate users who prefer to use project-level MAFs, a MAF aggregation tool was developed by the GDC:
- New RNA-Seq data was released from HCMI-CMDC for nine additional cases.
- Clinical updates were performed for the following projects
- CGCI-BLGSP
- HCMI-CMDC
- WCDT-MCRPC
Complete lists of files for DR23.0 in the GDC Data Portal and the GDC Legacy Archive are found below:
- gdc_manifest_20200407_data_release_23.0_active.tsv.gz
- gdc_manifest_20200407_data_release_23.0_legacy.tsv.gz
Bugs Fixed Since Last Release
- The 6 HCMI-CMDC cases without clinical data now have clinical data.
- Most of the "associated_entities" fields in CGCI-BLGSP were not populated correctly; this has been resolved.
Known Issues and Workarounds
- The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
- Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
- Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
- TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from the GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
- TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
- TARGET-20-PASJGZ-04A-02D
- TARGET-30-PAPTLY-01A-01D
- TARGET-20-PAEIKD-09A-01D
- TARGET-20-PASMYS-14A-02D
- TARGET-20-PAMYAS-14A-02D
- TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data Portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow (e.g. TARGET-20-PAMYAS-04A-03R).
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
- Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
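The known issue above notes that public MAF files from different variant calling pipelines may contain different numbers of samples, because samples with no PASS variants are omitted. One way to see which samples account for the difference is to compare the sample barcodes in two MAFs. The sketch below assumes the standard MAF column `Tumor_Sample_Barcode` and `#`-prefixed comment lines; the function name and the sample rows are made up for illustration.

```python
import csv


def maf_sample_barcodes(lines):
    """Return the set of Tumor_Sample_Barcode values in a MAF given as text lines."""
    # Skip comment lines so the first remaining line is the column header.
    data_lines = (line for line in lines if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t")
    return {row["Tumor_Sample_Barcode"] for row in reader}


# Made-up MAF content for two hypothetical pipelines.
pipeline_a = [
    "#example header comment\n",
    "Hugo_Symbol\tTumor_Sample_Barcode\n",
    "TP53\tSAMPLE-1\n",
    "KRAS\tSAMPLE-2\n",
]
pipeline_b = [
    "#example header comment\n",
    "Hugo_Symbol\tTumor_Sample_Barcode\n",
    "TP53\tSAMPLE-1\n",
]
only_in_a = maf_sample_barcodes(pipeline_a) - maf_sample_barcodes(pipeline_b)
```

A sample that appears in only one set had no PASS variants in the other pipeline's output under the rule described above, so its absence is not by itself a sign of missing data.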
Data Release 22.0
- GDC Product: Data
- Release Date: January 16, 2020
New updates
- New projects released:
- WCDT-MCRPC - Genomic Characterization of Metastatic Castration Resistant Prostate Cancer (phs001648)
- RNA-Seq; WGS Data
- New data from HCMI-CMDC
- 16 New Cases
- Includes WXS, WGS, and RNA-Seq data
- New data from CPTAC-3
- 108 New Cases
- Includes WXS, WGS, and RNA-Seq data
- miRNA-Seq data for currently released cases
Complete lists of files for DR22.0 in the GDC Data Portal and the GDC Legacy Archive are found below:
- gdc_manifest_20200116_data_release_22.0_active.tsv.gz
- gdc_manifest_20200116_data_release_22.0_legacy.tsv.gz
Bugs Fixed Since Last Release
- None
Known Issues and Workarounds
- The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
- 6 of the HCMI-CMDC cases are missing clinical nodes
- HCM-CSHL-0060-C18
- HCM-CSHL-0089-C25
- HCM-CSHL-0090-C25
- HCM-CSHL-0092-C25
- HCM-CSHL-0091-C25
- HCM-CSHL-0057-C18
- Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
- Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
- TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from the GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
- TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
- TARGET-20-PASJGZ-04A-02D
- TARGET-30-PAPTLY-01A-01D
- TARGET-20-PAEIKD-09A-01D
- TARGET-20-PASMYS-14A-02D
- TARGET-20-PAMYAS-14A-02D
- TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data Portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type Recurrent Blood Derived Cancer - Bone Marrow are mislabeled as Recurrent Blood Derived Cancer - Peripheral Blood. A workaround is to look at the sample barcode, which is -04 for Recurrent Blood Derived Cancer - Bone Marrow (e.g. TARGET-20-PAMYAS-04A-03R).
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
- Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
Data Release 21.0
- GDC Product: Data
- Release Date: December 10, 2019
New updates
- New projects released:
- GENIE - AACR Project Genomics Evidence Neoplasia Information Exchange (phs001337)
- Includes Targeted Sequencing, Transcript Fusion, Copy Number Estimate from GENIE 5.0
- AACR Project GENIE is divided by sequencing center:
- GENIE-MSK
- GENIE-DFCI
- GENIE-MDA
- GENIE-JHU
- GENIE-UHN
- GENIE-VICC
- GENIE-GRCC
- GENIE-NKI
Complete lists of files for DR21.0 in the GDC Data Portal and the GDC Legacy Archive are found below:
- gdc_manifest_20191210_data_release_21.0_active.txt.gz
- gdc_manifest_20191210_data_release_21.0_legacy.txt.gz
Bugs Fixed Since Last Release
- None
Known Issues and Workarounds
- The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format.
- Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
- The Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
- TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
- TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
- TARGET-20-PASJGZ-04A-02D
- TARGET-30-PAPTLY-01A-01D
- TARGET-20-PAEIKD-09A-01D
- TARGET-20-PASMYS-14A-02D
- TARGET-20-PAMYAS-14A-02D
- TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type "Recurrent Blood Derived Cancer - Bone Marrow" are mislabeled as "Recurrent Blood Derived Cancer - Peripheral Blood". A workaround is to look at the sample barcode, which is -04 for "Recurrent Blood Derived Cancer - Bone Marrow" (e.g. TARGET-20-PAMYAS-04A-03R).
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
- Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
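The miRNA known issue above states that the end coordinate reported in the isoform quantification files is exclusive, i.e. one past the last aligned base. A minimal Python sketch of shifting such a value to an inclusive end follows. It assumes the coordinate is stored as a single string of the form build:chromosome:start-end:strand; that layout, the function name, and the example value are illustrative assumptions, so check the columns of your own files before relying on it.

```python
def to_inclusive_end(coords: str) -> str:
    """Return the coordinate string with its exclusive end made inclusive.

    Assumed input layout (not confirmed by these notes):
    build:chromosome:start-end:strand
    """
    build, chrom, span, strand = coords.split(":")
    start, end = span.split("-")
    # The reported end is exclusive (offset by 1), so the last aligned base is end - 1.
    return f"{build}:{chrom}:{start}-{int(end) - 1}:{strand}"


# Illustrative value only, not taken from a real file.
print(to_inclusive_end("hg38:chr1:1000-1023:+"))
```

The start coordinate is left untouched because the notes only describe the end coordinate as affected.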
Data Release 20.0
- GDC Product: Data
- Release Date: November 11, 2019
New updates
- New projects released:
- CPTAC-2 - CPTAC Proteogenomic Confirmatory Study (phs000892)
- Includes WXS, RNA-Seq, and miRNA-Seq
- OHSU-CNL - Genomic landscape of Neutrophilic Leukemias of Ambiguous Diagnosis (phs001799)
- Includes WXS and RNA-Seq
- No VCF files will be included at this time. They will follow in a later release.
- New TARGET data released
- TARGET-OS: WGS, WXS
- TARGET-NBL: WGS
- TARGET-AML: miRNA
- CGCI-BLGSP miRNA-Seq released
Complete lists of the files for DR20.0 in the GDC Data Portal and the GDC Legacy Archive are found below:
- gdc_manifest_20191111_data_release_20.0_active.txt.gz
- gdc_manifest_20191111_data_release_20.0_legacy.txt.gz
Bugs Fixed Since Last Release
- None
Known Issues and Workarounds
- Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
- The Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
- TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
- TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
- TARGET-20-PASJGZ-04A-02D
- TARGET-30-PAPTLY-01A-01D
- TARGET-20-PAEIKD-09A-01D
- TARGET-20-PASMYS-14A-02D
- TARGET-20-PAMYAS-14A-02D
- TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type "Recurrent Blood Derived Cancer - Bone Marrow" are mislabeled as "Recurrent Blood Derived Cancer - Peripheral Blood". A workaround is to look at the sample barcode, which is -04 for "Recurrent Blood Derived Cancer - Bone Marrow" (e.g. TARGET-20-PAMYAS-04A-03R).
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
- Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
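The TARGET sample_type workaround listed above relies on the sample code inside the barcode. As a hedged sketch: assuming barcodes are hyphen-separated and the sample code is the first two characters of the fourth field (as in TARGET-20-PAMYAS-04A-03R), a small helper can flag samples that should be read as "Recurrent Blood Derived Cancer - Bone Marrow". The function and constant names are hypothetical, and only the -04 code stated in these notes is handled.

```python
BONE_MARROW_RECURRENT_CODE = "04"  # the -04 code named in the workaround


def is_recurrent_bone_marrow(barcode: str) -> bool:
    """True if the barcode's sample code is 04 (assumed to sit in the fourth field)."""
    fields = barcode.split("-")
    if len(fields) < 4:
        return False  # case-level identifiers carry no sample code
    return fields[3][:2] == BONE_MARROW_RECURRENT_CODE


for barcode in ("TARGET-20-PAMYAS-04A-03R", "TARGET-20-PASMYS-14A-02D"):
    print(barcode, is_recurrent_bone_marrow(barcode))
```

Checking the barcode rather than the sample_type field avoids depending on the mislabeled value.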
Data Release 19.1
- GDC Product: Data
- Release Date: November 6, 2019
New updates
- The following cases are no longer available in the GDC Data Portal. They had no data files associated with them in DR 19 so there are no changes in file availability in this release.
- TARGET-00-NAAENF
- TARGET-00-NAAENG
- TARGET-00-NAAENH
- TARGET-00-NAAENI
- TARGET-00-NAAENJ
- TARGET-00-NAAENK
- TARGET-00-NAAENL
- TARGET-00-NAAENM
- TARGET-00-NAAENN
- TARGET-00-NAAENP
- TARGET-00-NAAENR
- TARGET-00-NAAEPE
Complete lists of the files for DR19.1 in the GDC Data Portal and the GDC Legacy Archive are found below:
- gdc_manifest_20190917_data_release_19.0_active.txt.gz
- gdc_manifest_20190917_data_release_19.0_legacy.txt.gz
Bugs Fixed Since Last Release
- None
Known Issues and Workarounds
- Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
- The Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
- TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
- TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
- TARGET-20-PASJGZ-04A-02D
- TARGET-30-PAPTLY-01A-01D
- TARGET-20-PAEIKD-09A-01D
- TARGET-20-PASMYS-14A-02D
- TARGET-20-PAMYAS-14A-02D
- TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type "Recurrent Blood Derived Cancer - Bone Marrow" are mislabeled as "Recurrent Blood Derived Cancer - Peripheral Blood". A workaround is to look at the sample barcode, which is -04 for "Recurrent Blood Derived Cancer - Bone Marrow" (e.g. TARGET-20-PAMYAS-04A-03R).
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
- Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
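For the tumor-only annotated VCFs noted above, where a small proportion of variants could appear twice, one possible cleanup is to keep only the first record for each variant. The sketch below is an assumption about how a user might do this, not a GDC-provided procedure: it treats two records as the same variant when CHROM, POS, REF, and ALT match, and it works on plain-text VCF lines (compressed files would need to be opened with Python's gzip module first). The function name is hypothetical.

```python
from typing import Iterable, Iterator


def drop_repeated_variants(lines: Iterable[str]) -> Iterator[str]:
    """Yield VCF lines, skipping records whose (CHROM, POS, REF, ALT) was already seen."""
    seen = set()
    for line in lines:
        if line.startswith("#"):
            yield line  # meta-information and header lines pass through unchanged
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            yield line  # not a complete record; leave it for the caller to inspect
            continue
        key = (cols[0], cols[1], cols[3], cols[4])  # CHROM, POS, REF, ALT
        if key in seen:
            continue  # second appearance of the same variant
        seen.add(key)
        yield line


example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\n",
    "chr1\t100\t.\tA\tT\n",
    "chr1\t100\t.\tA\tT\n",
    "chr1\t100\t.\tA\tG\n",
]
print("".join(drop_repeated_variants(example)), end="")
```

Keeping the first occurrence is a design choice for illustration; if the repeated records differ in their annotation columns, you may want to compare them before discarding either.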
Data Release 19.0
- GDC Product: Data
- Release Date: September 17, 2019
New updates
- New projects released:
- BEATAML1.0-COHORT - Functional Genomic Landscape of Acute Myeloid Leukemia (phs001657)
- Includes WXS and RNA-Seq
- New TARGET data released
- TARGET-ALL-P1 RNA-Seq
- TARGET-ALL-P2 RNA-Seq, WXS, and miRNA-Seq
- TARGET-ALL-P3 miRNA-Seq
- TARGET-AML WXS, WGS, and miRNA-Seq
- TARGET-NBL WXS and RNA-Seq
- TARGET-RT WGS and RNA-Seq
- TARGET-WT WGS, WXS, and RNA-Seq
- Additional CGCI-BLGSP WGS data released
- Pindel VCFs released for TARGET-ALL-P2, TARGET-ALL-P3, TARGET-AML, TARGET-NBL, TARGET-WT, MMRF-COMMPASS, HCMI-CMDC, and CPTAC-3
- Disease-specific staging properties for many projects were released
Complete lists of the files for DR19.0 in the GDC Data Portal and the GDC Legacy Archive are found below:
- gdc_manifest_20190917_data_release_19.0_active.txt.gz
- gdc_manifest_20190917_data_release_19.0_legacy.txt.gz
Bugs Fixed Since Last Release
- None
Known Issues and Workarounds
- Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
- The Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
- TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
- TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
- TARGET-20-PASJGZ-04A-02D
- TARGET-30-PAPTLY-01A-01D
- TARGET-20-PAEIKD-09A-01D
- TARGET-20-PASMYS-14A-02D
- TARGET-20-PAMYAS-14A-02D
- TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type "Recurrent Blood Derived Cancer - Bone Marrow" are mislabeled as "Recurrent Blood Derived Cancer - Peripheral Blood". A workaround is to look at the sample barcode, which is -04 for "Recurrent Blood Derived Cancer - Bone Marrow" (e.g. TARGET-20-PAMYAS-04A-03R).
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
- Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
Data Release 18.0
- GDC Product: Data
- Release Date: July 8, 2019
New updates
- New Projects released
- MMRF-COMMPASS - Multiple Myeloma CoMMpass Study (phs000748)
- Includes WGS, WXS, and RNA-Seq
- ORGANOID-PANCREATIC - Pancreas Cancer Organoid Profiling (phs001611)
- Includes WGS, WXS, and RNA-Seq
- TARGET-ALL-P1 - Acute Lymphoblastic Leukemia - Phase I (phs000218)
- Includes WGS
- TARGET-ALL-P2 - Acute Lymphoblastic Leukemia - Phase II (phs000218)
- Includes WGS
- CGCI-BLGSP - Burkitt Lymphoma Genome Sequencing Project (phs000235)
- Includes WGS and RNA-Seq
- New versions of RNA-Seq data for TARGET-ALL-P3
- New RNA-Seq data for TARGET-CCSK
- New RNA-Seq data for TARGET-OS
Complete lists of the files for DR18.0 in the GDC Data Portal and the GDC Legacy Archive are found below:
- gdc_manifest_20190708_data_release_18.0_active.txt.gz
- gdc_manifest_20190708_data_release_18.0_legacy.txt.gz
Bugs Fixed Since Last Release
- New versions of RNA-Seq data for TARGET-ALL-P3 resolve issue with missing reads from BAM files.
Known Issues and Workarounds
- Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation"
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
- The Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
- TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
- TARGET projects
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
- TARGET-20-PASJGZ-04A-02D
- TARGET-30-PAPTLY-01A-01D
- TARGET-20-PAEIKD-09A-01D
- TARGET-20-PASMYS-14A-02D
- TARGET-20-PAMYAS-14A-02D
- TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type "Recurrent Blood Derived Cancer - Bone Marrow" are mislabeled as "Recurrent Blood Derived Cancer - Peripheral Blood". A workaround is to look at the sample barcode, which is -04 for "Recurrent Blood Derived Cancer - Bone Marrow" (e.g. TARGET-20-PAMYAS-04A-03R).
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
- Issues in the Legacy Archive
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
Data Release 17.1
- GDC Product: Data
- Release Date: June 12, 2019
New updates
- Rebuilt indices for NCICCR-DLBCL and CTSP-DLBCL1. Fewer files viewable in GDC Data Portal or API.
Complete lists of files for DR17.1 in the GDC Data Portal and the GDC Legacy Archive are found below:
- gdc_manifest_20190612_data_release_17.1_active.txt.gz
- gdc_manifest_20190612_data_release_17.1_legacy.txt.gz
Bugs Fixed Since Last Release
- None
Known Issues and Workarounds
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
- Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
- TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
- TARGET projects
- TARGET ALL-P3 RNA-Seq results from DR14 are missing ~18% of reads. Downsampling appears to be completely random and count files have a very high correlation (>99.99%) with complete data. New versions of these files will be created that include the entire set of reads.
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
- TARGET-20-PASJGZ-04A-02D
- TARGET-30-PAPTLY-01A-01D
- TARGET-20-PAEIKD-09A-01D
- TARGET-20-PASMYS-14A-02D
- TARGET-20-PAMYAS-14A-02D
- TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type "Recurrent Blood Derived Cancer - Bone Marrow" are mislabeled as "Recurrent Blood Derived Cancer - Peripheral Blood". A workaround is to look at the sample barcode, which is -04 for "Recurrent Blood Derived Cancer - Bone Marrow" (e.g. TARGET-20-PAMYAS-04A-03R).
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
- Issues in the Legacy Archive
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
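The barcode workaround for mislabeled TARGET samples can be applied programmatically. This is a minimal sketch that assumes the sample type code is the first two characters of the fourth hyphen-separated barcode field, as in the example barcode given in the known issue above. The function names are hypothetical.

```python
# Sample type code that these notes associate with
# "Recurrent Blood Derived Cancer - Bone Marrow".
RECURRENT_BONE_MARROW_CODE = "04"


def sample_type_code(barcode: str) -> str:
    """Return the two-character sample type code from a barcode
    such as TARGET-20-PAMYAS-04A-03R."""
    # Assumed field order: program, project code, case, sample (code + vial), ...
    sample_field = barcode.split("-")[3]
    return sample_field[:2]


def is_recurrent_bone_marrow(barcode: str) -> bool:
    """True when the barcode carries the -04 sample type code."""
    return sample_type_code(barcode) == RECURRENT_BONE_MARROW_CODE


print(is_recurrent_bone_marrow("TARGET-20-PAMYAS-04A-03R"))
```

Checking the barcode instead of the sample_type label avoids relying on the mislabeled metadata.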
Data Release 17.0
- GDC Product: Data
- Release Date: June 5, 2019
New updates
- New Projects released
- HCMI-CMDC - NCI Cancer Model Development for the Human Cancer Model Initiative (HCMI) (phs001486)
- BEATAML1.0-CRENOLANIB - Clinical Resistance to Crenolanib in Acute Myeloid Leukemia Due to Diverse Molecular Mechanisms (phs001628)
- RNA-Seq data for NCICCR-DLBCL and CTSP-DLBCL1 are released
- ATAC-Seq data for TCGA projects are released
- CPTAC-3 RNA-Seq data are released
- Clinical data updates for TCGA - to see parser code updates review API v1.20 release notes
- Clinical data updates for other projects to accommodate migration of vital_status, days_to_birth, and days_to_death from the Diagnosis to the Demographic node
Complete lists of files for DR17.0 in the GDC Data Portal and the GDC Legacy Archive are found below:
- gdc_manifest_20190605_data_release_17.0_active.txt.gz
- gdc_manifest_20190605_data_release_17.0_legacy.txt.gz
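The release manifests listed above are gzip-compressed text files. As a rough sketch, one could read them as shown below, assuming they are tab-separated with a header row. That layout and the helper name `read_manifest` are assumptions, not something these notes specify.

```python
import csv
import gzip


def read_manifest(path):
    """Yield one dict per row of a gzipped manifest.

    Assumes a tab-separated file whose first line is a header.
    """
    with gzip.open(path, mode="rt", newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")


# Example usage (requires the manifest to be downloaded first):
# rows = list(read_manifest("gdc_manifest_20190605_data_release_17.0_active.txt.gz"))
# print(len(rows), "files listed")
```

Reading row by row keeps memory use low, which may matter for the larger manifests.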
Bugs Fixed Since Last Release
- None
Known Issues and Workarounds
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
- Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
- TCGA Projects
- Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
- TARGET projects
- TARGET ALL-P3 RNA-Seq results from DR14 are missing ~18% of reads. Downsampling appears to be completely random and count files have a very high correlation (>99.99%) with complete data. New versions of these files will be created that include the entire set of reads.
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
- TARGET-20-PASJGZ-04A-02D
- TARGET-30-PAPTLY-01A-01D
- TARGET-20-PAEIKD-09A-01D
- TARGET-20-PASMYS-14A-02D
- TARGET-20-PAMYAS-14A-02D
- TARGET-10-PAPZST-09A-01D
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data portal
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type "Recurrent Blood Derived Cancer - Bone Marrow" are mislabeled as "Recurrent Blood Derived Cancer - Peripheral Blood". A workaround is to look at the sample barcode, which is -04 for "Recurrent Blood Derived Cancer - Bone Marrow" (e.g. TARGET-20-PAMYAS-04A-03R).
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- No data from TARGET-MDLS is available.
- Issues in the Legacy Archive
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- TARGET-MDLS cases do not have disease_type or primary_site populated
Data Release 16.0
- GDC Product: Data
- Release Date: March 26, 2019
New updates
- The CPTAC-3 project (phs001287) is released with WXS and WGS data. RNA-Seq will be released at a later date. Additional project details can be found on the CPTAC Data Source page.
- TARGET-ALL-P3 (phs000218) WGS BAM files are released.
- VAREPOP-APOLLO (phs001374) VCF files are released.
Complete lists of files for DR16.0 in the GDC Data Portal and the GDC Legacy Archive are found below:
- gdc_manifest_20190326_data_release_16.0_active.txt.gz
- gdc_manifest_20190326_data_release_16.0_legacy.txt.gz
Bugs Fixed Since Last Release
- None
Known Issues and Workarounds
- TARGET ALL-P3 RNA-Seq results from DR14 are missing ~18% of reads. Downsampling appears to be completely random and count files have a very high correlation (>99.99%) with complete data. New versions of these files will be created that include the entire set of reads.
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
- TARGET-20-PASJGZ-04A-02D
- TARGET-30-PAPTLY-01A-01D
- TARGET-20-PAEIKD-09A-01D
- TARGET-20-PASMYS-14A-02D
- TARGET-20-PAMYAS-14A-02D
- TARGET-10-PAPZST-09A-01D
- Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data portal
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- TARGET-MDLS cases do not have disease_type or primary_site populated
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type "Recurrent Blood Derived Cancer - Bone Marrow" are mislabeled as "Recurrent Blood Derived Cancer - Peripheral Blood". A workaround is to look at the sample barcode, which is -04 for "Recurrent Blood Derived Cancer - Bone Marrow" (e.g. TARGET-20-PAMYAS-04A-03R).
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
- No data from TARGET-MDLS is available.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
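The sample-count discrepancy between public MAF files from different pipelines, noted above, can be checked by comparing the sets of sample barcodes in each file. This is a minimal sketch. It relies on the standard MAF column `Tumor_Sample_Barcode` and assumes that any lines before the header row start with `#`. The function names are hypothetical.

```python
import csv


def samples_in_maf(lines):
    """Return the set of Tumor_Sample_Barcode values in a MAF,
    given as an iterable of text lines."""
    # Skip '#'-prefixed comment lines that may precede the header row.
    data_lines = (line for line in lines if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t")
    return {row["Tumor_Sample_Barcode"] for row in reader}


def samples_missing_from(maf_a_lines, maf_b_lines):
    """Samples that appear in MAF A but not in MAF B."""
    return samples_in_maf(maf_a_lines) - samples_in_maf(maf_b_lines)
```

A sample reported as missing from one pipeline's public MAF likely had no PASS variants in that pipeline, which is the situation the known issue describes.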
Data Release 15.0
- GDC Product: Data
- Release Date: February 20, 2019
New updates
- TARGET-ALL-P3 is now available and includes RNA-Seq and WXS data.
- New RNA-Seq workflow is now being utilized for new projects. More details can be found in the RNA-Seq pipeline documentation.
- New tumor only variant calling pipeline is now being utilized for new projects. More details can be found in the Tumor only pipeline documentation.
Complete lists of files for DR15.0 in the GDC Data Portal and the GDC Legacy Archive are found below:
- gdc_manifest_20190220_data_release_15.0_active.txt.gz
- gdc_manifest_20190220_data_release_15.0_legacy.txt.gz
Bugs Fixed Since Last Release
- None
Known Issues and Workarounds
- The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file.
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
- TARGET-20-PASJGZ-04A-02D
- TARGET-30-PAPTLY-01A-01D
- TARGET-20-PAEIKD-09A-01D
- TARGET-20-PASMYS-14A-02D
- TARGET-20-PAMYAS-14A-02D
- TARGET-10-PAPZST-09A-01D
- Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data portal
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- TARGET-MDLS cases do not have disease_type or primary_site populated
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type "Recurrent Blood Derived Cancer - Bone Marrow" are mislabeled as "Recurrent Blood Derived Cancer - Peripheral Blood". A workaround is to look at the sample barcode, which is -04 for "Recurrent Blood Derived Cancer - Bone Marrow" (e.g. TARGET-20-PAMYAS-04A-03R).
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
- No data from TARGET-MDLS is available.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
Data Release 14.0
- GDC Product: Data
- Release Date: December 18, 2018
New updates
- Copy Number Variation (CNV) data derived from GISTIC2 results are now available for download for TCGA projects
- New miRNA data available for 181 aliquots for TARGET and TCGA
- Released two SNP6 files (6cd4ef5e-324a-4ace-8779-7a33bd559c83, dfa89ee9-6ee5-460b-bd58-b5ca0e9cb7ac)
- New versions of TCGA biospecimen supplements are available
- Updated primary site for TCGA-AG-3881 to Unknown
- 8 New Harmonized WGS BAM files for TARGET-WT, TARGET-NBL, TARGET-AML added to the portal
Complete lists of files for DR14.0 in the GDC Data Portal and the GDC Legacy Archive are found below:
- gdc_manifest_20181218_data_release_14.0_active.txt.gz
- gdc_manifest_20181218_data_release_14.0_legacy.txt.gz
Bugs Fixed Since Last Release
- FM-AD clinical and biospecimen supplements are now correctly labeled as TSV rather than XLSX
Known Issues and Workarounds
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
- TARGET-20-PASJGZ-04A-02D
- TARGET-30-PAPTLY-01A-01D
- TARGET-20-PAEIKD-09A-01D
- TARGET-20-PASMYS-14A-02D
- TARGET-20-PAMYAS-14A-02D
- TARGET-10-PAPZST-09A-01D
- Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data portal
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- TARGET-MDLS cases do not have disease_type or primary_site populated
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type "Recurrent Blood Derived Cancer - Bone Marrow" are mislabeled as "Recurrent Blood Derived Cancer - Peripheral Blood". A workaround is to look at the sample barcode, which is -04 for "Recurrent Blood Derived Cancer - Bone Marrow" (e.g. TARGET-20-PAMYAS-04A-03R).
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
- No data from TARGET-MDLS is available.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
Data Release 13.0
- GDC Product: Data
- Release Date: September 27, 2018
New updates
- Three new projects are released to the GDC: VAREPOP-APOLLO (phs001374), CTSP-DLBCL1 (phs001184), and NCICCR-DLBCL (phs001444)
- TARGET WGS alignments are released. VCFs will be provided in a later release
- Clinical data was harmonized with ICD-O-3 terminology for TCGA properties case.primary_site, case.disease_type, diagnosis.primary_diagnosis, diagnosis.site_of_resection_or_biopsy, diagnosis.tissue_or_organ_of_origin
- Redaction annotations applied to 11 aliquots in TCGA-DLBC
- Redaction annotations applied to incorrectly trimmed miRNA file in the Legacy Archive
Complete lists of files for DR13.0 in the GDC Data Portal and the GDC Legacy Archive are found below:
- gdc_manifest_20180927_data_release_13.0_active.txt.gz
- gdc_manifest_20180927_data_release_13.0_legacy.txt.gz
Bugs Fixed Since Last Release
- 253 Copy Number Segment and Masked Copy Number Segment files were released. These were skipped in DR 12.0
- 36 Diagnostic TCGA slides were released. They were skipped in DR 12.0
Known Issues and Workarounds
- 506 Copy Number Segment and 36 Slide Image files are designated as controlled-access on the GDC Data Portal. These files are actually open-access and will be downloadable without a token using this manifest.
- 2 Copy Number Segment files from TCGA-TGCT do not appear on the GDC Portal. They can be downloaded using the Data Transfer Tool using the following UUIDs.
- 6cd4ef5e-324a-4ace-8779-7a33bd559c83 - RAMPS_p_TCGA_Batch_430_NSP_GenomeWideSNP_6_E07_1538238.nocnv_grch38.seg.v2.txt
- dfa89ee9-6ee5-460b-bd58-b5ca0e9cb7ac - RAMPS_p_TCGA_Batch_430_NSP_GenomeWideSNP_6_E07_1538238.grch38.seg.v2.txt
- TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub.
- TARGET-20-PASJGZ-04A-02D
- TARGET-30-PAPTLY-01A-01D
- TARGET-20-PAEIKD-09A-01D
- TARGET-20-PASMYS-14A-02D
- TARGET-20-PAMYAS-14A-02D
- TARGET-10-PAPZST-09A-01D
- Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data portal
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot TCGA-VR-A8ET-01A-11D-A403-09 are not available. These VCF files will be replaced in a later release.
- There are 5051 TARGET files for which experimental_strategy, data_format, platform, and data_subtype are blank
- There are two cases with identical submitter_id TARGET-10-PARUYU
- TARGET-MDLS cases do not have disease_type or primary_site populated
- Some TARGET cases are missing days_to_last_follow_up
- Some TARGET cases are missing age_at_diagnosis
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type "Recurrent Blood Derived Cancer - Bone Marrow" are mislabeled as "Recurrent Blood Derived Cancer - Peripheral Blood". A workaround is to look at the sample barcode, which is -04 for "Recurrent Blood Derived Cancer - Bone Marrow" (e.g. TARGET-20-PAMYAS-04A-03R).
- FM-AD clinical and biospecimen supplement files have incorrect data format. They are listed as XLSX, but are in fact TSV files.
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
- No data from TARGET-MDLS is available.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
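The two TCGA-TGCT Copy Number Segment files that do not appear on the GDC Portal are identified above by UUID. As a sketch, the snippet below builds a Data Transfer Tool command and a GDC API download URL for each one. It assumes the `gdc-client download <uuid>` invocation and the `https://api.gdc.cancer.gov/data/<uuid>` endpoint behave as they do in current GDC tooling. The helper names are hypothetical, and nothing is downloaded by this code.

```python
GDC_DATA_ENDPOINT = "https://api.gdc.cancer.gov/data"

# UUIDs taken from the known issue listed in these notes.
UUIDS = [
    "6cd4ef5e-324a-4ace-8779-7a33bd559c83",
    "dfa89ee9-6ee5-460b-bd58-b5ca0e9cb7ac",
]


def dtt_command(uuid):
    """Argument list for a Data Transfer Tool download of one file."""
    return ["gdc-client", "download", uuid]


def api_url(uuid):
    """Direct download URL for one file via the GDC API."""
    return f"{GDC_DATA_ENDPOINT}/{uuid}"


for uuid in UUIDS:
    print(" ".join(dtt_command(uuid)))
    print(api_url(uuid))
```

The printed commands can be run in a shell where the Data Transfer Tool is installed.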
Data Release 12.0
- GDC Product: Data
- Release Date: June 13, 2018
New updates
- Updated clinical and biospecimen XML files for TCGA cases are available in the GDC Data Portal. Equivalent Legacy Archive files may no longer be up to date.
- All biospecimen and clinical supplement files for TCGA projects formerly only found in the Legacy Archive have been updated and transferred to the GDC Data Portal. Equivalent Legacy Archive files and metadata retrieved from the API may no longer be up to date.
- Diagnostic slides from TCGA are now available in the GDC Data Portal and Slide Image Viewer. They were formerly only available in the Legacy Archive.
- Updated Copy Number Segment and Masked Copy Number Segment files are now available. These were generated using an improved mapping of hg38 coordinates for the Affymetrix SNP6.0 probe set.
- VCF files containing SNVs produced from TARGET WGS CGI data are available. The variant calls were initially produced by CGI and lifted over to hg38.
Updated files for this release are listed here. A complete list of files for DR12.0 is available for the GDC Data Portal here and for the GDC Legacy Archive here.
Bugs Fixed Since Last Release
- TARGET NBL RNA-Seq data is now associated with the correct aliquot.
Known Issues and Workarounds
- Some Copy Number Segment and Masked Copy Number Segment files were not replaced in DR12.0. 253 files remain to be swapped in a later release.
- Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release.
- 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled.
- 36 Diagnostic TCGA slides are not yet available in the active GDC Portal. They are still available in the GDC Legacy Archive.
- 11 BAM files for TARGET-NBL RNA-Seq are not available in the GDC Data Portal
- Two tissue slide images are unavailable for download from GDC Data Portal
- The raw and annotated VarScan VCF files for aliquot `TCGA-VR-A8ET-01A-11D-A403-09` are not available. These VCF files will be replaced in a later release.
- There are 5051 TARGET files for which `experimental_strategy`, `data_format`, `platform`, and `data_subtype` are blank
- There are two cases with identical submitter_id `TARGET-10-PARUYU`
- TARGET-MDLS cases do not have disease_type or primary_site populated
- Some TARGET cases are missing `days_to_last_follow_up`
- Some TARGET cases are missing `age_at_diagnosis`
- Some TARGET files are not connected to all related aliquots
- Samples of TARGET sample_type `Recurrent Blood Derived Cancer - Bone Marrow` are mislabeled as `Recurrent Blood Derived Cancer - Peripheral Blood`. A workaround is to look at the sample barcode, which is -04 for `Recurrent Blood Derived Cancer - Bone Marrow` (e.g. `TARGET-20-PAMYAS-04A-03R`).
- FM-AD clinical and biospecimen supplement files have incorrect data format. They are listed as XLSX, but are in fact TSV files.
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- There are 11 cases in project TCGA-DLBC that are known to have incorrect WXS data in the GDC Data Portal. Impacted cases are listed below. This affects the BAMs and VCFs associated with these cases in the GDC Data Portal. Corrected BAMs can be found in the GDC Legacy Archive. Variants from affected aliquots appear in the protected MAFs with GDC_FILTER=ContEst to indicate a sample contamination problem, but are removed during the generation of the Somatic MAF file. In a later release we will supply corrected BAM, VCF, and MAF files for these cases. In the meantime, we advise you not to use any of the WXS files associated with these cases in the GDC Data Portal. A list of these files can be found here. Download list of affected files.
- TCGA-FF-8062
- TCGA-FM-8000
- TCGA-G8-6324
- TCGA-G8-6325
- TCGA-G8-6326
- TCGA-G8-6906
- TCGA-G8-6907
- TCGA-G8-6909
- TCGA-G8-6914
- TCGA-GR-7351
- TCGA-GR-7353
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
- No data from TARGET-MDLS is available.
- Slide barcodes (`submitter_id` values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg, should be described as the weight of the portion in mg
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
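The sample barcode workaround for mislabeled TARGET samples listed in the known issues above can be sketched in Python. This is a minimal illustration rather than GDC code: the helper names and the regular expression (including the assumed barcode layout) are hypothetical, and the only fact taken from these notes is that the sample code `04` marks `Recurrent Blood Derived Cancer - Bone Marrow`.

```python
import re
from typing import Optional

# Assumed layout: TARGET-<2 digits>-<case id>-<2-digit sample code><vial letter>...
# Only the meaning of the "-04" sample code comes from the release notes.
_SAMPLE_CODE = re.compile(r"^TARGET-\d{2}-[A-Z0-9]+-(\d{2})[A-Z]")

BONE_MARROW_RECURRENT_CODE = "04"


def sample_code(barcode: str) -> Optional[str]:
    """Return the two-digit sample code, or None if the barcode does not match."""
    match = _SAMPLE_CODE.match(barcode)
    return match.group(1) if match else None


def is_recurrent_bone_marrow(barcode: str) -> bool:
    """True when the barcode carries the -04 sample code."""
    return sample_code(barcode) == BONE_MARROW_RECURRENT_CODE


print(is_recurrent_bone_marrow("TARGET-20-PAMYAS-04A-03R"))
```

Checking the barcode instead of the `sample_type` field sidesteps the mislabeling, at the cost of relying on the barcode layout assumed in the pattern.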
Data Release 11.0
- GDC Product: Data
- Release Date: May 21, 2018
New updates
- Updated miRNA files to remove QCFail reads. This included all BAM and downstream count files.
- TCGA tissue slide images are now available in the GDC Data Portal. Previously these were found only in the Legacy Archive.
Updated files for this release are listed here. A complete list of files for DR11.0 is available for the GDC Data Portal here and for the GDC Legacy Archive here.
Bugs Fixed Since Last Release
- N/A
Known Issues and Workarounds
- Two tissue slide images are unavailable for download from GDC Data Portal
- RNA-Seq files for TARGET-NBL are attached to the incorrect aliquot. The BAM files contain the correct information in their header but the connection in the GDC to read groups and aliquots is incorrect. The linked file below contains a mapping between the aliquots where files are currently associated and the aliquots where they should instead be associated (mapping file).
- The raw and annotated VarScan VCF files for aliquot `TCGA-VR-A8ET-01A-11D-A403-09` were not replaced in DR10.0 and thus do not contain indels. However, the indels from this aliquot can be found in the MAF files and are displayed in the Exploration section in the Data Portal. These VCF files will be replaced in a later release.
- There are 5051 TARGET files for which `experimental_strategy`, `data_format`, `platform`, and `data_subtype` are blank
- There are two cases with identical submitter_id `TARGET-10-PARUYU`
- TARGET-MDLS cases do not have disease_type or primary_site populated
- Some TARGET cases are missing `days_to_last_follow_up`
- Some TARGET cases are missing `age_at_diagnosis`
- Some TARGET files are not connected to all related aliquots
- miRNA alignments include QC failed reads.
- Samples of TARGET sample_type `Recurrent Blood Derived Cancer - Bone Marrow` are mislabeled as `Recurrent Blood Derived Cancer - Peripheral Blood`. A workaround is to look at the sample barcode, which is -04 for `Recurrent Blood Derived Cancer - Bone Marrow` (e.g. `TARGET-20-PAMYAS-04A-03R`).
- FM-AD clinical and biospecimen supplement files have incorrect data format. They are listed as XLSX, but are in fact TSV files.
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- There are 11 cases in project TCGA-DLBC that are known to have incorrect WXS data in the GDC Data Portal. Impacted cases are listed below. This affects the BAMs and VCFs associated with these cases in the GDC Data Portal. Corrected BAMs can be found in the GDC Legacy Archive. Variants from affected aliquots appear in the protected MAFs with GDC_FILTER=ContEst to indicate a sample contamination problem, but are removed during the generation of the Somatic MAF file. In a later release we will supply corrected BAM, VCF, and MAF files for these cases. In the meantime, we advise you not to use any of the WXS files associated with these cases in the GDC Data Portal. A list of these files can be found here. Download list of affected files.
- TCGA-FF-8062
- TCGA-FM-8000
- TCGA-G8-6324
- TCGA-G8-6325
- TCGA-G8-6326
- TCGA-G8-6906
- TCGA-G8-6907
- TCGA-G8-6909
- TCGA-G8-6914
- TCGA-GR-7351
- TCGA-GR-7353
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
- No data from TARGET-MDLS is available.
- Slide barcodes (`submitter_id` values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg, should be described as the weight of the portion in mg
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
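Until corrected files are released, the advice above to avoid WXS files from the 11 affected TCGA-DLBC cases can be applied programmatically. The following is a rough sketch: the case IDs are copied from the list above, while the helper names and the assumption that a case ID is the first three dash-separated fields of a barcode are illustrative only.

```python
# Case submitter IDs copied from the TCGA-DLBC known issue above.
AFFECTED_DLBC_CASES = frozenset({
    "TCGA-FF-8062", "TCGA-FM-8000", "TCGA-G8-6324", "TCGA-G8-6325",
    "TCGA-G8-6326", "TCGA-G8-6906", "TCGA-G8-6907", "TCGA-G8-6909",
    "TCGA-G8-6914", "TCGA-GR-7351", "TCGA-GR-7353",
})


def case_id(barcode: str) -> str:
    """Assumption: the case ID is the first three dash-separated fields."""
    return "-".join(barcode.split("-")[:3])


def drop_affected(barcodes):
    """Keep only barcodes whose case is not in the affected TCGA-DLBC list."""
    return [b for b in barcodes if case_id(b) not in AFFECTED_DLBC_CASES]


# The first barcode is a made-up aliquot for an affected case;
# the second is the aliquot named elsewhere in these notes.
kept = drop_affected([
    "TCGA-FF-8062-01A-00D-0000-00",
    "TCGA-VR-A8ET-01A-11D-A403-09",
])
print(kept)
```

Filtering on the case ID rather than on individual file names keeps the exclusion valid for every BAM and VCF derived from an affected case.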
Data Release 10.1
- GDC Product: Data
- Release Date: February 15, 2018
New updates
- Updated FM-AD clinical data to conform with Data Dictionary release v1.11
Bugs Fixed Since Last Release
None
Known Issues and Workarounds
- RNA-Seq files for TARGET-NBL are attached to the incorrect aliquot. The BAM files contain the correct information in their header but the connection in the GDC to read groups and aliquots is incorrect. The linked file below contains a mapping between the aliquots where files are currently associated and the aliquots where they should instead be associated (mapping file).
- The raw and annotated VarScan VCF files for aliquot `TCGA-VR-A8ET-01A-11D-A403-09` were not replaced in DR10.0 and thus do not contain indels. However, the indels from this aliquot can be found in the MAF files and are displayed in the Exploration section in the Data Portal. These VCF files will be replaced in a later release.
- There are 5051 TARGET files for which `experimental_strategy`, `data_format`, `platform`, and `data_subtype` are blank
- There are two cases with identical submitter_id `TARGET-10-PARUYU`
- TARGET-MDLS cases do not have disease_type or primary_site populated
- Some TARGET cases are missing `days_to_last_follow_up`
- Some TARGET cases are missing `age_at_diagnosis`
- Some TARGET files are not connected to all related aliquots
- miRNA alignments include QC failed reads.
- Samples of TARGET sample_type `Recurrent Blood Derived Cancer - Bone Marrow` are mislabeled as `Recurrent Blood Derived Cancer - Peripheral Blood`. A workaround is to look at the sample barcode, which is -04 for `Recurrent Blood Derived Cancer - Bone Marrow` (e.g. `TARGET-20-PAMYAS-04A-03R`).
- FM-AD clinical and biospecimen supplement files have incorrect data format. They are listed as XLSX, but are in fact TSV files.
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- There are 11 cases in project TCGA-DLBC that are known to have incorrect WXS data in the GDC Data Portal. Impacted cases are listed below. This affects the BAMs and VCFs associated with these cases in the GDC Data Portal. Corrected BAMs can be found in the GDC Legacy Archive. Variants from affected aliquots appear in the protected MAFs with GDC_FILTER=ContEst to indicate a sample contamination problem, but are removed during the generation of the Somatic MAF file. In a later release we will supply corrected BAM, VCF, and MAF files for these cases. In the meantime, we advise you not to use any of the WXS files associated with these cases in the GDC Data Portal. A list of these files can be found here. Download list of affected files.
- TCGA-FF-8062
- TCGA-FM-8000
- TCGA-G8-6324
- TCGA-G8-6325
- TCGA-G8-6326
- TCGA-G8-6906
- TCGA-G8-6907
- TCGA-G8-6909
- TCGA-G8-6914
- TCGA-GR-7351
- TCGA-GR-7353
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
- No data from TARGET-MDLS is available.
- Slide barcodes (`submitter_id` values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg, should be described as the weight of the portion in mg
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
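Because the FM-AD supplement files noted above are listed as XLSX but are actually TSV, a reader can check the content before picking a parser. The sketch below relies on the fact that genuine XLSX files are ZIP archives beginning with the bytes `PK\x03\x04`; the function names, the UTF-8 decoding, and the sample payload with its column names are assumptions made for illustration.

```python
import csv
import io

XLSX_MAGIC = b"PK\x03\x04"  # XLSX files are ZIP archives, which start with these bytes


def looks_like_xlsx(data: bytes) -> bool:
    """True if the content starts with the ZIP signature used by XLSX files."""
    return data.startswith(XLSX_MAGIC)


def read_supplement(data: bytes):
    """Parse as TSV unless the bytes really are a ZIP/XLSX container."""
    if looks_like_xlsx(data):
        raise ValueError("real XLSX content; use a spreadsheet reader instead")
    text = data.decode("utf-8")  # assumed encoding
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


# Hypothetical two-column payload standing in for a downloaded supplement file.
sample = b"case_id\tgender\nCASE-1\tfemale\n"
rows = read_supplement(sample)
print(rows[0]["gender"])
```

Inspecting the leading bytes avoids trusting either the listed data format or the file extension.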
Data Release 10.0
- GDC Product: Data
- Release Date: December 21, 2017
New updates
- New TARGET files for all projects
- TARGET updates for clinical and biospecimen data
- Replace corrupted .bai files
- Update TCGA and TARGET MAF files to include VarScan2 indels and more information in all_effects column
- Update VarScan VCF files
Updated files for this release are listed here. A complete list of files for DR10.0 is available for the GDC Data Portal here and for the GDC Legacy Archive here.
Bugs Fixed Since Last Release
None
Known Issues and Workarounds
- The raw and annotated VarScan VCF files for aliquot `TCGA-VR-A8ET-01A-11D-A403-09` were not replaced in DR10.0 and thus do not contain indels. However, the indels from this aliquot can be found in the MAF files and are displayed in the Exploration section in the Data Portal. These VCF files will be replaced in a later release.
- There are 5051 TARGET files for which `experimental_strategy`, `data_format`, `platform`, and `data_subtype` are blank
- There are two cases with identical submitter_id `TARGET-10-PARUYU`
- TARGET-MDLS cases do not have disease_type or primary_site populated
- Some TARGET cases are missing `days_to_last_follow_up`
- Some TARGET cases are missing `age_at_diagnosis`
- Some TARGET files are not connected to all related aliquots
- miRNA alignments include QC failed reads.
- Samples of TARGET sample_type `Recurrent Blood Derived Cancer - Bone Marrow` are mislabeled as `Recurrent Blood Derived Cancer - Peripheral Blood`. A workaround is to look at the sample barcode, which is -04 for `Recurrent Blood Derived Cancer - Bone Marrow` (e.g. `TARGET-20-PAMYAS-04A-03R`).
- FM-AD clinical and biospecimen supplement files have incorrect data format. They are listed as XLSX, but are in fact TSV files.
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- There are 11 cases in project TCGA-DLBC that are known to have incorrect WXS data in the GDC Data Portal. Impacted cases are listed below. This affects the BAMs and VCFs associated with these cases in the GDC Data Portal. Corrected BAMs can be found in the GDC Legacy Archive. Variants from affected aliquots appear in the protected MAFs with GDC_FILTER=ContEst to indicate a sample contamination problem, but are removed during the generation of the Somatic MAF file. In a later release we will supply corrected BAM, VCF, and MAF files for these cases. In the meantime, we advise you not to use any of the WXS files associated with these cases in the GDC Data Portal. A list of these files can be found here. Download list of affected files.
- TCGA-FF-8062
- TCGA-FM-8000
- TCGA-G8-6324
- TCGA-G8-6325
- TCGA-G8-6326
- TCGA-G8-6906
- TCGA-G8-6907
- TCGA-G8-6909
- TCGA-G8-6914
- TCGA-GR-7351
- TCGA-GR-7353
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
- No data from TARGET-MDLS is available.
- Slide barcodes (`submitter_id` values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg, should be described as the weight of the portion in mg
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
Data Release 9.0
- GDC Product: Data
- Release Date: October 24, 2017
New updates
- Foundation Medicine Data Release
- This includes controlled-access VCF and MAF files as well as clinical and biospecimen supplements and metadata.
- Original Foundation Medicine supplied data can be found on the Foundation Medicine Project Page.
- Updated RNA-Seq data for TARGET NBL
- Includes new BAM and count files
Updated files for this release are listed here. A complete list of files for DR9.0 is listed here.
Bugs Fixed Since Last Release
None
Known Issues and Workarounds
- miRNA alignments include QC failed reads.
- Samples of TARGET sample_type `Recurrent Blood Derived Cancer - Bone Marrow` are mislabeled as `Recurrent Blood Derived Cancer - Peripheral Blood`. A workaround is to look at the sample barcode, which is -04 for `Recurrent Blood Derived Cancer - Bone Marrow` (e.g. `TARGET-20-PAMYAS-04A-03R`).
- FM-AD clinical and biospecimen supplement files have incorrect data format. They are listed as XLSX, but are in fact TSV files.
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- There are 11 cases in project TCGA-DLBC that are known to have incorrect WXS data in the GDC Data Portal. Impacted cases are listed below. This affects the BAMs and VCFs associated with these cases in the GDC Data Portal. Corrected BAMs can be found in the GDC Legacy Archive. Variants from affected aliquots appear in the protected MAFs with GDC_FILTER=ContEst to indicate a sample contamination problem, but are removed during the generation of the Somatic MAF file. In a later release we will supply corrected BAM, VCF, and MAF files for these cases. In the meantime, we advise you not to use any of the WXS files associated with these cases in the GDC Data Portal. A list of these files can be found here. Download list of affected files.
- TCGA-FF-8062
- TCGA-FM-8000
- TCGA-G8-6324
- TCGA-G8-6325
- TCGA-G8-6326
- TCGA-G8-6906
- TCGA-G8-6907
- TCGA-G8-6909
- TCGA-G8-6914
- TCGA-GR-7351
- TCGA-GR-7353
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
- No data from TARGET-MDLS is available.
- Slide barcodes (`submitter_id` values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg, should be described as the weight of the portion in mg
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
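The sample-count discrepancy between public MAF files from different pipelines, described in the known issues above, can be inspected by comparing the sets of sample barcodes in two files. The sketch below uses the standard MAF column `Tumor_Sample_Barcode` and skips `#` header lines; the function name and the two miniature in-memory MAFs are hypothetical stand-ins for real downloads.

```python
import csv
import io


def samples_in_maf(handle):
    """Collect distinct Tumor_Sample_Barcode values, skipping '#' comment lines."""
    lines = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(lines, delimiter="\t")
    return {row["Tumor_Sample_Barcode"] for row in reader}


# Hypothetical miniature MAFs for two pipelines run on the same project.
maf_a = io.StringIO(
    "#version 2.4\n"
    "Hugo_Symbol\tTumor_Sample_Barcode\n"
    "TP53\tSAMPLE-1\n"
    "KRAS\tSAMPLE-2\n"
)
maf_b = io.StringIO(
    "#version 2.4\n"
    "Hugo_Symbol\tTumor_Sample_Barcode\n"
    "TP53\tSAMPLE-1\n"
)

# Samples present in pipeline A's MAF but absent from pipeline B's.
only_in_a = samples_in_maf(maf_a) - samples_in_maf(maf_b)
print(sorted(only_in_a))
```

A sample that appears in only one set had no PASS variants in the other pipeline, which is the expected cause of the mismatch rather than missing data.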
Data Release 8.0
- GDC Product: Data
- Release Date: August 22, 2017
New updates
- Released updated miRNA quantification files to address double counting of some normalized counts described in DR7.0 release notes.
Updated files for this release are listed here. A complete list of files for DR8.0 is listed here.
Bugs Fixed Since Last Release
None
Known Issues and Workarounds
- TARGET-NBL RNA-Seq files were run as single ended even though they are derived from paired-end data. These files will be rerun through the GDC RNA-Seq pipelines in a later release. Impacted files can be found here. Downstream count files are also affected. Users may access original FASTQ files in the GDC Legacy Archive, which are not impacted by this issue.
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- There are 11 cases in project TCGA-DLBC that are known to have incorrect WXS data in the GDC Data Portal. Impacted cases are listed below. This affects the BAMs and VCFs associated with these cases in the GDC Data Portal. Corrected BAMs can be found in the GDC Legacy Archive. Variants from affected aliquots appear in the protected MAFs with GDC_FILTER=ContEst to indicate a sample contamination problem, but are removed during the generation of the Somatic MAF file. In a later release we will supply corrected BAM, VCF, and MAF files for these cases. In the meantime, we advise you not to use any of the WXS files associated with these cases in the GDC Data Portal. A list of these files can be found here. Download list of affected files.
- TCGA-FF-8062
- TCGA-FM-8000
- TCGA-G8-6324
- TCGA-G8-6325
- TCGA-G8-6326
- TCGA-G8-6906
- TCGA-G8-6907
- TCGA-G8-6909
- TCGA-G8-6914
- TCGA-GR-7351
- TCGA-GR-7353
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification.
- No data from TARGET-MDLS is available.
- Slide barcodes (`submitter_id` values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg, should be described as the weight of the portion in mg
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
Data Release 7.0
- GDC Product: Data
- Release Date: June 29, 2017
New updates
- Updated public Mutation Annotation Format (MAF) files are now available. Updates include filtering to remove variants impacted by OxoG artifacts and those impacted by strand bias.
- Protected MAF files are updated to include flags for OxoG and strand bias.
- Annotated VCFs are updated to include flags for OxoG artifacts and strand bias.
Updated files for this release are listed here. A complete list of files for DR7.0 is listed here.
Bugs Fixed Since Last Release
None
Known Issues and Workarounds
- TARGET-NBL RNA-Seq files were run as single ended even though they are derived from paired-end data. These files will be rerun through the GDC RNA-Seq pipelines in a later release. Impacted files can be found here. Downstream count files are also affected. Users may access original FASTQ files in the GDC Legacy Archive, which are not impacted by this issue.
- Reads that are mapped to multiple genomic locations are double counted in some of the GDC miRNA results. The GDC will release updated files correcting the issue in an upcoming release. The specific impacts are described further below:
- Isoform Expression Quantification files
- Raw read counts are accurate
- Normalized counts are proportionally skewed (r^2=1.0)
- miRNA Expression Quantification files
- A small proportion of miRNA counts are overestimated (mean r^2=0.9999)
- Normalized counts are proportionally skewed (mean r^2=0.9999)
- miRNA BAM files
- no impact
- Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant.
- Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers.
- The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the TARGET Data Matrix. Data that is not present or is not the most up to date includes:
- All microarray data and metadata
- All sequencing analyzed data and metadata
- 1180 of 12063 sequencing runs of raw data
- Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS.
- There are 11 cases in project TCGA-DLBC that are known to have incorrect WXS data in the GDC Data Portal. Impacted cases are listed below. This affects the BAMs and VCFs associated with these cases in the GDC Data Portal. Corrected BAMs can be found in the GDC Legacy Archive. Variants from affected aliquots appear in the protected MAFs with GDC_FILTER=ContEst to indicate a sample contamination problem, but are removed during the generation of the Somatic MAF file. In a later release we will supply corrected BAM, VCF, and MAF files for these cases. In the meantime, we advise you not to use any of the WXS files associated with these cases in the GDC Data Portal. A list of these files can be found here. Download list of affected files.
- TCGA-FF-8062
- TCGA-FM-8000
- TCGA-G8-6324
- TCGA-G8-6325
- TCGA-G8-6326
- TCGA-G8-6906
- TCGA-G8-6907
- TCGA-G8-6909
- TCGA-G8-6914
- TCGA-GR-7351
- TCGA-GR-7353
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow currently fail validation with the Picard ValidateSamFile tool. This is because STAR2 does not record mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows, including expression quantification.
- No data from TARGET-MLDS is available.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
Data Release 6.0
- GDC Product: Data
- Release Date: May 9, 2017
New updates
- GDC updated public Mutation Annotation Format (MAF) files are now available. Updates include leveraging the MC3 variant filtering strategy, which results in more variants being recovered relative to the previous version. A detailed description of the new format can be found here.
- Protected MAFs are updated to include additional variant annotation information
- Some MuTect2 VCFs updated to include dbSNP and COSMIC annotations found in other VCFs
Updated files for this release are listed here.
Bugs Fixed Since Last Release
None
Known Issues and Workarounds
- There are 11 cases in project TCGA-DLBC that are known to have incorrect WXS data in the GDC Data Portal. Impacted cases are listed below. This affects the BAMs and VCFs associated with these cases in the GDC Data Portal. Corrected BAMs can be found in the GDC Legacy Archive. Variants from affected aliquots appear in the protected MAFs with GDC_FILTER=ContEst to indicate a sample contamination problem, but are removed during the generation of the Somatic MAF file. In a later release we will supply corrected BAM, VCF, and MAF files for these cases. In the meantime, we advise you not to use any of the WXS files associated with these cases in the GDC Data Portal. A list of these files can be found here. Download list of affected files.
- TCGA-FF-8062
- TCGA-FM-8000
- TCGA-G8-6324
- TCGA-G8-6325
- TCGA-G8-6326
- TCGA-G8-6906
- TCGA-G8-6907
- TCGA-G8-6909
- TCGA-G8-6914
- TCGA-GR-7351
- TCGA-GR-7353
- Variants found in VCF and MAF files may contain OxoG artifacts, which are produced during library preparation and may result in apparent substitutions of C to A or G to T in certain sequence contexts. In a future release we plan to label potential OxoG artifacts in the MAF files.
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Some validated somatic mutations may not be present in open-access MAF files. Please review the protected MAF files in the GDC Data Portal if you are unable to find your mutation in the open-access files.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow currently fail validation with the Picard ValidateSamFile tool. This is because STAR2 does not record mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows, including expression quantification.
- No data from TARGET-MLDS is available.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
Details are provided in Data Release Manifest
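The advice above to avoid WXS files from the 11 affected TCGA-DLBC cases can be applied to a file list along the following lines. This is a minimal sketch, not a GDC-provided tool: the function names are hypothetical, the aliquot barcodes in the usage note are made up for illustration, and it assumes the case submitter ID is the first three dash-separated fields of a TCGA barcode.

```python
# Case submitter IDs copied from the TCGA-DLBC known issue listed above.
AFFECTED_DLBC_CASES = frozenset({
    "TCGA-FF-8062", "TCGA-FM-8000", "TCGA-G8-6324", "TCGA-G8-6325",
    "TCGA-G8-6326", "TCGA-G8-6906", "TCGA-G8-6907", "TCGA-G8-6909",
    "TCGA-G8-6914", "TCGA-GR-7351", "TCGA-GR-7353",
})


def case_id_from_barcode(barcode):
    """Return the case portion (first three dash-separated fields) of a TCGA barcode."""
    return "-".join(barcode.split("-")[:3])


def drop_affected(barcodes):
    """Keep only barcodes whose case is not on the affected list (hypothetical helper)."""
    return [b for b in barcodes
            if case_id_from_barcode(b) not in AFFECTED_DLBC_CASES]
```

For example, calling `drop_affected` on a list containing the illustrative barcodes `TCGA-FF-8062-01A-11D-2210-10` and `TCGA-AB-1234-01A-11D-0000-10` would keep only the second one, since `TCGA-FF-8062` is on the affected list.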
Data Release 5.0
- GDC Product: Data
- Release Date: March 16, 2017
New updates
- Additional annotations from TCGA DCC are available
- Complete list of updated TCGA files is found here
- Clinical data added for TARGET ALL P1 and P2
- Pathology reports now have submitter IDs as assigned by the BCR
- TARGET Data refresh
- Most recent biospecimen and clinical information from the TARGET DCC. New imported files are listed here
- Updated indexed biospecimen and clinical metadata
- Updated SRA XML files
- Does not include updates to TARGET NBL
Bugs Fixed Since Last Release
- Missing cases from TCGA-LAML were added to Legacy Archive
- Biotab files are now linked to Projects and Cases in Legacy Archive
Known Issues and Workarounds
- Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found here.
- Some validated somatic mutations may not be present in open-access MAF files. When creating open-access MAF files from the protected versions we are extremely conservative in removing potential germline variants. Our approach is to remove all mutations that are present in dbSNP. In a subsequent release we will provide updated open-access MAF files, which preserve variants found in MC3 or a TCGA validation study. Please review the protected MAF files in the GDC Data Portal if you are unable to find your mutation in the open-access files.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow currently fail validation with the Picard ValidateSamFile tool. This is because STAR2 does not record mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows, including expression quantification.
- MAF Column #109 "FILTER" entries are separated by both commas and semi-colons.
- TARGET-AML is undergoing reorganization. Pending reorganization, cases from this project may not contain many clinical, biospecimen, or genomic data files.
- No data from TARGET-MLDS is available.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- Two biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
- Tumor grade property is not populated
- Progression_or_recurrence property is not populated
Details are provided in Data Release Manifest
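Because FILTER entries in the affected MAF files mix commas and semi-colons, a parser that splits on only one of them will miss some values. One possible workaround is to split on both, as in the sketch below. The function name is hypothetical and the example strings in the usage note are illustrative, not taken from a GDC file.

```python
import re


def split_filter(value):
    """Split a MAF FILTER entry on commas and semi-colons, dropping empty pieces."""
    return [tok.strip() for tok in re.split(r"[;,]", value) if tok.strip()]
```

With this helper, an entry such as `a;b,c` yields three separate values, while a single-valued entry such as `PASS` is returned as a one-element list.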
Data Release 4.0
- GDC Product: Data
- Release Date: October 31, 2016
New updates
- TARGET ALL P1 and P2 biospecimen and molecular data are now available in the Legacy Archive. Clinical data will be available in a later release.
- Methylation data from 27k/450k Arrays has been lifted over to hg38 and is now available in the GDC Data Portal
- Public MAF files are now available for VarScan2, MuSE, and SomaticSniper. MuTect2 MAFs were made available in a previous release.
- Updated VCFs and MAF files are available for the MuTect2 pipeline to compensate for WGA-related false positive indels. See additional information on that change here. A listing of replaced files is provided here.
- Added submitter_id for Pathology Reports in Legacy Archive
Bugs Fixed Since Last Release
- None
Known Issues and Workarounds
- Some validated somatic mutations may not be present in open-access MAF files. When creating open-access MAF files from the protected versions we are extremely conservative in removing potential germline variants. Our approach is to remove all mutations that are present in dbSNP. In a subsequent release we will provide updated open-access MAF files, which preserve variants found in COSMIC or a TCGA validation study. Please review the protected MAF files in the GDC Data Portal if you are unable to find your mutation in the open-access files.
- Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy.
- BAM files produced by the GDC RNA-Seq Alignment workflow currently fail validation with the Picard ValidateSamFile tool. This is because STAR2 does not record mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows, including expression quantification.
- MAF Column #109 "FILTER" entries are separated by both commas and semi-colons.
- TARGET-AML is undergoing reorganization. Pending reorganization, cases from this project may not contain many clinical, biospecimen, or genomic data files.
- No data from TARGET-MLDS is available.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- There are 200 cases from TCGA-LAML that do not appear in the Legacy Archive
- Biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
Details are provided in Data Release Manifest
Data Release 3.0
- GDC Product: Data
- Release Date: September 16, 2016
New updates
- CCLE data now available (in the Legacy Archive only)
- BMI calculation is corrected
- Slide is now categorized as a Biospecimen entity
Bugs Fixed Since Last Release
- BMI calculation is corrected
Known Issues and Workarounds
- Insertions called for tumor samples that underwent whole genome amplification (WGA) may be of lower quality. Whether a sample underwent this process can be found in the analyte_type property within analyte and aliquot. The TCGA analyte type can also be identified from the 20th character of the TCGA barcode, where "W" corresponds to WGA.
- BAM files produced by the GDC RNA-Seq Alignment workflow currently fail validation with the Picard ValidateSamFile tool. This is because STAR2 does not record mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows, including expression quantification.
- Public MAFs (those with germline variants removed) are only available for the MuTect2 pipeline. MAFs for other pipelines are forthcoming.
- MAF Column #109 "FILTER" entries are separated by both commas and semi-colons.
- TARGET-AML and TARGET-ALL projects are undergoing reorganization. Pending reorganization, cases from these projects may not contain many clinical, biospecimen, or genomic data files.
- No data from TARGET-PPTP is available.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- There are 200 cases from TCGA-LAML that do not appear in the Legacy Archive
- Biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
Details are provided in Data Release Manifest
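The barcode check described in the WGA known issue above can be sketched as follows. The function names are hypothetical, and the sketch assumes a full aliquot-level TCGA barcode such as TCGA-02-0001-01C-01D-0182-01, in which the 20th character (index 19 in Python) is the analyte code.

```python
def analyte_code(barcode):
    """Return the 20th character of a TCGA barcode, which encodes the analyte type."""
    if len(barcode) < 20:
        raise ValueError("barcode is too short to contain an analyte code: %r" % barcode)
    return barcode[19]  # Python indexes from 0, so index 19 is the 20th character


def is_wga(barcode):
    """True if the analyte code is "W" (whole genome amplification)."""
    return analyte_code(barcode) == "W"
```

For the example barcode above, `analyte_code` returns `D`, so `is_wga` is false. A barcode with `W` in that position would be flagged. Barcodes truncated at the case or sample level carry no analyte code and raise an error rather than being silently treated as non-WGA.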
Data Release 2.0
- GDC Product: Data
- Release Date: August 9, 2016
New updates
- Additional data, previously available via CGHub and the TCGA DCC, is now available in the GDC
- Better linking between files and their associated projects and cases in the Legacy Archive
- MAF files are now available in the GDC Data Portal
Known Issues and Workarounds
- Insertions called for tumor samples that underwent whole genome amplification (WGA) may be of lower quality. These are present in VCF and MAF files produced by the MuTect2 variant calling pipeline. Whether a sample underwent WGA can be found in the analyte_type property within analyte and aliquot. The TCGA analyte type can also be identified from the 20th character of the TCGA barcode, where "W" corresponds to WGA.
- BAM files produced by the GDC RNA-Seq Alignment workflow currently fail validation with the Picard ValidateSamFile tool. This is because STAR2 does not record mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows, including expression quantification.
- Public MAFs (those with germline variants removed) are only available for the MuTect2 pipeline. MAFs for other pipelines are forthcoming.
- MAF Column #109 "FILTER" entries are separated by both commas and semi-colons.
- TARGET-AML and TARGET-ALL projects are undergoing reorganization. Pending reorganization, cases from these projects may not contain many clinical, biospecimen, or genomic data files.
- No data from TARGET-PPTP is available.
- Slide barcodes (submitter_id values for Slide entities in the Legacy Archive) are not available
- SDF Files are not linked to Project or Case in the Legacy Archive
- There are 200 cases from TCGA-LAML that do not appear in the Legacy Archive
- Biotab files are not linked to Project or Case in the Legacy Archive
- SDRF files are not linked to Project or Case in the Legacy Archive
- Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg
Details are provided in Data Release Manifest
Initial Data Release (1.0)
- GDC Product: Data
- Release Date: June 6, 2016
Available Program Data
- The Cancer Genome Atlas (TCGA)
- Therapeutically Applicable Research To Generate Effective Treatments (TARGET)
Available Harmonized Data
- WXS
- Co-cleaned BAM files aligned to GRCh38 using BWA
- mRNA-Seq
- BAM files aligned to GRCh38 using STAR 2-pass strategy
- Expression quantification using HTSeq
- miRNA-Seq
- BAM files aligned to GRCh38 using BWA aln
- Expression quantification using BCCA miRNA Profiling Pipeline*
- Genotyping Array
- CNV segmentation data
Known Issues and Workarounds
- BAM files produced by the GDC RNA-Seq Alignment workflow currently fail validation with the Picard ValidateSamFile tool. This is because STAR2 does not record mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows, including expression quantification.
- All legacy files for TCGA are available in the GDC Legacy Archive, but not always linked back to cases depending on available metadata.
- Public MAFs (those with germline variants removed) are only available for the MuTect2 pipeline. MAFs for other pipelines are forthcoming.
- TARGET-AML and TARGET-ALL projects are undergoing reorganization. Pending reorganization, cases from these projects may not contain many clinical, biospecimen, or genomic data files.
- No data from TARGET-PPTP is available.
- Legacy data not available in harmonized form:
- Annotated VCF files from TARGET, anticipated in future data release
- TCGA data that failed harmonization or QC or have been newly updated in CGHub: ~1.0% of WXS aliquots, ~1.6% of RNA-Seq aliquots
- TARGET data that failed harmonization or QC, have been newly updated in CGHub, or whose project names are undergoing reorganization: ~76% of WXS aliquots, ~49% of RNA-Seq aliquots, ~57% of miRNA-Seq.
- MAF Column #109 "FILTER" entries are separated by both commas and semi-colons.
- MAFs are not yet available for query or search in the GDC Data Portal or API. You may download these files using the following manifests, which can be passed directly to the Data Transfer Tool. Links for the open-access TCGA MAFs are provided below for downloading individual files.
Details are provided in Data Release Manifest
Download Open-access MAF files
- Please note that these links no longer point to files and will be updated in the future.
TCGA.ACC.mutect.somatic.maf.gz
TCGA.BLCA.mutect.somatic.maf.gz
TCGA.BRCA.mutect.somatic.maf.gz
TCGA.CESC.mutect.somatic.maf.gz
TCGA.CHOL.mutect.somatic.maf.gz
TCGA.COAD.mutect.somatic.maf.gz
TCGA.DLBC.mutect.somatic.maf.gz
TCGA.ESCA.mutect.somatic.maf.gz
TCGA.GBM.mutect.somatic.maf.gz
TCGA.HNSC.mutect.somatic.maf.gz
TCGA.KICH.mutect.somatic.maf.gz
TCGA.KIRC.mutect.somatic.maf.gz
TCGA.KIRP.mutect.somatic.maf.gz
TCGA.LAML.mutect.somatic.maf.gz
TCGA.LGG.mutect.somatic.maf.gz
TCGA.LIHC.mutect.somatic.maf.gz
TCGA.LUAD.mutect.somatic.maf.gz
TCGA.LUSC.mutect.somatic.maf.gz
TCGA.MESO.mutect.somatic.maf.gz
TCGA.OV.mutect.somatic.maf.gz
TCGA.PAAD.mutect.somatic.maf.gz
TCGA.PCPG.mutect.somatic.maf.gz
TCGA.PRAD.mutect.somatic.maf.gz
TCGA.READ.mutect.somatic.maf.gz
TCGA.SARC.mutect.somatic.maf.gz
TCGA.SKCM.mutect.somatic.maf.gz
TCGA.STAD.mutect.somatic.maf.gz
TCGA.TGCT.mutect.somatic.maf.gz
TCGA.THCA.mutect.somatic.maf.gz
TCGA.THYM.mutect.somatic.maf.gz
TCGA.UCEC.mutect.somatic.maf.gz
TCGA.UCS.mutect.somatic.maf.gz
TCGA.UVM.mutect.somatic.maf.gz