GDC VCF Format
Introduction
The GDC DNA-Seq somatic variant-calling pipeline compares a set of matched tumor/normal alignments and produces a VCF file. VCF files report the somatic variants that were detected by each of the four variant callers. Four raw VCFs (Data Type: Raw Simple Somatic Mutation) are produced for each tumor/normal pair of BAMs. Four additional annotated VCFs (Data Type: Annotated Somatic Mutation) are produced by adding biologically relevant information about each variant.
The GDC VCF file format follows standards of the Variant Call Format (VCF) Version 4.1 Specification. Raw Simple Somatic Mutation VCF files are unannotated, whereas Annotated Somatic Mutation VCF files include extensive, consistent, and pipeline-agnostic annotation of somatic variants.
VCF file structure
Metadata header
A VCF file starts with lines of metadata that begin with ##
. Some key components of this section include:
- gdcWorkflow: Information on the pipelines that were used by the GDC to generate the VCF file. Annotated VCF files contain two gdcWorkflow lines, one that reports the variant calling process and one that reports the variant annotation process.
- INDIVIDUAL: information about the study participant (
case
), including:- NAME: Submitter ID (barcode) associated with the participant
- ID: GDC case UUID
- SAMPLE: sample information, including:
- ID: NORMAL or TUMOR
- NAME: Submitter ID (barcode) of the aliquot
- ALIQUOT_ID: GDC aliquot UUID
- BAM_ID: The UUID for the BAM file used to produce the VCF
- INFO: Format of additional information fields
- NOTE: GDC Annotated VCFs may contain multiple INFO lines. The last INFO line contains information about annotation fields generated by the Somatic Annotation Workflow (see GDC INFO Fields below).
- FILTER: Description of filters that have been applied to the variants
- FORMAT: Description of genotype fields
- reference: The reference genome used to generate the VCF file (GRCh38.d1.vd1.fa)
- contig: A list of IDs for the contiguous DNA sequences that appear in the reference genome used to produce VCF files
- NOTE: Annotated VCFs include contig information for autosomes, sex chromosomes, and mitochondrial DNA. Unplaced, unlocalized, human decoy, and viral genome sequences are not included.
- VEP: the VEP command used by the Somatic Annotation Workflow to generate the annotated VCF file.
Column Header Line
Each variant is represented by a row in the VCF file. Below each of the columns are described:
- CHROM: The chromosome on which the variant is located
- POS: The position of the variant on the chromosome. Refers to the first position if the variant includes more than one base
- ID: A unique identifier for the variant; usually a dbSNP rs number if applicable
- REF: The base(s) exhibited by the reference genome at the variant's position
- ALT: The alternate allele(s), comma-separated if there are more than one
- QUAL: Not populated
- FILTER: The names of the filters that have flagged this variant. The types of filters used will depend on the variant caller used.
- INFO: Additional information about the variant. This includes the annotation applied by the VEP.
- FORMAT: The format of the sample genotype data in the next two columns. This includes descriptions of the colon-separated values.
- NORMAL: Colon-separated values that describe the normal sample
- TUMOR: Colon-separated values that describe the tumor sample
See Variant Call Format (VCF) Version 4.1 Specification for details.
GDC INFO fields
The following variant annotation fields are currently included in Annotated Somatic Mutation VCF files. Please refer to the DNA-Seq Analysis Pipeline documentation for details on how this information is generated. VEP Documentation provides additional information about some of these fields.
Field | Description |
---|---|
Allele | The variant allele used to calculate the consequence |
Consequence | Consequence type of this variant |
IMPACT | The impact modifier for the consequence type |
SYMBOL | The HUGO gene symbol |
Gene | Ensembl stable ID of the affected gene |
Feature_type | Type of feature. Currently one of Transcript, RegulatoryFeature, MotifFeature. |
Feature | Ensembl stable ID of the feature |
BIOTYPE | The type of transcript or regulatory feature (e.g. protein_coding) |
EXON | Exon number (out of total exons) |
INTRON | Intron number (out of total introns) |
HGVSc | The HGVS coding sequence name |
HGVSp | The HGVS protein sequence name |
cDNA_position | Relative position of base pair in cDNA sequence |
CDS_position | Relative position of base pair in coding sequence |
Protein_position | Relative position of the affected amino acid in protein |
Amino_acids | Change in amino acids (only given if the variant affects the protein-coding sequence) |
Codon | The affected codons with the variant base in upper case |
Existing_variation | Known identifier of existing variant; usually a dbSNP rs number if applicable |
ALLELE_NUM | Allele number from input; 0 is reference, 1 is first alternate, etc. |
DISTANCE | Shortest distance from variant to transcript |
STRAND | The DNA strand (1 or -1) on which the transcript/feature lies |
FLAGS | Transcript quality flags |
VARIANT_CLASS | Sequence Ontology variant class |
SYMBOL_SOURCE | The source of the gene symbol |
HGNC_ID | HGNC gene ID |
CANONICAL | A flag indicating if the transcript is denoted as the canonical transcript for this gene |
TSL | Transcript support level |
APPRIS | APPRIS isoform annotation |
CCDS | The CCDS identifer for this transcript, where applicable |
ENSP | The Ensembl protein identifier of the affected transcript |
SWISSPROT | UniProtKB/Swiss-Prot identifier of protein product |
TREMBL | UniProtKB/TrEMBL identifier of protein product |
UNIPARC | UniParc identifier of protein product |
RefSeq | RefSeq gene ID |
GENE_PHENO | Indicates if the gene is associated with a phenotype, disease or trait |
SIFT | The SIFT prediction and/or score, with both given as prediction (score) |
PolyPhen | The PolyPhen prediction and/or score |
DOMAINS | The source and identifier of any overlapping protein domains |
HGVS_OFFSET | Indicates by how many bases the HGVS notations for this variant have been shifted |
GMAF | Non-reference allele and frequency of existing variant in 1000 Genomes |
AFR_MAF | Non-reference allele and frequency of existing variant in 1000 Genomes combined African population |
AMR_MAF | Non-reference allele and frequency of existing variant in 1000 Genomes combined American population |
EAS_MAF | Non-reference allele and frequency of existing variant in 1000 Genomes combined East Asian population |
EUR_MAF | Non-reference allele and frequency of existing variant in 1000 Genomes combined European population |
SAS_MAF | Non-reference allele and frequency of existing variant in 1000 Genomes combined South Asian population |
AA_MAF | Non-reference allele and frequency of existing variant in NHLBI-ESP African American population |
EA_MAF | Non-reference allele and frequency of existing variant in NHLBI-ESP European American population |
ExAC_MAF | Frequency of existing variant in ExAC combined population |
ExAC_Adj_MAF | Adjusted frequency of existing variant in ExAC combined population |
ExAC_AFR_MAF | Frequency of existing variant in ExAC African/American population |
ExAC_AMR_MAF | Frequency of existing variant in ExAC American population |
ExAC_EAS_MAF | Frequency of existing variant in ExAC East Asian population |
ExAC_FIN_MAF | Frequency of existing variant in ExAC Finnish population |
ExAC_NFE_MAF | Frequency of existing variant in ExAC Non-Finnish European population |
ExAC_OTH_MAF | Frequency of existing variant in ExAC combined other combined populations |
ExAC_SAS_MAF | Frequency of existing variant in ExAC South Asian population |
CLIN_SIG | Clinical significance of variant from dbSNP |
SOMATIC | Somatic status of existing variant(s) |
PHENO | Indicates if existing variant is associated with a phenotype, disease or trait |
PUBMED | Pubmed ID(s) of publications that cite existing variant |
MOTIF_NAME | The source and identifier of a transcription factor binding profile aligned at this position |
MOTIF_POS | The relative position of the variation in the aligned TFBP |
HIGH_INF_POS | A flag indicating if the variant falls in a high information position of a transcription factor binding profile (TFBP) |
MOTIF_SCORE_CHANGE | The difference in motif score of the reference and variant sequences for the TFBP |
ENTREZ | Entrez ID |
EVIDENCE | Evidence that the variant exists |