Mutation Annotation Format (MAF) - Legacy TCGA Specification
This definition was taken from the previously public wiki hosted by TCGA and reflects the MAF format that was available during the active period of the TCGA project.
Document Information
The spec has been reverted to the June 26th version (version 20). Additional changes are the removal of the "under construction" banner, changing all text to black, and fixing a typo in the link to the MAF 2.2 specification.
Specification for Mutation Annotation Format
Version 2.4.1
June 20, 2014
Contents
-
1 Current version changes
-
2 About MAF specifications
-
2.1 Definition of open access MAF data
-
2.2 Somatic MAF vs. Protected MAF
-
-
3 MAF file fields
- 3.1 Table 1 - File column headers
-
4 MAF file checks
-
5 MAF naming convention
-
6 Previous specification versions
Current version changes
This current revision is version 2.4.1 of the Mutation Annotation Format (MAF) specification.
The following items in the specification were added or modified in version 2.4.1 from version 2.4:
-
Header for MAF file is "#version 2.4.1"
-
"Somatic" and "None" are the only acceptable values for "Mutation_Status" for a somatic.MAF (named .somatic.maf). When Mutation_Status is None, Validation_Status must be Invalid.
-
Centers need to make sure that Mutations_Status "None" doesn't include germline mutation.
-
For a somatic MAF, following rules should be satisfied:
SOMATIC = (A AND (B OR C OR D)) OR (E AND F)
A: Mutation_Status == "Somatic"
B: Validation_Status == "Valid"
C. Verification_Status == "Verified"
D. Variant_Classification is not {Intron, 5'UTR, 3'UTR, 5'Flank, 3'Flank, IGR}, which implies that Variant_Classification can only be \{Frame_Shift_Del, Frame_Shift_Ins, In_Frame_Del, In_Frame_Ins, Missense_Mutation, Nonsense_Mutation, Silent, Splice_Site, Translation_Start_Site, Nonstop_Mutation, RNA, Targeted_Region}.
E: Mutations_status == "None"
F: Validation_status == "Invalid" -
Extra validation rules: If Validation_Status == Valid or Invalid, then Validation_Method != none (case insensitive).
About MAF specifications
Mutation annotation files should be transferred to the DCC. Those files should be formatted using the mutation annotation format (MAF) that is described below. File naming convention is also below.
Following categories of somatic mutations are reported in MAF files:
-
Missense and nonsense
-
Splice site, defined as SNP within 2 bp of the splice junction
-
Silent mutations
-
Indels that overlap the coding region or splice site of a gene or the targeted region of a genetic element of interest
-
Frameshift mutations
-
Mutations in regulatory regions
Definition of open access MAF data
A large proportion of MAFs are submitted as discovery data and sites labeled as somatic in these files overlap with known germline variants. In order to minimize germline contamination in putative (unvalidated) somatic calls, certain filtering criteria have been imposed. Based on current policy, open access MAF data should:
-
include all validated somatic mutation calls
-
include all unvalidated somatic mutation calls that overlap with a coding region or splice site
-
exclude all other types of mutation calls (i.e., non-somatic calls (validated or not), unvalidated somatic calls that are not in coding region or splice sites, and dbSNP sites that are not annotated as somatic in dbSNP, COSMIC or OMIM)
Somatic MAF vs. Protected MAF
Centers will submit to the DCC MAF archives that contain Somatic MAF (named.somatic.maf) for open access data and an all-inclusive Protected MAF (named.protected.maf) that does not filter any data out and represents the original super-set of mutation calls. The files will be formatted using the Mutation Annotation Format (MAF).
The following table lists some of the critical attributes of somatic and protected MAF files and provides a comparison.
Attribute | Somatic MAF | Protected MAF |
---|---|---|
File naming | Somatic MAFs should be named as*.somatic.mafand cannot contain 'germ' or 'protected' in file name. | Protected MAFs should be named as*.protected.mafand should not contain 'somatic' in the file name. |
Mutation category | Somatic MAFs can only contain entries whereMutation_Statusis "Somatic". If any other value is assigned to the field, the archive will fail. Experimentally validated or unvalidated (see next row) somatic mutations can be included in the file. | There is no such restriction for protected MAF. The file should contain all mutation calls including those from which .somatic.maf is derived. |
Filtering criteria | In order to minimize germline contamination, somatic MAFs can contain unvalidated somatic mutations only from coding regions and splice sites, which implies: | There are no such constraints for mutations in protected MAF. |
If Validation_Status is"Unknown",V a riant_Classification cannot be 3'UTR, 3'Flank, 5'UTR, 5'Flank, IGR, or Intron.Variant_Classificationcan only be \{Frame_Shift_Del, Frame_Shift_Ins, In_Frame_Del, In_Frame_Ins, Missense_Mutation, Nonsense_Mutation, Silent, Splice_Site, Translation_Start_Site, Nonstop_Mutation, RNA, Targeted_Region, De_novo_Start_InFrame, De_novo_Start_OutOfFrame\}. | ||
There is no such constraint for experimentally validated (Validation_Statusis "Valid") somatic mutations. | ||
dbSNP sites that are not annotated as somatic in dbSNP, COSMIC or OMIM must be removed from somatic MAFs. | ||
Access level | These files are deployed as open access data. | These files are deployed as protected data. |
MAF file fields
The format of a MAF file is tab-delimited columns. Those columns are described in Table 1 and are required in every MAF file. The order of the columns will be validated by the DCC. Column headers and values are case sensitive where specified. Columns may allow null values (i.e._ blank cells) and/or have enumerated values. The validator looks for a header stating the version of the specification to validate against (e.g. #version 2.4). If not, validation fails. Any columns that come after the columns described in Table 1 are optional. Optional columns are not validated by the DCC and can be in any order.
Table 1 - File column headers
Index | MAF Column Header | Description of Values | Example | Case Sensitive | Null | Enumerated |
---|---|---|---|---|---|---|
1 | Hugo_Symbol | HUGO symbol for the gene (HUGO symbols are always in all caps). If no gene exists within 3kb enter "Unknown". | EGFR | Yes | No | Set or Unknown |
Source: http://genenames.org | ||||||
2 | Entrez_Gene_Id | Entrez gene ID (an integer). If no gene exists within 3kb enter "0". | 1956 | No | No | Set |
Source: http://ncbi.nlm.nih.gov/sites/entrez?db=gene | ||||||
3 | Center | Genome sequencing center reporting the variant. If multiple institutions report the same mutation separate list using semicolons. Non-GSC centers will be also supported if center name is an accepted center name. | hgsc.bcm.edu;genome.wustl.edu | Yes | No | Set |
4 | NCBI_Build | Any TGCA accepted genome identifier. Can be string, integer or a float. | hg18, hg19, GRCh37, GRCh37-lite, 36, 36.1, 37, | No | No | Set and Enumerated. |
5 | Chromosome | Chromosome number without "chr" prefix that contains the gene. | X, Y, M, 1, 2, etc. | Yes | No | Set |
6 | Start_Position | Lowest numeric position of the reported variant on the genomic reference sequence. Mutation start coordinate (1-based coordinate system). | 999 | No | No | Set |
7 | End_Position | Highest numeric genomic position of the reported variant on the genomic reference sequence. Mutation end coordinate (inclusive, 1-based coordinate system). | 1000 | No | No | Set |
8 | Strand | Genomic strand of the reported allele. Variants should always be reported on the positive genomic strand. (Currently, only the positive strand is an accepted value). | + | No | No | + |
9 | Variant_Classification | Translational effect of variant allele. | Missense_Mutation | Yes | No | Frame_Shift_Del, Frame_Shift_Ins, In_Frame_Del, In_Frame_Ins, Missense_Mutation, Nonsense_Mutation, Silent, Splice_Site, Translation_Start_Site, Nonstop_Mutation, 3'UTR, 3'Flank, 5'UTR, 5'Flank, IGR (See Notes Section #1) , Intron, RNA, Targeted_Region |
10 | Variant_Type | Type of mutation. TNP (tri-nucleotide polymorphism) is analogous to DNP but for 3 consecutive nucleotides. ONP (oligo-nucleotide polymorphism) is analogous to TNP but for consecutive runs of 4 or more. | INS | Yes | No | SNP, DNP, TNP, ONP, INS, DEL, or Consolidated (See Notes Section #2) ) |
11 | Reference_Allele | The plus strand reference allele at this position. Include the sequence deleted for a deletion, or "-" for an insertion. | A | Yes | No | A,C,G,T and/or - |
12 | Tumor_Seq_Allele1 | Primary data genotype. Tumor sequencing (discovery) allele 1. " -" for a deletion represent a variant. "-" for an insertion represents wild-type allele. Novel inserted sequence for insertion should not include flanking reference bases. | C | Yes | No | A,C,G,T and/or - |
13 | Tumor_Seq_Allele2 | Primary data genotype. Tumor sequencing (discovery) allele 2. " -" for a deletion represents a variant. "-" for an insertion represents wild-type allele. Novel inserted sequence for insertion should not include flanking reference bases. | G | Yes | No | A,C,G,T and/or - |
14 | dbSNP_RS | Latest dbSNP rs ID (dbSNP_ID) or "novel" if there is no dbSNP record. source: http://ncbi.nlm.nih.gov/projects/SNP/ | rs12345 | Yes | Yes | Set or "novel" |
15 | dbSNP_Val_Status | dbSNP validation status. Semicolon- separated list of validation statuses. | by2Hit2Allele;byCluster | No | Yes | by1000genomes;by2Hit2Allele; byCluster; byFrequency; byHapMap; byOtherPop; bySubmitter; alternate_allele (See Notes Section #3) Note that "none" will no longer be an acceptable value. |
16 | Tumor_Sample_Barcode | BCR aliquot barcode for the tumor sample including the two additional fields indicating plate and well position. i.e. TCGA-SiteID-PatientID-SampleID-PortionID-PlateID-CenterID. The full TCGA Aliquot ID. | TCGA-02-0021-01A-01D-0002-04 | Yes | No | Set |
17 | Matched_Norm_Sample_Barcode | BCR aliquot barcode for the matched normal sample including the two additional fields indicating plate and well position. i.e. TCGA-SiteID-PatientID-SampleID-PortionID-PlateID-CenterID. The full TCGA Aliquot ID; e.g. TCGA-02-0021-10A-01D-0002-04 (compare portion ID '10A' normal sample, to '01A' tumor sample). | TCGA-02-0021-10A-01D-0002-04 | Yes | No | Set |
18 | Match_Norm_Seq_Allele1 | Primary data. Matched normal sequencing allele 1. "-" for deletions; novel inserted sequence for INS not including flanking reference bases. | T | Yes | Yes | A,C,G,T and/or - |
19 | Match_Norm_Seq_Allele2 | Primary data. Matched normal sequencing allele 2. "-" for deletions; novel inserted sequence for INS not including flanking reference bases. | ACGT | Yes | Yes | A,C,G,T and/or - |
20 | Tumor_Validation_Allele1 | Secondary data from orthogonal technology. Tumor genotyping (validation) for allele 1. "-" for deletions; novel inserted sequence for INS not including flanking reference bases. | - | Yes | Yes | A,C,G,T and/or - |
21 | Tumor_Validation_Allele2 | Secondary data from orthogonal technology. Tumor genotyping (validation) for allele 2. "-" for deletions; novel inserted sequence for INS not including flanking reference bases. | A | Yes | Yes | A,C,G,T and/or - |
22 | Match_Norm_Validation_Allele1 | Secondary data from orthogonal technology. Matched normal genotyping (validation) for allele 1. "-" for deletions; novel inserted sequence for INS not including flanking reference bases. | C | Yes | Yes | A,C,G,T and/or - |
23 | Match_Norm_Validation_Allele2 | Secondary data from orthogonal technology. Matched normal genotyping (validation) for allele 2. "-" for deletions; novel inserted sequence for INS not including flanking reference bases. | G | Yes | Yes | A,C,G,T and/or - |
24 | Verification_Status (See Notes Section #4) | Second pass results from independent attempt using same methods as primary data source. Generally reserved for 3730 Sanger Sequencing. | Verified | Yes | Yes | Verified, Unknown |
25 | Validation_Status (See Notes Section #5) | Second pass results from orthogonal technology. | Valid | Yes | No | Untested, Inconclusive, Valid, Invaild |
26 | Mutation_Status | Updated to reflect validation or verification status and to be in agreement with the VCF VLS field. The values allowed in this field are constrained by the value in the Validation_Status field. | Somatic | Yes | No | Validation_Status values: Untested, Inconslusive, Valid, Invalid - Allowed Mutations_Status Values for Untested and Inconclusive: (See Notes Seciton #6) None, Germline, Somatic, LOH, Post-transcriptional modification Unknown Allowed Mutation_status Values for Valid: (See Notes Seciton #6) Germline, Somatic, LOH, Post-transcriptional modification, Unknown - Allowed Mutations_Status Values for Invalid: (See Notes Seciton #6) none |
27 | Sequencing_Phase | TCGA sequencing phase. Phase should change under any circumstance that the targets under consideration change. | Phase_I | No | Yes | No |
28 | Sequence_Source | Molecular assay type used to produce the analytes used for sequencing. Allowed values are a subset of the SRA 1.5 library_strategy field values. This subset matches those used at CGHub. | WGS;WXS | Yes | No | Common TCGA values: WGS, WGA, WXS, RNA-Seq, miRNA-Seq, Bisulfite-Seq, VALIDATION, Other Other allowed values (per SRA 1.5) ncRNA-Seq, WCS, CLONE, POOLCLONE, AMPLICON, CLONEEND, FINISHING, ChIP-Seq, MNase-Seq, DNase-Hypersensitivity, EST, FL-cDNA, CTS, MRE-Seq, MeDIP-Seq, MBD-Seq, Tn-Seq, FAIRE-seq, SELEX, RIP-Seq, ChIA-PET |
29 | Validation_Method | The assay platforms used for the validation call. Examples: Sanger_PCR_WGA, Sanger_PCR_gDNA, 454_PCR_WGA, 454_PCR_gDNA; separate multiple entries using semicolons. | Sanger_PCR_WGA;Sanger_PCR_gDNA | No | NO. If Validation_Status = Untested then "none" If Validation_Status = Valid or Invalid, then not "none" (case insensitive) | No |
30 | Score | Not in use. | NA | No | Yes | No |
31 | BAM_File | Not in use. | NA | No | Yes | No |
32 | Sequencer | Instrument used to produce primary data. Separate multiple entries using semicolons. | Illumina GAIIx;SOLID | Yes | No | Illumina GAIIx, Illumina HiSeq, SOLID, 454, ABI 3730xl, Ion Torrent PGM, Ion Torrent Proton, PacBio RS, Illumina MiSeq, Illumina HiSeq 2500, 454 GS FLX Titanium, AB SOLiD 4 System |
33 | Tumor_Sample_UUID | BCR aliquot UUID for tumor sample | 550e8400-e29b-41d4-a716-446655440000 | Yes | No | |
34 | Matched_Norm_Sample_UUID | BCR aliquot UUID for matched normal | 567e8487-e29b-32d4-a716-446655443246 | Yes | No |
Notes
1 Intergenic Region.
2 Consolidationd is used to indicate a site that was initially reported as a variant but subsequently removed from further analysis because it was consolidated into a new variant. For example, a SNP variant incorporated into a TNP variant.
3 Used when the discovered varieant differs from that of dbSNP.
4 These MAF headers describe the technology that was used to confirm a mutation, whether the same technology ("verification") or a different technology ("validation") is used to prove that a variant is germline or a somatic mutation.
5 These MAF headers describe the technology that was used toconfirm a mutation, whether the same technology (verification) or a different technology (validation) is used to prove that a variant is germline or a somatic mutation.
6 Explanation of some Validation Status-Mutation Status combinations.
Validation Status | Mutation Status | Explanation |
---|---|---|
Valid | Unknown | a valid variant with unknown somatic status due to lack of data from matched normal tissue. |
Invalid | None | validation attempted, tumor and normal are homozygous reference (formerly described as Wildtype) |
Inconclusive | Unknown | validation failed, neither the genotype nor its somatic status is certain due to lack of data from matched normal tissue |
Inconclusive | None | validation failed, tumor genotype appears to be homozygous reference |
Important Criteria
**Index column indicates the order in which the columns are expected**. **All
headers are case sensitive.** The Case Sensitive column specifies which values
are case sensitive. The Null column indicates which MAF columns are allowed to
have null values. The Enumerated column indicates which MAF columns have
specified values: an Enumerated value of "No" indicates that there are no
specified values for that column; other values indicate the specific values
listed allowed; a value of "Set" indicates that the MAF column values come from
a specified set of known values (*e.g.*HUGO gene symbols).
MAF file checks
The DCC Archive Validator checks the integrity of a MAF file. Validation will fail if any of the below are not true for a MAF file:
-
Column header text (including case) and order must match specification (Table 1) exactly
-
Values under column headers listed in the specification (Table 1) as not null must have values
-
Values that are specified in Table 1 as Case Sensitive must be
-
If column headers are listed in the specification as having enumerated values (i.e. a "Yes" in the "Enumerated" column), then the values under those column must come from the enumerated values listed under "Enumerated"
-
If column headers are listed in the specification as having set values (i.e. a "Set" in the "Enumerated" column), then the values under those column must come from the enumerated values of that domain (e.g. HUGO gene symbols)
-
All Allele-based columns must contain- (deletion), or a string composed of the following capitalized letters: A, T, G, C
-
IfValidation_Status== "Untested" thenTumor_Validation_Allele1,Tumor_Validation_Allele2,Match_Norm_Validation_Allele1,Match_Norm_Validation_Allele2can be null (depending onValidation_Status)
- IfValidation_Status== "Inconclusive" thenTumor_Validation_Allele1,Tumor_Validation_Allele2,Match_Norm_Validation_Allele1,Match_Norm_Validation_Allele2can be null (depending onValidation_Status)
-
If Validation_Status == Valid, then Validated_Tumor_Allele1 and Validated_Tumor_Allele2must be populated (one of A, C, G, T, and -)
-
If Validation_Status == "Valid" then Tumor_Validation_Allele1, Tumor_Validation_Allele2, Match_Norm_Validation_Allele1, Match_Norm_Validation_Allele2 cannot be null
-
IfValidation_Status== "Invalid" thenTumor_Validation_Allele1,Tumor_Validation_Allele2,Match_Norm_Validation_Allele1,Match_Norm_Validation_Allele2cannot be null AND Tumor_Validation_Allelle1 == Match_Norm_Validation_Allele1AND Tumor_Validation_Allelle2 == Match_Norm_Validation_Allele2 (Added as a replacement for 8a as a result of breakdown)
-
-
Check allele values against Mutation_Status:
Check allele values against Validation_status:-
If Mutation_Status == "Germline" and Validation_Status == "Valid", then Tumor_Validation_Allele1 == Match_Norm_Validation_Allele1 and Tumor_Validation_Allele2 == Match_Norm_Validation_Allele2
-
If Mutation_Status == "Somatic" and Validation_Status == "Valid", then Match_Norm_Validation_Allele1 == Match_Norm_Validation_Allele2 == Reference_Allele and (Tumor_Validation_Allele1 or Tumor_Validation_Allele2) != Reference_Allele
-
If Mutation_Status == "LOH" and Validation_Status=="Valid", then Tumor_Validation_Allele1 == Tumor_Validation_Allele2 and Match_Norm_Validation_Allele1 != Match_Norm_Validation_Allele2 and Tumor_Validation_Allele1 == (Match_Norm_Validation_Allele1 or Match_Norm_Validation_Allele2)
-
-
Check that Start_position \<= End_position
-
Check for the Start_position and End_position against Variant_Type:
-
If Variant_Type == "INS", then (End_position - Start_position + 1 == length (Reference_Allele) or End_position - Start_position == 1) and length(Reference_Allele) \<= length(Tumor_Seq_Allele1 and Tumor_Seq_Allele2)
-
If Variant_Type == "DEL", then End_position - Start_position + 1 == length (Reference_Allele), then length(Reference_Allele) >= length(Tumor_Seq_Allele1 and Tumor_Seq_Allele2)
-
If Variant_Type == "SNP", then length(Reference_Allele and Tumor_Seq_Allele1 and Tumor_Seq_Allele2) == 1 and (Reference_Allele and Tumor_Seq_Allele1 and Tumor_Seq_Allele2) != "-"
-
If Variant_Type == "DNP", then length(Reference_Allele and Tumor_Seq_Allele1 and Tumor_Seq_Allele2) == 2 and (Reference_Allele and Tumor_Seq_Allele1 and Tumor_Seq_Allele2) !contain "-"
-
If Variant_Type == "TNP", then length(Reference_Allele and Tumor_Seq_Allele1 and Tumor_Seq_Allele2) == 3 and (Reference_Allele and Tumor_Seq_Allele1 and Tumor_Seq_Allele2) !contain "-"
-
If Variant_Type == "ONP", then length(Reference_Allele) == length(Tumor_Seq_Allele1) == length(Tumor_Seq_Allele2) > 3 and (Reference_Allele and Tumor_Seq_Allele1 and Tumor_Seq_Allele2) !contain "-"
-
-
Validation for UUID-based files:
-
Column #33 must be Tumor_Sample_UUID containing UUID of the BCR aliquot for tumor sample
-
Column #34 must be Matched_Norm_Sample_UUID containing UUID of the BCR aliquot for matched normal sample
-
Metadata represented by Tumor_Sample_Barcode and Matched_Norm_Sample_Barcode should correspond to the UUIDs assigned to Tumor_Sample_UUID and Matched_Norm_Sample_UUID respectively
-
-
If Validation_Status == "Valid" or "Invalid", then Validation_Method != "none" (case insensitive)
MAF naming convention
In archives uploaded to the DCC, the MAF file name should relate to the containing archive name in the following way:
If the archive has the name
\<domain\>_\<disease_abbrev\>.\<platform\>.Level_2.\<serial_index\>.\<revision\>.0.tar.gz
then a somatic MAF file with the archive should be named according to
\<domain\>_\<disease_abbrev\>.\<platform\>.Level_2.\<serial_index\>[.\<optional_tag\>].somatic.maf
and a protected MAF with the archive should be named according to
\<domain\>_\<disease_abbrev\>.\<platform\>.Level_2.\<serial_index\>[.\<optional_tag\>].protected.maf
The \<optional_tag> may consist of alphanumeric characters, dash, and underscore; no spaces or periods; or it may be left out altogether. The purpose of the optional tag is to impart some brief annotation.
Example
For the archive
genome.wustl.edu_OV.IlluminaGA_DNASeq.Level_2.7.6.0.tar.gz
the following are examples of valid maf names
genome.wustl.edu_OV.IlluminaGA_DNASeq.Level_2.7.somatic.maf
genome.wustl.edu_OV.IlluminaGA_DNASeq.Level_2.7.protected.maf