Mutation Annotation Format (MAF) - Legacy TCGA Specification

This definition was taken from the previously public wiki hosted by TCGA and reflects the MAF format that was available during the active period of the TCGA project.

Document Information

The spec has been reverted to the June 26th version (version 20). Additional changes are the removal of the "under construction" banner, changing all text to black, and fixing a typo in the link to the MAF 2.2 specification.

Specification for Mutation Annotation Format
Version 2.4.1
June 20, 2014

Contents

1 Current version changes
2 About MAF specifications
- 2.1 Definition of open access MAF data
- 2.2 Somatic MAF vs. Protected MAF
3 MAF file fields
- 3.1 Table 1 - File column headers
4 MAF file checks
5 MAF naming convention
6 Previous specification versions

Current version changes

This current revision is version 2.4.1 of the Mutation Annotation Format (MAF) specification.

The following items in the specification were added or modified in version 2.4.1 from version 2.4:

Header for MAF file is "#version 2.4.1"
"Somatic" and "None" are the only acceptable values for "Mutation_Status" for a somatic.MAF (named .somatic.maf). When Mutation_Status is None, Validation_Status must be Invalid.
Centers need to make sure that Mutation_Status "None" does not include germline mutations.
For a somatic MAF, the following rule should be satisfied:
SOMATIC = (A AND (B OR C OR D)) OR (E AND F)
A: Mutation_Status == "Somatic"
B: Validation_Status == "Valid"
C: Verification_Status == "Verified"
D: Variant_Classification is not {Intron, 5'UTR, 3'UTR, 5'Flank, 3'Flank, IGR}, which implies that Variant_Classification can only be {Frame_Shift_Del, Frame_Shift_Ins, In_Frame_Del, In_Frame_Ins, Missense_Mutation, Nonsense_Mutation, Silent, Splice_Site, Translation_Start_Site, Nonstop_Mutation, RNA, Targeted_Region}.
E: Mutation_Status == "None"
F: Validation_Status == "Invalid"
Extra validation rules: If Validation_Status == Valid or Invalid, then Validation_Method != none (case insensitive).
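
The SOMATIC rule above can be sketched as a small predicate. This is only an illustration, not the DCC validator: the function name `passes_somatic_rule` and the choice of a plain dict keyed by MAF column header are assumptions made for the example.

```python
# Variant classifications excluded by item D of the rule.
NON_CODING = {"Intron", "5'UTR", "3'UTR", "5'Flank", "3'Flank", "IGR"}


def passes_somatic_rule(row: dict) -> bool:
    """Evaluate SOMATIC = (A AND (B OR C OR D)) OR (E AND F) for one MAF row.

    `row` is assumed to map MAF column headers to string values.
    """
    a = row["Mutation_Status"] == "Somatic"
    b = row["Validation_Status"] == "Valid"
    c = row["Verification_Status"] == "Verified"
    d = row["Variant_Classification"] not in NON_CODING
    e = row["Mutation_Status"] == "None"
    f = row["Validation_Status"] == "Invalid"
    return (a and (b or c or d)) or (e and f)
```

Under this sketch, an untested, unverified somatic call in an intron is rejected, while the same call classified as Missense_Mutation is accepted.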

About MAF specifications

Mutation annotation files should be transferred to the DCC. Those files should be formatted using the mutation annotation format (MAF) that is described below. The file naming convention is also described below.

The following categories of somatic mutations are reported in MAF files:

Missense and nonsense
Splice site, defined as SNP within 2 bp of the splice junction
Silent mutations
Indels that overlap the coding region or splice site of a gene or the targeted region of a genetic element of interest
Frameshift mutations
Mutations in regulatory regions

Definition of open access MAF data

A large proportion of MAFs are submitted as discovery data and sites labeled as somatic in these files overlap with known germline variants. In order to minimize germline contamination in putative (unvalidated) somatic calls, certain filtering criteria have been imposed. Based on current policy, open access MAF data should:

include all validated somatic mutation calls
include all unvalidated somatic mutation calls that overlap with a coding region or splice site
exclude all other types of mutation calls (i.e., non-somatic calls (validated or not), unvalidated somatic calls that are not in coding region or splice sites, and dbSNP sites that are not annotated as somatic in dbSNP, COSMIC or OMIM)

Somatic MAF vs. Protected MAF

Centers will submit to the DCC MAF archives that contain a Somatic MAF (named .somatic.maf) for open access data and an all-inclusive Protected MAF (named .protected.maf) that does not filter any data out and represents the original super-set of mutation calls. The files will be formatted using the Mutation Annotation Format (MAF).

The following table lists some of the critical attributes of somatic and protected MAF files and provides a comparison.

Attribute	Somatic MAF	Protected MAF
File naming	Somatic MAFs should be named as *.somatic.maf and cannot contain 'germ' or 'protected' in the file name.	Protected MAFs should be named as *.protected.maf and should not contain 'somatic' in the file name.
Mutation category	Somatic MAFs can only contain entries where Mutation_Status is "Somatic". If any other value is assigned to the field, the archive will fail. Experimentally validated or unvalidated (see next row) somatic mutations can be included in the file.	There is no such restriction for protected MAF. The file should contain all mutation calls including those from which .somatic.maf is derived.
Filtering criteria	In order to minimize germline contamination, somatic MAFs can contain unvalidated somatic mutations only from coding regions and splice sites, which implies:	There are no such constraints for mutations in protected MAF.
	If Validation_Status is "Unknown", Variant_Classification cannot be 3'UTR, 3'Flank, 5'UTR, 5'Flank, IGR, or Intron. Variant_Classification can only be {Frame_Shift_Del, Frame_Shift_Ins, In_Frame_Del, In_Frame_Ins, Missense_Mutation, Nonsense_Mutation, Silent, Splice_Site, Translation_Start_Site, Nonstop_Mutation, RNA, Targeted_Region, De_novo_Start_InFrame, De_novo_Start_OutOfFrame}.
	There is no such constraint for experimentally validated (Validation_Status is "Valid") somatic mutations.
	dbSNP sites that are not annotated as somatic in dbSNP, COSMIC or OMIM must be removed from somatic MAFs.
Access level	These files are deployed as open access data.	These files are deployed as protected data.

MAF file fields

The format of a MAF file is tab-delimited columns. Those columns are described in Table 1 and are required in every MAF file. The order of the columns will be validated by the DCC. Column headers and values are case sensitive where specified. Columns may allow null values (i.e., blank cells) and/or have enumerated values. The validator looks for a header stating the version of the specification to validate against (e.g., #version 2.4). If that header is not found, validation fails. Any columns that come after the columns described in Table 1 are optional. Optional columns are not validated by the DCC and can be in any order.
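
As a rough illustration of the version header check, the sketch below reads the specification version from an open MAF file. It assumes the `#version` line is the first line of the file, which the text above does not state explicitly; the function name `read_maf_version` is hypothetical.

```python
import io


def read_maf_version(handle) -> str:
    """Return the specification version declared in a MAF file.

    Assumes the '#version <x.y.z>' header is the first line of the file.
    Raises ValueError if it is missing, mirroring "validation fails".
    """
    first_line = handle.readline().rstrip("\r\n")
    prefix = "#version "
    if not first_line.startswith(prefix):
        raise ValueError("MAF file does not start with a '#version' header")
    return first_line[len(prefix):].strip()


# Usage with an in-memory file; a real file opened with open(path) works the same way.
example = io.StringIO("#version 2.4.1\nHugo_Symbol\tEntrez_Gene_Id\n")
version = read_maf_version(example)
```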

Table 1 - File column headers

Index	MAF Column Header	Description of Values	Example	Case Sensitive	Null	Enumerated
1	Hugo_Symbol	HUGO symbol for the gene (HUGO symbols are always in all caps). If no gene exists within 3kb enter "Unknown".	EGFR	Yes	No	Set or Unknown
		Source: http://genenames.org
2	Entrez_Gene_Id	Entrez gene ID (an integer). If no gene exists within 3kb enter "0".	1956	No	No	Set
		Source: http://ncbi.nlm.nih.gov/sites/entrez?db=gene
3	Center	Genome sequencing center reporting the variant. If multiple institutions report the same mutation separate list using semicolons. Non-GSC centers will be also supported if center name is an accepted center name.	hgsc.bcm.edu;genome.wustl.edu	Yes	No	Set
4	NCBI_Build	Any TCGA-accepted genome identifier. Can be a string, an integer, or a float.	hg18, hg19, GRCh37, GRCh37-lite, 36, 36.1, 37	No	No	Set and Enumerated.
5	Chromosome	Chromosome number without "chr" prefix that contains the gene.	X, Y, M, 1, 2, etc.	Yes	No	Set
6	Start_Position	Lowest numeric position of the reported variant on the genomic reference sequence. Mutation start coordinate (1-based coordinate system).	999	No	No	Set
7	End_Position	Highest numeric genomic position of the reported variant on the genomic reference sequence. Mutation end coordinate (inclusive, 1-based coordinate system).	1000	No	No	Set
8	Strand	Genomic strand of the reported allele. Variants should always be reported on the positive genomic strand. (Currently, only the positive strand is an accepted value).	+	No	No	+
9	Variant_Classification	Translational effect of variant allele.	Missense_Mutation	Yes	No	Frame_Shift_Del, Frame_Shift_Ins, In_Frame_Del, In_Frame_Ins, Missense_Mutation, Nonsense_Mutation, Silent, Splice_Site, Translation_Start_Site, Nonstop_Mutation, 3'UTR, 3'Flank, 5'UTR, 5'Flank, IGR (See Notes Section #1), Intron, RNA, Targeted_Region
10	Variant_Type	Type of mutation. TNP (tri-nucleotide polymorphism) is analogous to DNP but for 3 consecutive nucleotides. ONP (oligo-nucleotide polymorphism) is analogous to TNP but for consecutive runs of 4 or more.	INS	Yes	No	SNP, DNP, TNP, ONP, INS, DEL, or Consolidated (See Notes Section #2)
11	Reference_Allele	The plus strand reference allele at this position. Include the sequence deleted for a deletion, or "-" for an insertion.	A	Yes	No	A,C,G,T and/or -
12	Tumor_Seq_Allele1	Primary data genotype. Tumor sequencing (discovery) allele 1. "-" for a deletion represents a variant. "-" for an insertion represents the wild-type allele. Novel inserted sequence for an insertion should not include flanking reference bases.	C	Yes	No	A,C,G,T and/or -
13	Tumor_Seq_Allele2	Primary data genotype. Tumor sequencing (discovery) allele 2. "-" for a deletion represents a variant. "-" for an insertion represents the wild-type allele. Novel inserted sequence for an insertion should not include flanking reference bases.	G	Yes	No	A,C,G,T and/or -
14	dbSNP_RS	Latest dbSNP rs ID (dbSNP_ID) or "novel" if there is no dbSNP record. source: http://ncbi.nlm.nih.gov/projects/SNP/	rs12345	Yes	Yes	Set or "novel"
15	dbSNP_Val_Status	dbSNP validation status. Semicolon-separated list of validation statuses.	by2Hit2Allele;byCluster	No	Yes	by1000genomes; by2Hit2Allele; byCluster; byFrequency; byHapMap; byOtherPop; bySubmitter; alternate_allele (See Notes Section #3). Note that "none" will no longer be an acceptable value.
16	Tumor_Sample_Barcode	BCR aliquot barcode for the tumor sample including the two additional fields indicating plate and well position. i.e. TCGA-SiteID-PatientID-SampleID-PortionID-PlateID-CenterID. The full TCGA Aliquot ID.	TCGA-02-0021-01A-01D-0002-04	Yes	No	Set
17	Matched_Norm_Sample_Barcode	BCR aliquot barcode for the matched normal sample including the two additional fields indicating plate and well position. i.e. TCGA-SiteID-PatientID-SampleID-PortionID-PlateID-CenterID. The full TCGA Aliquot ID; e.g. TCGA-02-0021-10A-01D-0002-04 (compare SampleID '10A' for the normal sample to '01A' for the tumor sample).	TCGA-02-0021-10A-01D-0002-04	Yes	No	Set
18	Match_Norm_Seq_Allele1	Primary data. Matched normal sequencing allele 1. "-" for deletions; novel inserted sequence for INS not including flanking reference bases.	T	Yes	Yes	A,C,G,T and/or -
19	Match_Norm_Seq_Allele2	Primary data. Matched normal sequencing allele 2. "-" for deletions; novel inserted sequence for INS not including flanking reference bases.	ACGT	Yes	Yes	A,C,G,T and/or -
20	Tumor_Validation_Allele1	Secondary data from orthogonal technology. Tumor genotyping (validation) for allele 1. "-" for deletions; novel inserted sequence for INS not including flanking reference bases.	-	Yes	Yes	A,C,G,T and/or -
21	Tumor_Validation_Allele2	Secondary data from orthogonal technology. Tumor genotyping (validation) for allele 2. "-" for deletions; novel inserted sequence for INS not including flanking reference bases.	A	Yes	Yes	A,C,G,T and/or -
22	Match_Norm_Validation_Allele1	Secondary data from orthogonal technology. Matched normal genotyping (validation) for allele 1. "-" for deletions; novel inserted sequence for INS not including flanking reference bases.	C	Yes	Yes	A,C,G,T and/or -
23	Match_Norm_Validation_Allele2	Secondary data from orthogonal technology. Matched normal genotyping (validation) for allele 2. "-" for deletions; novel inserted sequence for INS not including flanking reference bases.	G	Yes	Yes	A,C,G,T and/or -
24	Verification_Status (See Notes Section #4)	Second pass results from independent attempt using same methods as primary data source. Generally reserved for 3730 Sanger Sequencing.	Verified	Yes	Yes	Verified, Unknown
25	Validation_Status (See Notes Section #5)	Second pass results from orthogonal technology.	Valid	Yes	No	Untested, Inconclusive, Valid, Invalid
26	Mutation_Status	Updated to reflect validation or verification status and to be in agreement with the VCF VLS field. The values allowed in this field are constrained by the value in the Validation_Status field.	Somatic	Yes	No	Validation_Status values: Untested, Inconclusive, Valid, Invalid. Allowed Mutation_Status values for Untested and Inconclusive (See Notes Section #6): None, Germline, Somatic, LOH, Post-transcriptional modification, Unknown. Allowed Mutation_Status values for Valid (See Notes Section #6): Germline, Somatic, LOH, Post-transcriptional modification, Unknown. Allowed Mutation_Status values for Invalid (See Notes Section #6): None.

27	Sequencing_Phase	TCGA sequencing phase. Phase should change under any circumstance that the targets under consideration change.	Phase_I	No	Yes	No
28	Sequence_Source	Molecular assay type used to produce the analytes used for sequencing. Allowed values are a subset of the SRA 1.5 library_strategy field values. This subset matches those used at CGHub.	WGS;WXS	Yes	No	Common TCGA values: WGS, WGA, WXS, RNA-Seq, miRNA-Seq, Bisulfite-Seq, VALIDATION, Other. Other allowed values (per SRA 1.5): ncRNA-Seq, WCS, CLONE, POOLCLONE, AMPLICON, CLONEEND, FINISHING, ChIP-Seq, MNase-Seq, DNase-Hypersensitivity, EST, FL-cDNA, CTS, MRE-Seq, MeDIP-Seq, MBD-Seq, Tn-Seq, FAIRE-seq, SELEX, RIP-Seq, ChIA-PET

29	Validation_Method	The assay platforms used for the validation call. Examples: Sanger_PCR_WGA, Sanger_PCR_gDNA, 454_PCR_WGA, 454_PCR_gDNA; separate multiple entries using semicolons.	Sanger_PCR_WGA;Sanger_PCR_gDNA	No	No. If Validation_Status = Untested, then "none". If Validation_Status = Valid or Invalid, then not "none" (case insensitive).	No
30	Score	Not in use.	NA	No	Yes	No
31	BAM_File	Not in use.	NA	No	Yes	No
32	Sequencer	Instrument used to produce primary data. Separate multiple entries using semicolons.	Illumina GAIIx;SOLID	Yes	No	Illumina GAIIx, Illumina HiSeq, SOLID, 454, ABI 3730xl, Ion Torrent PGM, Ion Torrent Proton, PacBio RS, Illumina MiSeq, Illumina HiSeq 2500, 454 GS FLX Titanium, AB SOLiD 4 System
33	Tumor_Sample_UUID	BCR aliquot UUID for tumor sample	550e8400-e29b-41d4-a716-446655440000	Yes	No
34	Matched_Norm_Sample_UUID	BCR aliquot UUID for matched normal	567e8487-e29b-32d4-a716-446655443246	Yes	No
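
The barcode pattern given for Tumor_Sample_Barcode and Matched_Norm_Sample_Barcode (TCGA-SiteID-PatientID-SampleID-PortionID-PlateID-CenterID) can be split into named fields as in the sketch below. The class and function names are hypothetical, and the sketch only checks the field count and the "TCGA" prefix; it does not check values against any BCR registry.

```python
from typing import NamedTuple


class AliquotBarcode(NamedTuple):
    """Fields of a full TCGA aliquot barcode, named after the pattern in Table 1."""
    project: str
    site_id: str
    patient_id: str
    sample_id: str
    portion_id: str
    plate_id: str
    center_id: str


def parse_aliquot_barcode(barcode: str) -> AliquotBarcode:
    """Split a barcode on '-' and label the seven resulting fields."""
    parts = barcode.split("-")
    if len(parts) != 7 or parts[0] != "TCGA":
        raise ValueError(f"not a full TCGA aliquot barcode: {barcode!r}")
    return AliquotBarcode(*parts)


tumor = parse_aliquot_barcode("TCGA-02-0021-01A-01D-0002-04")
normal = parse_aliquot_barcode("TCGA-02-0021-10A-01D-0002-04")
# The two example barcodes differ only in the SampleID field.
same_patient = tumor.patient_id == normal.patient_id
```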

Notes
1 Intergenic Region.
2 Consolidated is used to indicate a site that was initially reported as a variant but subsequently removed from further analysis because it was consolidated into a new variant. For example, a SNP variant incorporated into a TNP variant.
3 Used when the discovered variant differs from that of dbSNP.
4 These MAF headers describe the technology that was used to confirm a mutation, whether the same technology ("verification") or a different technology ("validation") is used to prove that a variant is germline or a somatic mutation.
5 These MAF headers describe the technology that was used to confirm a mutation, whether the same technology ("verification") or a different technology ("validation") is used to prove that a variant is germline or a somatic mutation.
6 Explanation of some Validation Status-Mutation Status combinations.

Validation Status	Mutation Status	Explanation
Valid	Unknown	a valid variant with unknown somatic status due to lack of data from matched normal tissue.
Invalid	None	validation attempted, tumor and normal are homozygous reference (formerly described as Wildtype)
Inconclusive	Unknown	validation failed, neither the genotype nor its somatic status is certain due to lack of data from matched normal tissue
Inconclusive	None	validation failed, tumor genotype appears to be homozygous reference
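
The dependency of Mutation_Status on Validation_Status described in row 26 of Table 1 could be encoded as a lookup table, as in this sketch. The names are hypothetical, and the spelling "None" (capitalized) is assumed for every Validation_Status, following the combinations table above.

```python
# Values allowed regardless of whether "None" is also permitted.
_COMMON = {"Germline", "Somatic", "LOH", "Post-transcriptional modification", "Unknown"}

# Allowed Mutation_Status values for each Validation_Status (Table 1, row 26).
ALLOWED_MUTATION_STATUS = {
    "Untested": _COMMON | {"None"},
    "Inconclusive": _COMMON | {"None"},
    "Valid": _COMMON,
    "Invalid": {"None"},
}


def mutation_status_allowed(validation_status: str, mutation_status: str) -> bool:
    """Return True if the pair is permitted; unknown Validation_Status values are rejected."""
    return mutation_status in ALLOWED_MUTATION_STATUS.get(validation_status, set())
```

Note that this covers a general (protected) MAF; a somatic MAF is further restricted to "Somatic" and "None".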

Important Criteria

**Index column indicates the order in which the columns are expected**. **All
headers are case sensitive.** The Case Sensitive column specifies which values
are case sensitive. The Null column indicates which MAF columns are allowed to
have null values. The Enumerated column indicates which MAF columns have
specified values: an Enumerated value of "No" indicates that there are no
specified values for that column; other values indicate that only the specific
values listed are allowed; a value of "Set" indicates that the MAF column values
come from a specified set of known values (*e.g.*, HUGO gene symbols).
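
A check of column header text and order against Table 1 might look like the following sketch. The list is transcribed from the Index and MAF Column Header columns of Table 1; the function name `header_problems` is hypothetical, and extra columns after the 34th are deliberately ignored because they are optional.

```python
REQUIRED_COLUMNS = [
    "Hugo_Symbol", "Entrez_Gene_Id", "Center", "NCBI_Build", "Chromosome",
    "Start_Position", "End_Position", "Strand", "Variant_Classification",
    "Variant_Type", "Reference_Allele", "Tumor_Seq_Allele1", "Tumor_Seq_Allele2",
    "dbSNP_RS", "dbSNP_Val_Status", "Tumor_Sample_Barcode",
    "Matched_Norm_Sample_Barcode", "Match_Norm_Seq_Allele1",
    "Match_Norm_Seq_Allele2", "Tumor_Validation_Allele1",
    "Tumor_Validation_Allele2", "Match_Norm_Validation_Allele1",
    "Match_Norm_Validation_Allele2", "Verification_Status", "Validation_Status",
    "Mutation_Status", "Sequencing_Phase", "Sequence_Source", "Validation_Method",
    "Score", "BAM_File", "Sequencer", "Tumor_Sample_UUID",
    "Matched_Norm_Sample_UUID",
]


def header_problems(columns: list) -> list:
    """Compare the first 34 column headers to Table 1 (case-sensitive, in order)."""
    problems = []
    for index, expected in enumerate(REQUIRED_COLUMNS, start=1):
        actual = columns[index - 1] if index <= len(columns) else None
        if actual != expected:
            problems.append(f"column {index}: expected {expected!r}, found {actual!r}")
    return problems
```

The header row itself can be obtained by splitting the first non-comment line of the file on tab characters.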

MAF file checks

The DCC Archive Validator checks the integrity of a MAF file. Validation will fail if any of the below are not true for a MAF file:

Column header text (including case) and order must match specification (Table 1) exactly
Values under column headers listed in the specification (Table 1) as not null must have values
Values that are specified in Table 1 as Case Sensitive must match the expected case
If column headers are listed in the specification as having enumerated values (i.e. a "Yes" in the "Enumerated" column), then the values under those columns must come from the enumerated values listed under "Enumerated"
If column headers are listed in the specification as having set values (i.e. a "Set" in the "Enumerated" column), then the values under those columns must come from the enumerated values of that domain (e.g. HUGO gene symbols)
All Allele-based columns must contain "-" (deletion), or a string composed of the following capitalized letters: A, T, G, C
If Validation_Status == "Untested" then Tumor_Validation_Allele1, Tumor_Validation_Allele2, Match_Norm_Validation_Allele1, Match_Norm_Validation_Allele2 can be null (depending on Validation_Status)
1. If Validation_Status == "Inconclusive" then Tumor_Validation_Allele1, Tumor_Validation_Allele2, Match_Norm_Validation_Allele1, Match_Norm_Validation_Allele2 can be null (depending on Validation_Status)
If Validation_Status == "Valid", then Validated_Tumor_Allele1 and Validated_Tumor_Allele2 must be populated (one of A, C, G, T, and -)
1. If Validation_Status == "Valid" then Tumor_Validation_Allele1, Tumor_Validation_Allele2, Match_Norm_Validation_Allele1, Match_Norm_Validation_Allele2 cannot be null
2. If Validation_Status == "Invalid" then Tumor_Validation_Allele1, Tumor_Validation_Allele2, Match_Norm_Validation_Allele1, Match_Norm_Validation_Allele2 cannot be null AND Tumor_Validation_Allele1 == Match_Norm_Validation_Allele1 AND Tumor_Validation_Allele2 == Match_Norm_Validation_Allele2 (Added as a replacement for 8a as a result of breakdown)
Check allele values against Mutation_Status:
Check allele values against Validation_Status:
1. If Mutation_Status == "Germline" and Validation_Status == "Valid", then Tumor_Validation_Allele1 == Match_Norm_Validation_Allele1 and Tumor_Validation_Allele2 == Match_Norm_Validation_Allele2
2. If Mutation_Status == "Somatic" and Validation_Status == "Valid", then Match_Norm_Validation_Allele1 == Match_Norm_Validation_Allele2 == Reference_Allele and (Tumor_Validation_Allele1 or Tumor_Validation_Allele2) != Reference_Allele
3. If Mutation_Status == "LOH" and Validation_Status=="Valid", then Tumor_Validation_Allele1 == Tumor_Validation_Allele2 and Match_Norm_Validation_Allele1 != Match_Norm_Validation_Allele2 and Tumor_Validation_Allele1 == (Match_Norm_Validation_Allele1 or Match_Norm_Validation_Allele2)
Check that Start_Position <= End_Position
Check for the Start_Position and End_Position against Variant_Type:
1. If Variant_Type == "INS", then (End_Position - Start_Position + 1 == length(Reference_Allele) or End_Position - Start_Position == 1) and length(Reference_Allele) <= length(Tumor_Seq_Allele1 and Tumor_Seq_Allele2)
2. If Variant_Type == "DEL", then End_Position - Start_Position + 1 == length(Reference_Allele) and length(Reference_Allele) >= length(Tumor_Seq_Allele1 and Tumor_Seq_Allele2)
3. If Variant_Type == "SNP", then length(Reference_Allele and Tumor_Seq_Allele1 and Tumor_Seq_Allele2) == 1 and (Reference_Allele and Tumor_Seq_Allele1 and Tumor_Seq_Allele2) != "-"
4. If Variant_Type == "DNP", then length(Reference_Allele and Tumor_Seq_Allele1 and Tumor_Seq_Allele2) == 2 and (Reference_Allele and Tumor_Seq_Allele1 and Tumor_Seq_Allele2) do not contain "-"
5. If Variant_Type == "TNP", then length(Reference_Allele and Tumor_Seq_Allele1 and Tumor_Seq_Allele2) == 3 and (Reference_Allele and Tumor_Seq_Allele1 and Tumor_Seq_Allele2) do not contain "-"
6. If Variant_Type == "ONP", then length(Reference_Allele) == length(Tumor_Seq_Allele1) == length(Tumor_Seq_Allele2) > 3 and (Reference_Allele and Tumor_Seq_Allele1 and Tumor_Seq_Allele2) do not contain "-"
Validation for UUID-based files:
1. Column #33 must be Tumor_Sample_UUID containing UUID of the BCR aliquot for tumor sample
2. Column #34 must be Matched_Norm_Sample_UUID containing UUID of the BCR aliquot for matched normal sample
3. Metadata represented by Tumor_Sample_Barcode and Matched_Norm_Sample_Barcode should correspond to the UUIDs assigned to Tumor_Sample_UUID and Matched_Norm_Sample_UUID respectively
If Validation_Status == "Valid" or "Invalid", then Validation_Method != "none" (case insensitive)
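
Two groups of the checks above, the position checks against Variant_Type and the allele checks against Mutation_Status for validated calls, are sketched below for a single row. This is not the DCC Archive Validator. The function names are hypothetical, rows are assumed to be dicts keyed by column header, "length(X and Y)" is read as "the condition holds for each of X and Y", and "(X or Y) != Reference_Allele" is read as "at least one of X and Y differs from Reference_Allele".

```python
FIXED_LENGTH = {"SNP": 1, "DNP": 2, "TNP": 3}


def check_positions(row: dict) -> list:
    """Return messages for violated Start/End position and Variant_Type checks."""
    start, end = int(row["Start_Position"]), int(row["End_Position"])
    ref = row["Reference_Allele"]
    tumor = [row["Tumor_Seq_Allele1"], row["Tumor_Seq_Allele2"]]
    alleles = [ref] + tumor
    vtype = row["Variant_Type"]
    errors = []
    if start > end:
        errors.append("Start_Position must be <= End_Position")
    if vtype == "INS":
        if not (end - start + 1 == len(ref) or end - start == 1):
            errors.append("INS: positions do not match Reference_Allele")
        if any(len(ref) > len(a) for a in tumor):
            errors.append("INS: Reference_Allele is longer than a tumor allele")
    elif vtype == "DEL":
        if end - start + 1 != len(ref):
            errors.append("DEL: positions do not match Reference_Allele")
        if any(len(ref) < len(a) for a in tumor):
            errors.append("DEL: Reference_Allele is shorter than a tumor allele")
    elif vtype in FIXED_LENGTH:
        n = FIXED_LENGTH[vtype]
        if any(len(a) != n or "-" in a for a in alleles):
            errors.append(f"{vtype}: alleles must have length {n} and no '-'")
    elif vtype == "ONP":
        same_length = len({len(a) for a in alleles}) == 1
        if not same_length or len(ref) <= 3 or any("-" in a for a in alleles):
            errors.append("ONP: alleles must share one length > 3 and contain no '-'")
    return errors


def check_validated_alleles(row: dict) -> list:
    """Return messages for violated allele checks when Validation_Status is Valid."""
    if row["Validation_Status"] != "Valid":
        return []
    ref = row["Reference_Allele"]
    t1, t2 = row["Tumor_Validation_Allele1"], row["Tumor_Validation_Allele2"]
    n1, n2 = row["Match_Norm_Validation_Allele1"], row["Match_Norm_Validation_Allele2"]
    status = row["Mutation_Status"]
    errors = []
    if status == "Germline" and not (t1 == n1 and t2 == n2):
        errors.append("Germline: tumor and normal validation alleles must match")
    if status == "Somatic" and not (n1 == n2 == ref and (t1 != ref or t2 != ref)):
        errors.append("Somatic: normal must be reference and tumor must differ")
    if status == "LOH" and not (t1 == t2 and n1 != n2 and t1 in (n1, n2)):
        errors.append("LOH: tumor must be homozygous for one allele of a heterozygous normal")
    return errors
```

Returning a list of messages rather than raising on the first failure lets a caller report every problem in a row at once.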

MAF naming convention

In archives uploaded to the DCC, the MAF file name should relate to the containing archive name in the following way:

If the archive has the name

<domain>_<disease_abbrev>.<platform>.Level_2.<serial_index>.<revision>.0.tar.gz

then a somatic MAF file with the archive should be named according to

<domain>_<disease_abbrev>.<platform>.Level_2.<serial_index>[.<optional_tag>].somatic.maf

and a protected MAF with the archive should be named according to

<domain>_<disease_abbrev>.<platform>.Level_2.<serial_index>[.<optional_tag>].protected.maf

The <optional_tag> may consist of alphanumeric characters, dash, and underscore; no spaces or periods; or it may be left out altogether. The purpose of the optional tag is to impart some brief annotation.

Example

For the archive

genome.wustl.edu_OV.IlluminaGA_DNASeq.Level_2.7.6.0.tar.gz

the following are examples of valid MAF names:

genome.wustl.edu_OV.IlluminaGA_DNASeq.Level_2.7.somatic.maf
genome.wustl.edu_OV.IlluminaGA_DNASeq.Level_2.7.protected.maf
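
The naming convention can be illustrated with a small helper that derives a MAF file name from an archive name. This is a sketch under stated assumptions: the function name `maf_name` is hypothetical, and the serial index and revision are assumed to be numeric, as they are in the example archive name.

```python
import re

# Assumes numeric <serial_index> and <revision>, as in the example archive name.
ARCHIVE_RE = re.compile(
    r"(?P<stem>.+\.Level_2\.\d+)\.\d+\.0\.tar\.gz"
)
# Optional tag: alphanumeric characters, dash, and underscore; no spaces or periods.
TAG_RE = re.compile(r"[A-Za-z0-9_-]+")


def maf_name(archive_name: str, kind: str, tag=None) -> str:
    """Build the somatic or protected MAF file name for a given archive name."""
    if kind not in ("somatic", "protected"):
        raise ValueError("kind must be 'somatic' or 'protected'")
    match = ARCHIVE_RE.fullmatch(archive_name)
    if match is None:
        raise ValueError(f"unexpected archive name: {archive_name!r}")
    if tag is not None and TAG_RE.fullmatch(tag) is None:
        raise ValueError(f"invalid optional tag: {tag!r}")
    middle = f".{tag}" if tag is not None else ""
    return f"{match.group('stem')}{middle}.{kind}.maf"


archive = "genome.wustl.edu_OV.IlluminaGA_DNASeq.Level_2.7.6.0.tar.gz"
somatic_name = maf_name(archive, "somatic")
protected_name = maf_name(archive, "protected")
```

The revision number is dropped from the MAF name, so every revision of the same serial index yields the same MAF file name.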