Copy Number Variation Analysis Pipeline

Introduction

The copy number variation (CNV) pipeline uses Affymetrix SNP 6.0 array data to identify genomic regions that are repeated and to infer the copy number of these repeats. The pipeline builds on the existing TCGA level 2 data generated by Birdsuite and uses the DNAcopy R package to perform a circular binary segmentation (CBS) analysis [1]. CBS translates noisy intensity measurements into chromosomal regions of equal copy number. The final output files are segmented into genomic regions, each with an estimated copy number. The GDC further transforms these copy number values into segment mean values, which are equal to log2(copy-number / 2). Diploid regions have a segment mean of zero, amplified regions have positive values, and deletions have negative values.
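The segment mean transform can be illustrated with a minimal Python sketch. The function names are hypothetical and chosen for illustration only; they are not part of the GDC workflow.

```python
import math


def segment_mean(copy_number: float) -> float:
    """Return log2(copy_number / 2), the segment mean for a given copy number."""
    if copy_number <= 0:
        # log2 is undefined for zero or negative copy number
        raise ValueError("copy number must be positive")
    return math.log2(copy_number / 2)


def copy_number_from_segment_mean(seg_mean: float) -> float:
    """Invert the transform: copy_number = 2 * 2 ** seg_mean."""
    return 2 * 2 ** seg_mean


# One copy (deletion), two copies (diploid), four copies (amplification)
for cn in (1, 2, 4):
    print(cn, segment_mean(cn))
```

A single-copy loss gives a negative value, a diploid region gives zero, and an amplification gives a positive value, matching the description above.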

Data Processing Steps

The GRCh38 probe set was produced by mapping probe sequences to the GRCh38 reference genome and can be downloaded from the GDC Reference File Website.

Copy Number Segmentation

The Copy Number Liftover Workflow uses the TCGA level 2 tangent.copynumber files described above. These files were generated by first normalizing array intensity values, estimating raw copy number, and performing tangent normalization, which subtracts variation that is found in a set of normal samples. Original array intensity values (TCGA level 1) are available in the GDC Legacy Archive under the "Data Format: CEL" and "Platform: Affymetrix SNP 6.0" filters.

The Copy Number Liftover Workflow performs CBS analysis using the DNAcopy R package to process tangent-normalized data into Copy Number Segment files, which associate contiguous chromosome regions with log2 ratio segment means in a tab-delimited format. The number of probes with intensity values associated with each chromosome region is also reported (probes with no intensity values are not included in this count). During copy number segmentation, probe sets from Pseudo-Autosomal Regions (PARs) are removed from males, and Y chromosome segments are removed from females.

Masked copy number segments are generated using the same method except that a filtering step is performed that removes the Y chromosome and probe sets that were previously indicated to be associated with frequent germline copy-number variation.
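The masking filter can be sketched in Python as follows. This is an illustration only: it assumes the filter is applied to probe-level records, and the `Probe` record layout, the probe identifiers, and the `germline_cnv_probe_ids` set are all hypothetical rather than taken from the GDC implementation.

```python
from typing import Iterable, List, NamedTuple, Set


class Probe(NamedTuple):
    # Hypothetical probe-level record; field names are assumptions
    probe_id: str
    chromosome: str
    position: int
    log2_ratio: float


Y_CHROMOSOME_NAMES = {"Y", "chrY"}


def mask_probes(probes: Iterable[Probe], germline_cnv_probe_ids: Set[str]) -> List[Probe]:
    """Drop Y chromosome probes and probes flagged for frequent germline CNV."""
    return [
        p
        for p in probes
        if p.chromosome not in Y_CHROMOSOME_NAMES
        and p.probe_id not in germline_cnv_probe_ids
    ]


# Made-up example records
probes = [
    Probe("probe_0001", "1", 1_000_000, 0.02),
    Probe("probe_0002", "1", 1_050_000, 0.61),
    Probe("probe_0003", "Y", 2_700_000, -0.10),
]
masked = mask_probes(probes, germline_cnv_probe_ids={"probe_0002"})
print([p.probe_id for p in masked])
```

Only the probe that is neither on the Y chromosome nor in the germline set survives the filter.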

I/O | Entity | Format
Input | Submitted Tangent Copy Number | TXT
Output | Copy Number Segment or Masked Copy Number Segment | TXT

Copy Number Estimation

Numeric focal-level Copy Number Alteration (CNA) values were generated from "Masked Copy Number Segment" files from tumor aliquots using GISTIC2 [2], [3] at the project level. Only protein-coding genes were kept, and their numeric CNA values were further thresholded by a noise cutoff of 0.3:

  • Genes with focal CNA values smaller than -0.3 are categorized as a "loss" (-1)
  • Genes with focal CNA values larger than 0.3 are categorized as a "gain" (+1)
  • Genes with focal CNA values from -0.3 to 0.3, inclusive, are categorized as "neutral" (0)
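
The thresholding rules above can be expressed as a short Python sketch. The function and constant names are hypothetical; only the 0.3 cutoff and the three categories come from the description above.

```python
NOISE_CUTOFF = 0.3  # noise cutoff stated in the text


def categorize_cna(value: float, cutoff: float = NOISE_CUTOFF) -> int:
    """Map a numeric focal CNA value to -1 (loss), 0 (neutral), or 1 (gain)."""
    if value < -cutoff:
        return -1
    if value > cutoff:
        return 1
    # Values from -cutoff to +cutoff, boundaries included, are neutral
    return 0


for v in (-0.45, -0.3, 0.0, 0.3, 0.45):
    print(v, categorize_cna(v))
```

Note that the boundary values -0.3 and 0.3 fall into the neutral category because the comparisons are strict.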

Values are reported in a project-level TSV file. Each row represents a gene, which is reported as an Ensembl ID and associated cytoband. The columns represent aliquots, which are associated with CNA value categorizations (0/1/-1) for each gene.
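
As a rough illustration of this layout, the following Python sketch reads a gene-by-aliquot table into a dictionary and counts losses per aliquot. The column headers "Gene ID" and "Cytoband", the gene identifiers, and the aliquot names are assumptions made for the example; check them against an actual file before relying on them.

```python
import csv
import io

# Assumed names of the annotation columns; every other column is an aliquot
ANNOTATION_COLUMNS = ("Gene ID", "Cytoband")


def read_cna_table(handle):
    """Return (aliquot names, {gene_id: {aliquot: category}}) from a TSV handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    aliquots = [c for c in reader.fieldnames if c not in ANNOTATION_COLUMNS]
    table = {}
    for row in reader:
        table[row["Gene ID"]] = {a: int(row[a]) for a in aliquots}
    return aliquots, table


# Made-up example content; the gene IDs are placeholders, not real Ensembl IDs
example = io.StringIO(
    "Gene ID\tCytoband\taliquot_A\taliquot_B\n"
    "GENE_PLACEHOLDER_1\t1p36.33\t0\t-1\n"
    "GENE_PLACEHOLDER_2\t1p36.32\t1\t-1\n"
)
aliquots, table = read_cna_table(example)

# Count genes categorized as a loss (-1) in each aliquot
losses = {a: sum(1 for calls in table.values() if calls[a] == -1) for a in aliquots}
print(losses)
```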

I/O | Entity | Format
Input | Masked Copy Number Segment | TXT
Output | Copy Number Estimate | TXT

GISTIC2 Command Line Parameters

gistic2 \
-b <base_directory> \
-seg <segmentation_file> \
-mk <marker_file> \
-refgene <reference_gene_file> \
-ta 0.1 \
-armpeel 1 \
-brlen 0.7 \
-cap 1.5 \
-conf 0.99 \
-td 0.1 \
-genegistic 1 \
-gcm extreme \
-js 4 \
-maxseg 2000 \
-qvt 0.25 \
-rx 0 \
-savegene 1 \
(-broad 1)

File Access and Availability

Type | Description | Format
Copy Number Segment | A table that associates contiguous chromosomal segments with genomic coordinates, mean array intensity, and the number of probes that bind to each segment. | TXT
Masked Copy Number Segment | A table with the same information as the Copy Number Segment except that segments with probes known to contain germline mutations are removed. | TXT

[1] Olshen, Adam B., E. S. Venkatraman, Robert Lucito, and Michael Wigler. "Circular binary segmentation for the analysis of array-based DNA copy number data." Biostatistics 5, no. 4 (2004): 557-572.

[2] Mermel, Craig H., Steven E. Schumacher, Barbara Hill, Matthew L. Meyerson, Rameen Beroukhim, and Gad Getz. "GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers." Genome Biology 12, no. 4 (2011): R41.

[3] Beroukhim, Rameen, Craig H. Mermel, Dale Porter, Guo Wei, Soumya Raychaudhuri, Jerry Donovan, Jordi Barretina et al. "The landscape of somatic copy-number alteration across human cancers." Nature 463, no. 7283 (2010): 899-905.