Harmonized data is raw data collected from multiple sources and normalized so that valid comparisons can be made across those sources.
Data harmonization is one of the underlying principles on which the GDC is based. Genomic data is typically collected, processed, and analyzed on a project-level basis by many different groups. Even very similar projects cannot always be compared in a valid way, because of small differences in their data processing and analysis pipelines. The GDC collects raw data from many cancer projects and processes it using standardized pipelines [1] and the reference genome GRCh38 [2]. This makes it possible to analyze multiple cancer types together, or the same cancer type across multiple projects.
GDC data is harmonized using carefully curated bioinformatics pipelines that produce somatic variant calls, gene expression data, copy number variation estimates, and methylation data. Clinical and biospecimen data are also harmonized: a set of elements common to all projects is made available for download through the API. As new projects are submitted to the GDC, incoming data is reviewed by a team of bioinformaticians, who determine how to proceed with harmonization based on the data type, quality, and available computational resources.
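To make the idea of a common set of clinical elements concrete, here is a minimal sketch of how records from two projects might be mapped onto shared fields. It is an illustration only, not the GDC's actual procedure: the project names, the raw field names, the common element names (`gender`, `age_at_diagnosis_days`, `primary_site`), and the `harmonize` helper are all hypothetical.

```python
from typing import Any, Callable, Dict, Tuple

DAYS_PER_YEAR = 365.25  # assumption used to convert ages given in years

# Hypothetical rules: raw field name -> (common element name, converter).
# Each project records the same facts under different names, codes, and units.
Rule = Tuple[str, Callable[[Any], Any]]
RULES: Dict[str, Dict[str, Rule]] = {
    "project_a": {
        "sex": ("gender", lambda v: str(v).strip().lower()),
        "dx_age_years": ("age_at_diagnosis_days",
                         lambda v: round(float(v) * DAYS_PER_YEAR)),
        "site": ("primary_site", lambda v: str(v).strip().title()),
    },
    "project_b": {
        "Gender": ("gender",
                   lambda v: {"M": "male", "F": "female"}.get(
                       str(v).strip().upper(), "unknown")),
        "age_days": ("age_at_diagnosis_days", lambda v: int(v)),
        "PrimarySite": ("primary_site", lambda v: str(v).strip().title()),
    },
}


def harmonize(project: str, record: Dict[str, Any]) -> Dict[str, Any]:
    """Map one raw record onto the common elements; unmapped fields are dropped."""
    rules = RULES[project]
    harmonized: Dict[str, Any] = {"project": project}
    for raw_name, value in record.items():
        if raw_name in rules:
            common_name, convert = rules[raw_name]
            harmonized[common_name] = convert(value)
    return harmonized


# Two records describing comparable cases in project-specific formats.
case_a = harmonize("project_a", {"sex": " Female ", "dx_age_years": 40, "site": "lung"})
case_b = harmonize("project_b", {"Gender": "f", "age_days": "14610",
                                 "PrimarySite": "LUNG", "local_id": 7})

# After harmonization the shared elements can be compared directly.
shared = ("gender", "age_at_diagnosis_days", "primary_site")
print(all(case_a[key] == case_b[key] for key in shared))
```

Keeping the rules in a per-project table means a new project can be added by writing one more mapping, without changing the code that consumes the harmonized records.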