GDC Data Model

Introduction

The GDC Data Model is the central means of organizing all data artifacts in the GDC. An overview of the data model, including a visual representation of its components, is provided on the GDC website. This section provides technical details about its implementation for data users, submitters, and developers.

Entities, Properties, and Links

Although the GDC Data Model may contain some cyclic elements, it can be helpful to think of it as a Directed Acyclic Graph (DAG) composed of interconnected entities. Each entity in the GDC has a set of properties and links.

Properties are key-value pairs associated with an entity. Properties cannot be nested, which means that the value must be numerical, boolean, or a string, and cannot be another key-value set. Properties can be either required or optional. The following properties are of particular importance in constructing the GDC Data Model:
- Type is a required property for all entities. Entity types include project, case, demographic, sample, read_group and others.
- System properties are properties used in GDC system operation and maintenance. They cannot be modified except under special circumstances.
- Unique keys are properties, or combinations of properties, that uniquely identify an entity in the GDC. For example, the tuple (combination) of [ project_id, submitter_id ] is a unique key for most entities, which means that submitter_id does not need to be unique across the GDC, but it must be unique within a project. See GDC Identifiers below for details.

Links define relationships between entities and the multiplicity of those relationships (e.g. one-to-one, one-to-many, many-to-many).
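As a rough illustration of the rules above, the following Python sketch models an entity as a flat set of properties plus a list of links. The Entity and Link classes, the is_flat helper, and the sample values (including the link name and its multiplicity) are hypothetical and are not part of any GDC library.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

# Property values must be scalars: numerical, boolean, or string.
Scalar = Union[int, float, bool, str]


@dataclass
class Link:
    """A relationship from one entity to another (hypothetical structure)."""
    name: str          # label for the relationship
    target_type: str   # entity type at the other end of the link
    multiplicity: str  # e.g. "one_to_one", "one_to_many", "many_to_many"


@dataclass
class Entity:
    type: str  # required for all entities
    properties: Dict[str, Scalar] = field(default_factory=dict)
    links: List[Link] = field(default_factory=list)


def is_flat(properties: dict) -> bool:
    """Return True if every value is a scalar, i.e. nothing is nested."""
    return all(isinstance(v, (int, float, bool, str)) for v in properties.values())


# Illustrative values only; the multiplicity shown is an assumption.
sample = Entity(
    type="sample",
    properties={"submitter_id": "my-sample-001", "project_id": "TCGA-LAML"},
    links=[Link(name="cases", target_type="case", multiplicity="many_to_one")],
)
```

Calling is_flat(sample.properties) returns True, while a value such as {"details": {"a": 1}} would be rejected because it nests another key-value set.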

The GDC Data Dictionary determines which properties and links an entity can have according to entity type.

Functionally similar entity types are grouped under the same category. For example, the entity types slide_image and submitted_unaligned_reads belong to the data_file category, which comprises entities that represent downloadable files.
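The role of the dictionary can be sketched with a toy lookup table. This is only an assumption about how such a check might look: the entity type and category names come from the text above, but the property lists, the TOY_DICTIONARY name, and both helper functions are invented for illustration and do not reflect the actual GDC Data Dictionary contents.

```python
# Toy stand-in for a data dictionary. Property lists are hypothetical.
TOY_DICTIONARY = {
    "slide_image": {
        "category": "data_file",
        "required": {"type", "submitter_id"},
        "optional": {"file_name"},
    },
    "submitted_unaligned_reads": {
        "category": "data_file",
        "required": {"type", "submitter_id"},
        "optional": {"file_name"},
    },
}


def validate(entity: dict) -> list:
    """Return a list of problems; an empty list means the entity passes."""
    schema = TOY_DICTIONARY.get(entity.get("type"))
    if schema is None:
        return [f"unknown entity type: {entity.get('type')!r}"]
    keys = set(entity)
    problems = []
    for missing in sorted(schema["required"] - keys):
        problems.append(f"missing required property: {missing}")
    for extra in sorted(keys - schema["required"] - schema["optional"]):
        problems.append(f"property not allowed for this type: {extra}")
    return problems


def types_in_category(category: str) -> list:
    """List the entity types that the toy dictionary places in a category."""
    return sorted(t for t, s in TOY_DICTIONARY.items() if s["category"] == category)
```

The design point is that validation is driven entirely by the entity's type: the same checking code serves every entity type, and only the dictionary entries differ.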

GDC Identifiers

UUIDs

When an entity is created, it is assigned a unique identifier in the form of a version 4 universally unique identifier (UUID). The UUID uniquely identifies the entity in the GDC, and is stored in the entity's id property.
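For readers who want to generate or check identifiers of this form locally, a minimal sketch using Python's standard uuid module follows. The function names new_entity_id and is_uuid4 are hypothetical, and the sketch says nothing about how the GDC itself generates identifiers.

```python
import uuid


def new_entity_id() -> str:
    """Generate a random (version 4) UUID in canonical string form."""
    return str(uuid.uuid4())


def is_uuid4(value: str) -> bool:
    """Return True if value is a canonically formatted version 4 UUID."""
    try:
        parsed = uuid.UUID(value)
    except ValueError:
        return False
    # Require version 4 and the canonical hyphenated form (case-insensitive).
    return parsed.version == 4 and str(parsed) == value.lower()
```

A check like is_uuid4 can help distinguish a UUID from a submitter_id before looking an entity up.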

Program Name, Project Code, and Project ID

Programs are the highest level of organization of GDC datasets. Each program is assigned a unique program.name property. Datasets within a program are organized into projects, and each project is assigned a project.code property.

The project_id property is associated with most entities in the GDC Data Model and is generated by appending project.code to program.name, separated by a hyphen, as follows:

program.name-project.code
(e.g. TCGA-LAML)

Note that program.name never contains hyphens.
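The construction above can be sketched in Python as follows. The function names are hypothetical. Splitting at the first hyphen is an inference from the rule that program.name never contains hyphens, not a documented GDC procedure, and the identifier "PROG-AB-CD" in the usage note is invented.

```python
def make_project_id(program_name: str, project_code: str) -> str:
    """Join program.name and project.code with a hyphen."""
    if "-" in program_name:
        raise ValueError("program.name must not contain hyphens")
    return f"{program_name}-{project_code}"


def split_project_id(project_id: str) -> tuple:
    """Split a project_id into (program.name, project.code).

    Because program.name has no hyphens, the first hyphen marks the boundary,
    even if the remainder contains further hyphens.
    """
    program_name, sep, project_code = project_id.partition("-")
    if not sep or not program_name or not project_code:
        raise ValueError(f"not a valid project_id: {project_id!r}")
    return program_name, project_code
```

For example, make_project_id("TCGA", "LAML") returns "TCGA-LAML", and split_project_id("PROG-AB-CD") returns ("PROG", "AB-CD").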

Submitter ID

In addition to UUIDs stored in the id property, many entities also have a submitter_id property. This property can contain any string that the submitter wishes to use to identify the entity (e.g. a "barcode"). This can be used to identify a corresponding entry in the submitter's records. The GDC requires that submitter_id be unique for each entity within a project: the tuple (combination) of [ project_id, submitter_id ] is a unique key.

Note: The submitter_id of a case entity corresponds to the submitted_subject_id of the study participant in dbGaP records for the project.
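Submitters who want to check the uniqueness rule for submitter_id before uploading could use something like the sketch below. It assumes entities are available as plain dictionaries with project_id and submitter_id keys; the function name and the sample records (including the project "PROG-OTHER") are hypothetical.

```python
from collections import Counter


def find_duplicate_keys(entities: list) -> list:
    """Return the (project_id, submitter_id) pairs that occur more than once."""
    counts = Counter((e["project_id"], e["submitter_id"]) for e in entities)
    return sorted(key for key, n in counts.items() if n > 1)


records = [
    {"project_id": "TCGA-LAML", "submitter_id": "barcode-1"},
    # Same submitter_id in a different project: allowed.
    {"project_id": "PROG-OTHER", "submitter_id": "barcode-1"},
    # Same submitter_id in the same project: violates the unique key.
    {"project_id": "TCGA-LAML", "submitter_id": "barcode-1"},
]
```

Here find_duplicate_keys(records) reports only the ("TCGA-LAML", "barcode-1") pair, since reuse of a submitter_id across projects is permitted.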

Working with the GDC Data Model

Data Users

Users can access information stored in the GDC Data Model using the GDC Data Portal, the GDC API, and the GDC Data Transfer Tool. For more information see Data Access Processes and Tools.

Data Submitters

Data submitters can create and update submittable entities in the GDC Data Model and upload data files registered in the model using the GDC Data Submission Portal, the GDC API, and the GDC Data Transfer Tool. For more information see Data Submission Processes and Tools.