GDC Data Model
Introduction
The GDC Data Model is the central method for organizing all data artifacts in the GDC. An overview of the data model, including a visual representation of its components, is provided on the GDC website. This section provides technical details about its implementation for data users, submitters, and developers.
Entities, Properties, and Links
Although the GDC Data Model may contain some cyclic elements, it can be helpful to think of it as a Directed Acyclic Graph (DAG) composed of interconnected entities. Each entity in the GDC has a set of properties and links.
- Properties are key-value pairs associated with an entity. Properties cannot be nested: the value must be a number, a boolean, or a string, and cannot be another key-value set. Properties can be either required or optional. The following properties are of particular importance in constructing the GDC Data Model:
  - Type is a required property for all entities. Entity types include project, case, demographic, sample, read_group, and others.
  - System properties are properties used in GDC system operation and maintenance. They cannot be modified except under special circumstances.
  - Unique keys are properties, or combinations of properties, that can be used to uniquely identify the entity in the GDC. For example, the tuple (combination) of [ project_id, submitter_id ] is a unique key for most entities, which means that although submitter_id does not need to be unique across the GDC, it must be unique within a project. See GDC Identifiers below for details.
- Links define relationships between entities, and the multiplicity of those relationships (e.g. one-to-one, one-to-many, many-to-many).
The GDC Data Dictionary determines which properties and links an entity can have according to its entity type.
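The entity structure described above can be sketched in Python as follows. This is a minimal illustration, not GDC code: the Entity class, the SCALAR_TYPES tuple, the "cases" link name, and the example submitter_id values are all hypothetical. It only demonstrates the two rules stated above, namely that every entity has a type and that property values cannot be nested.

```python
from dataclasses import dataclass, field

# Property values must be numbers, booleans, or strings (no nesting).
SCALAR_TYPES = (bool, int, float, str)


@dataclass
class Entity:
    """Hypothetical in-memory stand-in for a GDC entity (not a GDC API class)."""
    type: str                                       # required for all entities
    properties: dict = field(default_factory=dict)  # flat key-value pairs
    links: dict = field(default_factory=dict)       # link name -> list of linked ids

    def __post_init__(self):
        if not self.type:
            raise ValueError("every entity needs a type")
        for key, value in self.properties.items():
            if not isinstance(value, SCALAR_TYPES):
                raise TypeError(
                    f"property {key!r} must be a number, boolean, or string, "
                    f"got {type(value).__name__}"
                )


# Illustrative values only; the identifiers below are made up.
case = Entity(type="case", properties={"submitter_id": "example-case-01"})
sample = Entity(
    type="sample",
    properties={"submitter_id": "example-sample-01"},
    links={"cases": ["example-case-01"]},  # assumed link name, for illustration
)
```

Constructing an Entity whose properties contain a dictionary or list raises TypeError, which mirrors the rule that a property value cannot be another key-value set.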
Functionally similar entity types are grouped under the same category. For example, the entity types slide_image and submitted_unaligned_reads belong to the data_file category, which comprises entities that represent downloadable files.
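As a small illustration of this grouping, the sketch below inverts a type-to-category mapping. Only the two entity types named above are included; the mapping variable and the group_by_category helper are hypothetical, and the authoritative assignment of types to categories comes from the GDC Data Dictionary.

```python
from collections import defaultdict

# Only the two examples given in the text; the real mapping is defined
# by the GDC Data Dictionary.
TYPE_TO_CATEGORY = {
    "slide_image": "data_file",
    "submitted_unaligned_reads": "data_file",
}


def group_by_category(type_to_category):
    """Return a dict mapping each category to a sorted list of its entity types."""
    grouped = defaultdict(list)
    for entity_type, category in type_to_category.items():
        grouped[category].append(entity_type)
    return {category: sorted(types) for category, types in grouped.items()}


categories = group_by_category(TYPE_TO_CATEGORY)
```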
GDC Identifiers
UUIDs
When an entity is created, it is assigned a unique identifier in the form of a version 4 universally unique identifier (UUID). The UUID uniquely identifies the entity in the GDC, and is stored in the entity's id property.
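For readers who want to recognize or generate identifiers of this form, here is a minimal sketch using Python's standard uuid module. The function names are hypothetical. Note that the GDC assigns the id itself when an entity is created; generating one locally only shows what a version 4 UUID looks like.

```python
import uuid


def new_uuid4() -> str:
    """Return a random version 4 UUID in canonical lowercase string form."""
    return str(uuid.uuid4())


def is_uuid4(value: str) -> bool:
    """Return True if value is a canonically formatted version 4 UUID."""
    try:
        parsed = uuid.UUID(value)
    except ValueError:
        return False
    # uuid.UUID also accepts braces, a urn: prefix, and missing hyphens,
    # so compare against the canonical form to be strict.
    return parsed.version == 4 and str(parsed) == value.lower()


example_id = new_uuid4()
```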
Program Name, Project Code, and Project ID
Programs are the highest level of organization of GDC datasets. Each program is assigned a unique program.name property. Datasets within a program are organized into projects, and each project is assigned a project.code property.
The project_id property is associated with most entities in the GDC data model and is generated by appending project.code to program.name, separated by a hyphen, as follows:
program.name-project.code (e.g. TCGA-LAML)
Note that program.name never contains hyphens.
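Because program.name never contains hyphens, a project_id can be split unambiguously at its first hyphen, even if the project code itself contains one. The sketch below illustrates both directions. The function names are hypothetical, and the project ID "PROG-CODE-A" is a made-up value used only to show the splitting behavior.

```python
from typing import Tuple


def make_project_id(program_name: str, project_code: str) -> str:
    """Build a project_id of the form program.name-project.code."""
    if "-" in program_name:
        raise ValueError("program.name must not contain hyphens")
    return f"{program_name}-{project_code}"


def split_project_id(project_id: str) -> Tuple[str, str]:
    """Split a project_id at its first hyphen into (program.name, project.code)."""
    program_name, separator, project_code = project_id.partition("-")
    if not separator or not program_name or not project_code:
        raise ValueError(f"not a valid project_id: {project_id!r}")
    return program_name, project_code


laml = make_project_id("TCGA", "LAML")      # example from the text
parts = split_project_id("PROG-CODE-A")     # made-up ID with a hyphenated code
```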
Submitter ID
In addition to UUIDs stored in the id property, many entities also have a submitter_id property. This property can contain any string that the submitter wishes to use to identify the entity (e.g. a "barcode"). This can be used to identify a corresponding entry in the submitter's records. The GDC requires that submitter_id be unique for each entity within a project: the tuple (combination) of [ project_id, submitter_id ] is a unique key.
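Submitters may find it useful to check this uniqueness rule locally before uploading. The following sketch assumes each record is a plain dictionary with project_id and submitter_id keys; the duplicate_keys helper, the second project ID, and the sample records are hypothetical, and this is not how the GDC itself enforces the constraint.

```python
from collections import Counter


def duplicate_keys(records):
    """Return the sorted (project_id, submitter_id) pairs that occur more than once."""
    counts = Counter((r["project_id"], r["submitter_id"]) for r in records)
    return sorted(key for key, count in counts.items() if count > 1)


# Made-up records for illustration.
records = [
    {"project_id": "TCGA-LAML", "submitter_id": "sample-A"},
    {"project_id": "TCGA-LAML", "submitter_id": "sample-B"},
    {"project_id": "PROG-OTHER", "submitter_id": "sample-A"},  # allowed: other project
    {"project_id": "TCGA-LAML", "submitter_id": "sample-A"},   # violates the unique key
]

conflicts = duplicate_keys(records)
```

The same submitter_id in two different projects is not reported, since uniqueness is only required within a project.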
Note: The submitter_id of a case entity corresponds to the submitted_subject_id of the study participant in dbGaP records for the project.
Working with the GDC Data Model
Data Users
Users can access information stored in the GDC Data Model using the GDC Data Portal, the GDC API, and the GDC Data Transfer Tool. For more information see Data Access Processes and Tools.
Data Submitters
Data submitters can create and update submittable entities in the GDC Data Model and upload data files registered in the model using the GDC Data Submission Portal, the GDC API, and the GDC Data Transfer Tool. For more information see Data Submission Processes and Tools.