Gene Expression Clustering Tool
Introduction to Gene Expression Clustering
The Gene Expression Clustering tool is a web-based tool for performing sample clustering by selecting a desired set of genes from the NCI Genomic Data Commons (GDC), and visualizing a heatmap of a z-score transformed matrix.
Quick Reference Guide
At the Analysis Center, click the 'Gene Expression Clustering' card to launch the heatmap.
Users can view publicly available genes and can log in with credentials to access controlled data.
There are four main panels in the Gene Expression Clustering tool: controls, heatmap, variables, and legend.
Controls
The control panel modifies the displayed data or the appearance of the matrix. Its functions are outlined below.
- Clustering: Change the clustering method, distance method, column and row dendrogram dimensions, z-score cap, and color scheme.
- Cases: Adjust the visible characters of the case labels
- Genes: Modify how cases are represented for each gene (Absolute, Percent, or None), row group and label lengths, rendering style, and the existing gene set
- Edit Group: Displays a panel of currently selected genes, which can be modified by clicking on a gene to remove it from the gene set, searching for a particular gene to add, loading top variably expressed genes, or loading a pre-defined gene set provided by the MSigDB database
- Create Group: Create a new gene set by searching for a particular gene, loading top mutated genes, or loading a pre-defined gene set provided by the MSigDB database
- Variables: Search and select variables to add to the matrix below the heatmap
- Cell Layout: Modify the format of the cells by changing colors, cell dimensions, and label formatting
- Legend Layout: Alter the legend by changing the font size, dimensions, and other formatting preferences
- Download: Download the plot in svg format
- Zoom: Adjust the zoom level by using the up and down arrows on the input box, entering a number, or using the sliding scale to view the case labels.
Heatmap
The Gene Expression Clustering heatmap displays the active cohort's cases as columns along the top, genes as rows along the left, and the z-score transformed gene expression value in each cell.
Hovering over a cell in the heatmap displays the case submitter_id, gene name, and gene expression value.
Clicking on a cell also gives users the option to launch the Disco plot, a circos plot displaying copy number data and consequences for that case.
Selecting cases on the cluster
Cases on the cluster can be selected by clicking on the top dendrogram. Once part of the dendrogram is selected, users can choose to zoom in to the cases, list all highlighted cases, or create a cohort of the selected cases.
Click on a case in the dendrogram to showcase the Disco plot or the GDC Case Summary Page.
Selecting genes on the cluster
Genes on the cluster can be selected by clicking on the left dendrogram. Once part of the dendrogram is selected, users can choose to list the selected genes or launch Gene Set Overrepresentation Analysis with the selected genes.
Clicking Gene set overrepresentation analysis launches an ORA chart above. After a gene set group is selected, a table shows the results of the gene set overrepresentation analysis.
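To illustrate what an overrepresentation analysis measures, here is a minimal sketch that computes a one-sided hypergeometric p-value for the overlap between the selected genes and one gene set. This guide does not state which statistical test the tool uses, so the hypergeometric test is an assumption here (it is a common choice for ORA). The gene names, the background size, and the function name `ora_p_value` are hypothetical.

```python
from math import comb

def ora_p_value(selected_genes, gene_set, background_size):
    """Probability of seeing at least the observed overlap by chance.

    Assumes every gene in both lists belongs to a background of
    `background_size` genes (a hypothetical parameter for this sketch).
    """
    selected = set(selected_genes)
    members = set(gene_set)
    overlap = len(selected & members)
    n, k = len(selected), len(members)
    total = comb(background_size, n)
    # Sum the hypergeometric probabilities for overlap, overlap + 1, ...
    tail = sum(
        comb(k, i) * comb(background_size - k, n - i)
        for i in range(overlap, min(n, k) + 1)
    )
    return tail / total

# Hypothetical example: 4 selected genes, 3 of which fall in a 5-gene set,
# drawn from a background of 20 genes.
p = ora_p_value(["G1", "G2", "G3", "G9"], ["G1", "G2", "G3", "G4", "G5"], 20)
print(f"p = {p:.4f}")
```

A small p-value suggests that the selected genes overlap the gene set more than expected by chance. In practice, p-values across many gene sets are usually corrected for multiple testing.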
In the column of genes on the left, click on a gene to rename it, launch the ProteinPaint Lollipop plot, display the GDC Gene Summary Page, or remove the gene. The lollipop plot displays all cases across the GDC affected by SSMs in the selected gene.
Variables
Any variables added to the matrix appear below the heatmap. Users can hover over a cell to display the case submitter_id and their value for the given variable.
Click on a variable to rename it, edit it by excluding categories, replace it with a different variable, or remove it entirely.
When editing the "Overall Survival" variable, users can choose between Time to Event or Exit Code. If Time to Event is selected, users have the option to convert the values to z-scores.
Users can drag and drop a variable row that isn't used for clustering to reposition it.
Legend
In addition to the color coding system for the gene expression values, the legend displays the number of cases from the active cohort in each category for all variables that are selected to appear in the matrix.
Users can click on a variable in the legend to hide a specific category, only show a specific category, or show all categories for the selected variable.
Accessing the Tool
At the Analysis Center, click the 'Gene Expression Clustering' card to launch the heatmap.
View publicly available genes, or log in with credentials to access controlled data.
Features
The following features are available once the default heatmap is loaded. The default heatmap shows all the glioma cases. There are four main panels, as outlined in the figure: 'Controls', 'Heatmap', 'Variables', and 'Legend'. Each feature is described in detail in the following sections.
Controls
The control panel, as shown, provides a wide range of options for changing the appearance of the matrix.
Clustering
The clustering control button provides several options to modify the default clustering of the heatmap. Click on the button labeled 'Clustering' to display a menu with options as shown.
Cluster cases
Check or uncheck to show or hide the column dendrogram.
Clustering method
Click on the options to change the method of clustering. The heatmap will render again with the clustering method selected.
Distance method
Click on the options to change the distance method. The heatmap will render again with the distance method selected.
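To show how a clustering method and a distance method work together, here is a minimal sketch of agglomerative (hierarchical) clustering. It assumes average linkage and Euclidean distance, which are common choices but are not confirmed by this guide as the tool's options. The function names and sample values are hypothetical.

```python
from itertools import combinations
from math import sqrt

def euclidean(a, b):
    """Distance method: straight-line distance between two profiles."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(profiles, distance=euclidean):
    """Clustering method: repeatedly merge the two closest clusters.

    Returns the merges in order as (cluster_a, cluster_b, distance),
    which is the information a dendrogram draws.
    """
    clusters = [(i,) for i in range(len(profiles))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            # Average linkage: mean pairwise distance between members.
            d = sum(distance(profiles[i], profiles[j]) for i in a for j in b)
            d /= len(a) * len(b)
            if best is None or d < best[2]:
                best = (a, b, d)
        a, b, d = best
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges

# Hypothetical expression profiles for four cases (two genes each).
profiles = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 12.0]]
merges = average_linkage(profiles)
for a, b, d in merges:
    print(a, b, round(d, 2))
```

Passing a different `distance` function or changing the linkage rule changes which clusters merge first. This is why the heatmap is rendered again whenever either option changes.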
Column Dendrogram Height
Click or edit the number in the input box to adjust the height of the column dendrograms as shown.
Row Dendrogram Width
Similarly, the row dendrogram width can be adjusted as shown.
Z-score Cap
Z-scores are used to compare gene expression across samples. A Z-score of zero indicates that the gene's expression level is the same as the mean expression level across all samples, while a positive Z-score indicates that the gene is expressed at a higher level than the mean, and a negative Z-score indicates that the gene is expressed at a lower level than the mean.
Users can increase or decrease the Z-score cap. For example, increase the Z-score cap from 5 to 10 as shown. Samples with lower gene expression become lighter, which highlights clusters with higher expression values, shown in red in the heatmap.
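The z-score transform and cap described above can be sketched as follows. This is a minimal illustration, not the tool's implementation: it assumes the population standard deviation and a symmetric cap of ±`cap`. The function name `zscore_row` and the expression values are hypothetical.

```python
from statistics import mean, pstdev

def zscore_row(values, cap=5.0):
    """Z-score one gene's values across samples, then clip to [-cap, cap]."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        # A gene with identical values everywhere has no variation to show.
        return [0.0 for _ in values]
    return [max(-cap, min(cap, (v - mu) / sigma)) for v in values]

# Hypothetical expression values for one gene across five samples.
expression = [1.0, 2.0, 3.0, 4.0, 100.0]
capped = zscore_row(expression, cap=1.5)
print([round(z, 2) for z in capped])
```

With a low cap, the outlier sample is clipped to the cap value, so it shares the most saturated color with any other sample at or above the cap. Raising the cap stretches the color scale over a wider range of z-scores.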
Color Scheme
Click on the options to change the color scheme used. The heatmap will render again with the color scheme selected.
Cases
The 'Cases' control has these options:
Case Label Character Limit
Adjust the number of visible characters in the case labels. The default is '32'. Note that reducing the character limit truncates the labels.
Group Cases By
Clicking the "+" allows users to select a term to group cases by the categories in the term.
Sort Case Priority
Allows users to set the case sorting priority. By default, cases are sorted 'by presence' under the 'Basic' sort settings. To change the sorting, click on the 'Cases' tab, then click the second option, by consequence. The clustering reloads with the new sorting.
To perform an advanced sorting, click 'Advanced' on the 'Sort Case Priority' menu as shown below.
Users can now sort the cases by each selected row, gene mutation, dictionary variable, or alphabetically by name. Details of each sort option are provided.
Genes
Users can modify the default gene set by clicking the 'Genes' button in the controls as shown. This displays options to edit genes as well as variables from the dropdown.
Modifying Genes
Click the 'Edit Current Group' button under 'Gene set', as shown, to display a panel of the currently selected genes.
Add/Delete a gene
In the search box, type a gene name, for example 'WEE1' as shown, and click submit.
The heatmap loads again after performing a clustering that includes 'WEE1' as shown.
Click on the 'Edit' functionality again within the 'Gene set' menu option. To delete a gene, hover over the gene. A red cross mark will appear as shown.
Click on the gene 'WEE1' to delete it from the gene set. Click submit to redo the clustering.
Load top variably expressed genes
Users can load the top genes that are variably expressed. To do so, click on 'Edit Selected Group' under the 'Genes' controls, then click on the 'Top variably expressed genes' button.
Here, users can set the 'Gene count' with the minimum cutoff to narrow down the list of top variably expressed genes. Users can also choose between all genes and genes from a subset.
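As a rough illustration of what loading the top variably expressed genes means, the sketch below ranks genes by the variance of their expression across cases and keeps the first `gene_count`. The statistic the tool actually uses to rank genes is not stated in this guide, so variance is an assumption. The gene names, values, and function name are hypothetical.

```python
from statistics import pvariance

def top_variable_genes(expression, gene_count):
    """Return the `gene_count` genes with the highest variance across cases."""
    ranked = sorted(
        expression,
        key=lambda gene: pvariance(expression[gene]),
        reverse=True,
    )
    return ranked[:gene_count]

# Hypothetical expression values: gene -> one value per case.
expression = {
    "GENE_A": [1.0, 1.1, 0.9, 1.0],    # nearly constant
    "GENE_B": [0.0, 5.0, 10.0, 15.0],  # highly variable
    "GENE_C": [2.0, 4.0, 2.0, 4.0],    # moderately variable
}
top = top_variable_genes(expression, gene_count=2)
print(top)
```

Genes that barely change across cases contribute little to separating clusters, which is why variable genes are a common starting set for clustering.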
Load MSigDB gene set
The Gene Expression Clustering tool also enables users to load a pre-defined gene set provided by the MSigDB database. The current version enabled is the latest. Click on the dropdown button 'MSigDB (2023.2.Hs) gene set' and choose one of the gene sets as shown.
For example, select a hallmark gene set for 'Hypoxia' as shown.
Note the info icon next to the gene set that provides additional information about this gene set as well as a link to the database and the original publication PMID as shown.
Upon selecting an MSigDB gene set, the genes are updated as shown.
Click 'Submit' to reload the heatmap with the new gene set from MSigDB.
Load gene set
The Gene Expression Clustering tool also enables users to load user-saved custom gene sets. Click on Load gene set and choose one of the user-saved custom gene sets.
Adding a gene as a variable
Users also have the option to add gene variant terms as variables to line up mutation consequences with clustered gene expression data.
To do so, click Genes, then Create New Group after typing a group name.
Click Submit to reload the heatmap with the newly added gene (KRAS in this example) as a variable. This displays the consequence type for the clustered samples for which KRAS has both mutation calls and gene expression data, as shown.
Variables
The button 'Variables' in the controls allows the user to search and select variables that get added below the heatmap.
Click the button 'Variables' to show the following dictionary tree.
Click the '+' button next to 'Demographic' to display all the terms under the parent term as shown. Select the terms 'Ethnicity' and 'Year of birth' and click 'Submit 2 terms'.
Once the variable terms are submitted, the heatmap will display the added variables as shown.
Download
The control panel provides an option to download the plot after users have applied their customizations. Select the 'Download' button as shown below to save the visualization in either SVG or TSV format.
If SVG format is selected, the download is saved to the default download folder, as shown at the bottom of the browser window.
Adjusting the zoom using the zoom buttons
Adjust the zoom level by using the arrows on the input box or by entering a number to view the sample labels as shown.
Heatmap
Selecting cases on the cluster
Cases on the cluster can be selected interactively by clicking on the column dendrograms. Click on the dendrograms above the heatmap as shown. The dendrograms get highlighted in red.
Once the dendrograms are selected, two options are displayed. Users can choose to zoom in on the cases or list all the cases highlighted in the dendrograms.
Clicking a case column
Click on a case label to display the options as shown.
Users may choose to launch:
- a circos plot, by clicking the 'Disco plot' button
- a webpage containing information about the case, by clicking the case id
- the Gene summary page, by clicking on the gene name 'PDGFRA'
Clicking a gene label
Click on a gene row label to display the following options.
Users can rename the variable by deleting the current name ('PDGFRA' here) in the box and typing a new one. Users may also launch the lollipop plot or gene summary page, or remove the row entirely.
Hovering over/Clicking a cell
Hover over a cell of the heatmap to show information about the case. The information displayed includes the case id, the gene name (CCND1), and the z-score transformed value (4.04..).
Variables
Clicking a Variable
Click on a variable (for example 'Project id' here) row label to display the options as shown.
Users can change the variable name (input box), edit the variable to exclude categories ('Edit' button), replace the variable with another one ('Replace' button), or remove the row containing the variable entirely ('Remove' button).
Renaming a variable
To rename a variable, edit the default name of the variable in the input box as shown.
After renaming the variable as per user preference, click 'submit'. The row now shows a new variable name.
Editing a variable
To edit groups within the variable, click the 'Edit' button. Users can then drag categories from group 1 into group 2 to create two separate groups, and can also exclude a category. After making the choice, click 'Apply' to reload the chart.
When editing the "Overall Survival" variable, users can choose between Time to Event or Exit Code. If Time to Event is selected, users have the option to convert the values to z-scores.
Users can drag and drop a variable row that isn't used for clustering to reposition it.
Replacing a variable
To replace a variable, click on the row label for that variable and click Replace. This shows the GDC dictionary from which a user can select a variable of choice as shown.
Removing a variable
To remove a row containing a variable entirely, click on the row label for that variable and click 'Remove'. This removes the entire row from the heatmap.
Legend
Interacting with legend filters
Variables can be filtered via the legend. Click a legend item to display the following options. Users may choose to 'Hide', 'Show only', or 'Show all' categories for the selected variable. This allows users to filter down to the category of choice.