What's that cluster?

Apr 8, 2024

Today I'm going to talk about a feature in flowcytoscript--an R-based workflow for flow-cytometry analysis from the Liston Lab--that automates cluster identification and labeling. I used this recently in the dimensionality reduction post, and I believe it wasn't clear that the cluster annotation was all automated.

The cluster names on this PaCMAP plot are all automatically assigned.

How many times have you read a paper that says something like "Cluster-19 was overrepresented in patients with disease X"? Not a very meaningful statement, is it? It would be nice to know what kind of cells are in Cluster-19, right? By default, clustering algorithms don't tell us this or assign meaningful biological names to clusters. FlowSOM gets part of the way there by giving us star charts showing the expression of markers by cluster. This helps the scientist go through and manually assign cluster names, but it still requires the tedious process of looking at each cluster's expression pattern and matching it to what you already know or what is recorded in databases.

In single-cell RNA-sequencing, the issue of cluster identification has many published solutions. Flow cytometry and scRNA-Seq data, however, have important differences. With scRNA-Seq, even common transcripts may not be detected in every cell, so threshold-based, gating-type approaches are not viable. Clustering helps to group similar cells despite drop-outs. The transcriptional profiles can then be matched to existing datasets where the cell types have been annotated by experts. With flow data, we get measurements for every parameter on every cell, but we only get measurements for the markers we decide to include in the panel. For instance, with a panel that included only T cell markers, we would struggle to identify different populations of dendritic cells because we just don't have the right information.

In flowcytoscript, clusters are identified based on matching to a simple table in a spreadsheet. This process is adapted from sc-type by IanevskiAleksandr. Here's how it works:

1. The script asks for information about where your cells come from (tissue, blood), the species and whether you've pre-gated on a specific cell type. This is used to filter the possible matches later.
2. The script cleans up the marker names, standardizing them from various possible synonyms that people might use. For example, CD134 and OX40 would both become CD134. Foxp3, FoxP3 and FoxP-3 would all become Foxp3 (mouse) or FoxP3 (human). See below.
3. For each cluster, we take the median expression (MFI) for each marker being analyzed.
4. For each marker, the MFIs for all clusters are normalized to a scale from 0 to 1. This means that the clusters are effectively ranked in terms of which expresses the most of each marker.
5. Those scaled expression values for each cluster are scored for how well they match the descriptions of cell types in the spreadsheet. The spreadsheet lists both markers that are and are not expressed by the cell type. If the cluster has high expression (near 1) of a matching marker, it counts towards the score; if it has low expression of a marker that should not be present, that also counts as a plus. Conversely, expressing a marker that should be absent reduces the score. The script generates scores for each cluster for every cell type in the database.
6. The cell type from the database with the highest matching score gets assigned as the name of the cluster.
7. Next, if there are multiple clusters with the same assigned name (e.g., two follicular B cell clusters), the script tries to determine which markers best differentiate these clusters. It finds the markers with the highest variability between clusters with the same basic name (Follicular B cell) and appends the positively expressed marker(s) to the cluster names until they aren't identical. For instance, we might have two Follicular B cell clusters, but one might have higher expression of CD40 and the other higher expression of CD23. These would end up being named Follicular B cell CD40 and Follicular B cell CD23.
8. Finally, the script gives you complete control over manually renaming the clusters if you so desire.
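
To make the median, scaling, scoring and naming steps concrete, here is a minimal Python sketch. It is not the flowcytoscript code (which is written in R); the cell data, the cell type table, the function names and the exact scoring rule (mean of expected markers minus mean of markers that should be absent) are all assumptions made for illustration.

```python
from statistics import median

# Hypothetical per-cell data: (cluster id, {marker: expression}).
cells = [
    ("Cluster-1", {"CD3": 900, "CD19": 20, "CD11b": 30}),
    ("Cluster-1", {"CD3": 1100, "CD19": 40, "CD11b": 10}),
    ("Cluster-2", {"CD3": 15, "CD19": 800, "CD11b": 25}),
    ("Cluster-2", {"CD3": 25, "CD19": 1000, "CD11b": 35}),
    ("Cluster-3", {"CD3": 10, "CD19": 30, "CD11b": 1200}),
    ("Cluster-3", {"CD3": 30, "CD19": 10, "CD11b": 1400}),
]

# Hypothetical cell type table: markers expected to be present / absent.
cell_types = {
    "T cell": {"pos": ["CD3"], "neg": ["CD19", "CD11b"]},
    "B cell": {"pos": ["CD19"], "neg": ["CD3", "CD11b"]},
    "Neutrophil": {"pos": ["CD11b"], "neg": ["CD3", "CD19"]},
}


def cluster_medians(cells):
    """Median expression (MFI) of every marker within each cluster."""
    grouped = {}
    for cluster, expression in cells:
        for marker, value in expression.items():
            grouped.setdefault(cluster, {}).setdefault(marker, []).append(value)
    return {c: {m: median(v) for m, v in markers.items()}
            for c, markers in grouped.items()}


def scale_0_1(medians):
    """Min-max scale each marker across clusters so values run from 0 to 1."""
    markers = {m for values in medians.values() for m in values}
    scaled = {c: {} for c in medians}
    for m in markers:
        column = [medians[c][m] for c in medians]
        lo, hi = min(column), max(column)
        for c in medians:
            scaled[c][m] = (medians[c][m] - lo) / (hi - lo) if hi > lo else 0.0
    return scaled


def score(scaled_cluster, definition):
    """Assumed rule: reward expected markers, penalize ones that should be absent."""
    pos = [scaled_cluster[m] for m in definition["pos"] if m in scaled_cluster]
    neg = [scaled_cluster[m] for m in definition["neg"] if m in scaled_cluster]
    pos_part = sum(pos) / len(pos) if pos else 0.0
    neg_part = sum(neg) / len(neg) if neg else 0.0
    return pos_part - neg_part


def name_clusters(cells, cell_types):
    """Assign each cluster the cell type with the highest score."""
    scaled = scale_0_1(cluster_medians(cells))
    return {
        cluster: max(cell_types, key=lambda t: score(values, cell_types[t]))
        for cluster, values in scaled.items()
    }


names = name_clusters(cells, cell_types)
print(names)
```

With this toy data, Cluster-1 is named T cell, Cluster-2 B cell and Cluster-3 Neutrophil. Because the scaling is relative to the other clusters in the same dataset, the same cluster could score differently if the set of clusters changed.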

Marker name matching and correction

At the moment, there's support for human and mouse. All of the spreadsheets are easily edited to add in new cell types or change the definitions to better suit your panel design. If you can't figure out how to do that, but want changes, send me an email and I'll add them in.

The names should be considered a good starting point for identifying your cell types. If you don't have the right markers in your panel, accurate identification won't be possible for all cell types. For instance, with a panel including CD4, CD3, CD8, CD11c (or CD123) and MHCII (or HLA-DR), flowcytoscript would be able to identify T cells and dendritic cells. It will also identify another prominent cell type in the lymphocyte gate expressing MHCII (HLA-DR) in the absence of the other markers. As an immunologist, you can say that's probably a B cell. The script doesn't know that, and might suggest B cell, ILC3 or even a monocyte, all of which express MHCII (HLA-DR) but lack CD3, CD4, CD8 and CD123.
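
The ambiguity in that example can be illustrated with a small Python sketch. This is not the script's code: the cell type definitions, the scaled values and the scoring rule (mean of expected markers minus mean of markers that should be absent, using only markers present in the panel) are assumptions chosen to show why several cell types can tie when the distinguishing markers are missing.

```python
# Hypothetical scaled (0 to 1) expression for one cluster, limited to the panel.
cluster = {"CD3": 0.0, "CD4": 0.05, "CD8": 0.0, "CD123": 0.1, "HLA-DR": 1.0}

# Hypothetical definitions. CD19, CD14 and CD127 are not in the panel,
# so they cannot contribute to the score.
cell_types = {
    "B cell": {"pos": ["CD19", "HLA-DR"], "neg": ["CD3"]},
    "Monocyte": {"pos": ["CD14", "HLA-DR"], "neg": ["CD3"]},
    "ILC3": {"pos": ["CD127", "HLA-DR"], "neg": ["CD3"]},
    "CD4 T cell": {"pos": ["CD3", "CD4"], "neg": ["CD8"]},
}


def score(cluster, definition):
    """Score a cluster against one definition using only measured markers."""
    pos = [cluster[m] for m in definition["pos"] if m in cluster]
    neg = [cluster[m] for m in definition["neg"] if m in cluster]
    pos_part = sum(pos) / len(pos) if pos else 0.0
    neg_part = sum(neg) / len(neg) if neg else 0.0
    return pos_part - neg_part


scores = {name: score(cluster, d) for name, d in cell_types.items()}
best = max(scores.values())
tied = sorted(name for name, s in scores.items() if s == best)
print(tied)
```

Here B cell, ILC3 and Monocyte all receive the same top score, because the only informative markers left are HLA-DR (present) and CD3 (absent). Adding CD19 to the panel would be what breaks the tie.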

Some cases where you may run into issues:

The script will struggle with unmixing or compensation errors that give artificially high or low expression values. Similarly, if you have uncorrected autofluorescence, for instance causing your eosinophils to appear positive for CD3, this will affect identification. If you've under-clustered, so that biologically distinct cell types (e.g., CD4 and CD8 T cells) end up in the same cluster, that cluster will be hard to identify properly. Finally, the script struggles when the clusters have very few events and the data are noisy.

If your markers aren't in the marker table, you'll need to add them. If your cell type isn't in the cell type spreadsheet, you'll need to add that with the appropriate definition.
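
The marker-name clean-up described earlier amounts to a synonym lookup. The Python sketch below shows one way such a lookup could work; the table contents, the names `MARKER_SYNONYMS`, `SPECIES_SPELLING` and `standardize`, and the choice to ignore case and punctuation are all assumptions, not the layout of the actual flowcytoscript marker table.

```python
import re

# Hypothetical synonym table: canonical name -> spellings people might use.
MARKER_SYNONYMS = {
    "CD134": ["CD134", "OX40", "OX-40"],
    "Foxp3": ["Foxp3", "FoxP3", "FoxP-3"],
}

# Species-specific spelling of a canonical name (Foxp3 in mouse, FoxP3 in human).
SPECIES_SPELLING = {"human": {"Foxp3": "FoxP3"}}


def _key(name):
    """Lower-case and drop punctuation so 'FoxP-3' and 'foxp3' compare equal."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


# Reverse lookup: normalized spelling -> canonical name.
LOOKUP = {_key(s): canon
          for canon, synonyms in MARKER_SYNONYMS.items()
          for s in synonyms}


def standardize(name, species="mouse"):
    """Return the standard marker name, or None if the marker is not in the table."""
    canon = LOOKUP.get(_key(name))
    if canon is None:
        return None  # unknown marker: it would need adding to the table
    return SPECIES_SPELLING.get(species, {}).get(canon, canon)


print(standardize("OX40"), standardize("FoxP-3", "human"), standardize("CD999"))
```

A marker that returns `None` here corresponds to the case above where the marker table needs a new entry before the cluster can be matched on it.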

Here's an example heatmap output from the automated cluster identification:

In this heatmap from a murine flow dataset, we have the unnamed clusters in columns (labels on the bottom). On the right, we have the cell types in the spreadsheet. Red indicates a high score; pale indicates a low-scoring match. We can see that Cluster-22 scores highest for Neutrophil, followed by similar cell types including Monocyte, Eosinophil and then classical DCs. These all express high levels of CD11b, while lacking CD3, CD4, CD8, NK1.1 and CD19.

Here's the same heatmap without the dendrogram grouping the cell types and clusters:

To a flow cytometrist, it might be easier to look at the data as histograms rather than a heatmap. The script spits out an image like the one below with the expression of each marker for every cluster.

Here's another example from a simpler experiment, cropped so we can see a bit better.

This dataset was pre-gated on Tregs. We can see that there are only two basic cell types (Naive CD4 Treg & Act CD4 Treg) that are generated in steps 1-6 above. The rest are markers that have been appended to differentiate the cluster names. The first Naive CD4 Treg cluster (red) expresses high levels of both CCR7 and CCR9, whereas the second (blue) only expresses CCR7. Hence the names Naive CD4 Treg CCR7 CCR9 and Naive CD4 Treg CCR7.
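
The marker-appending step can be sketched in Python as follows. This is an illustration under assumptions, not the script's implementation: the scaled values are invented, a third hypothetical Naive CD4 Treg cluster is included so that CCR7 varies within the group, variability is measured as population variance, and "positively expressed" is taken to mean a scaled value of at least 0.5.

```python
from statistics import pvariance

# Names assigned by the scoring step (hypothetical).
names = {
    "Cluster-1": "Naive CD4 Treg",
    "Cluster-2": "Naive CD4 Treg",
    "Cluster-3": "Naive CD4 Treg",
    "Cluster-4": "Act CD4 Treg",
}

# Hypothetical scaled (0 to 1) expression per cluster.
scaled = {
    "Cluster-1": {"CCR7": 1.0, "CCR9": 0.8, "CD25": 0.6},
    "Cluster-2": {"CCR7": 1.0, "CCR9": 0.1, "CD25": 0.6},
    "Cluster-3": {"CCR7": 0.0, "CCR9": 0.1, "CD25": 0.6},
    "Cluster-4": {"CCR7": 0.1, "CCR9": 0.0, "CD25": 1.0},
}


def disambiguate(names, scaled, threshold=0.5):
    """Append markers to clusters that share a name until the names differ."""
    result = dict(names)
    groups = {}
    for cluster, name in names.items():
        groups.setdefault(name, []).append(cluster)

    for members in groups.values():
        if len(members) < 2:
            continue  # name is already unique
        # Rank markers by how much they vary among the same-named clusters.
        markers = sorted(
            scaled[members[0]],
            key=lambda m: pvariance([scaled[c][m] for c in members]),
            reverse=True,
        )
        for marker in markers:
            if len({result[c] for c in members}) == len(members):
                break  # all names in this group are now distinct
            for c in members:
                if scaled[c][marker] >= threshold:
                    result[c] += " " + marker
    return result


final_names = disambiguate(names, scaled)
print(final_names)
```

With these made-up values, CCR7 is appended first and CCR9 second, giving Naive CD4 Treg CCR7 CCR9, Naive CD4 Treg CCR7 and an unchanged Naive CD4 Treg, while the single Act CD4 Treg cluster keeps its name. The real script may rank or threshold markers differently.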

Finally, if you want to use the script, it's free for academic use. You can find two versions on GitHub: one for people comfortable in R and one intended to be easy to use for those who aren't. Video tutorials for the simplified version are on YouTube. All of these links are also in the Data Analysis section.

Relevant reading and links:

Review on cell type identification in scRNA-Seq

sc-type (GitHub)

sc-type (publication)

Cluster Explorer (FlowJo)

scCATCH

scAnnotatR

Seurat--identify cell types

Colibri Cytometry
