Data analysis: Scaling your data for dimensionality reduction

Today we're talking about scales. Very boring, very important. The scale of the plots presenting your flow cytometry data is critically important for getting good results with high dimensional data analysis. The way the algorithms interact with the data depends on how the data are scaled, and you can control the way they "see" the data by reshaping it yourself.

Note that I'm far from the first to discuss this, and there are excellent papers on this, including this one from den Braanker et al., this one from Liechti et al. dealing with UMAP, and this one from Quintelier et al. dealing with scaling of data for FlowSOM. Andy Filby also has really nice work in this area with the OPTIMAL imaging mass cytometry pipeline.

In this post we'll walk through some examples of how to scale your data appropriately in FlowJo, and the impact that changing the scale has on tSNE. Unlike some other algorithms, tSNE is actually relatively robust to changes in scale. Frightening, given what we'll see below.

First, why do you want to change the scale? Shouldn't the scale be "correct" already? Well, FlowJo is actually very poor at this, especially with non-BD machines. The recent update makes data transformation customizable per cytometer and panel, but the default when importing data from a new machine is not great. The linked article on data transformation contains excellent references on why biexponential scaling is better for fluorescent flow data than linear or logarithmic.

As an example, here's what happens when we open some data from the Sony ID7000 in FlowJo.

The FSC and SSC appear with logarithmic axes. These parameters are best viewed on a linear scale. To change this, we click on the "T" (transform) and change the axes to linear. All the scatter parameters come up as logarithmic by default, so we have to change them before we gate our singlets as well. Once we get to the fluorescence parameters, we again get logarithmic axes, but these should be biexponential. To change this, we click on the "T", then select "Customize axis". In this pop-up, we can select from several scaling options in a drop-down menu, and we can also adjust the parameters of these transformations. Clicking "Save" applies the transform.

For a biexponential transform, adjusting the width basis is the best place to start. Changing it compresses the negative population to a greater or lesser degree (check the video to see this in action). For dimensionality reduction, we want the negative to take up less space on the plot than the positives, or at least no more. Remember that if it's truly negative, then there is no information in that space. My rule of thumb is to set the width of the negative to be similar to the width of a lineage gating marker (like CD4) that we also consider to have no meaningful information in its level of expression. Obviously, that's debatable. What you don't want is for the algorithm to "see" the negative as representing meaningful variation in expression if you think there isn't any there.

In the video below, you can see how changing the negative and positive log decades affects the data visualization.

If you need extra negative log decades, that's usually an indication that you have a spillover error or massive spread.

Creating more compression with the positive log decades slider should only be necessary if you have a panel that's pushing the limits of what your machine can handle. This means you'd have lots of positive spreading error or unmixing distortion.

We can also set all the axes quickly to the same scale by selecting them all.

For some markers, though, the expression may be low, and for best results, you'll want to adjust each axis individually.

The staining in today's post is all short surface staining. Siglec F works a lot better overnight.

Let's have a look at a tSNE produced with all the axes scaled the same. Here's how we set up a tSNE in FlowJo:

This is an optSNE version of tSNE and we've got CD14 expression overlaid as a heatmap statistic. Since this particular dataset is all markers that form bimodal expression patterns, we get tidy, well separated islands.

What if we don't scale the data well? If we set the width basis parameter for the biexponential to -10 instead of -1000, the data change to look like this:

Panels: width basis -1000 (first) and width basis -10 (second)

And the tSNE looks like this:

We now have lots more blobs! That means more clusters, new cell types, we're going to be famous! But wait, this data only has one marker for human monocytes (CD14, expression overlaid above)--how can we have multiple islands?

We don't. This is all an artefact of setting the scales inappropriately, bifurcating the negative so that it gets pushed and pulled by tSNE into multiple groups. Notably we also don't get the CD14 monocyte cluster separating from other cell groups.
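If you want to convince yourself of this outside FlowJo, here's a minimal Python sketch. It uses arcsinh with a cofactor as a simple stand-in for the biexponential: it is not FlowJo's transform, and the cofactor is only a rough analogue of the width basis. The simulated negative population, the two cofactors and the bin edges are all made-up values for illustration.

```python
import math
import random


def three_bin_fractions(values, cofactor, half_width=0.5):
    """Share of events left of, around, and right of zero after arcsinh scaling."""
    scaled = [math.asinh(v / cofactor) for v in values]
    n = len(scaled)
    left = sum(1 for s in scaled if s < -half_width) / n
    right = sum(1 for s in scaled if s > half_width) / n
    return left, 1.0 - left - right, right


rng = random.Random(0)
# Hypothetical negative population: no real signal, only spread around zero.
negative = [rng.gauss(0.0, 300.0) for _ in range(20000)]

# Same events, two scales (cofactors chosen only to mimic "wide" vs "narrow").
well_scaled = three_bin_fractions(negative, cofactor=1000.0)
over_stretched = three_bin_fractions(negative, cofactor=10.0)

for label, fractions in (("cofactor 1000", well_scaled), ("cofactor 10", over_stretched)):
    left, centre, right = fractions
    print(f"{label}: left={left:.2f} centre={centre:.2f} right={right:.2f}")
```

With the larger cofactor, most events should sit in the centre bin as one compact negative. With the smaller one, the centre empties and the same events pile into two lobes either side of zero, which is exactly the kind of structure a dimensionality reduction algorithm will happily treat as two populations.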

We can also use other data transformations. Arcsinh is commonly used for mass cytometry data analysis and also for flow cytometry data analysis in R. I'm not entirely sure why this is popular because the biexponential transform is available in flowCore, and a direct replica of the FlowJo biexponential transform can be had from flowWorkspace. Ideally, we'd be viewing the data the same way when we perform manual or high dimensional analysis, so if you plan to use Arcsinh in your code, maybe use that for your gating as well.
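For anyone unfamiliar with arcsinh, a short Python sketch may help show why it gets used in the same role as the biexponential: it behaves roughly linearly close to zero and roughly logarithmically far from it, with a cofactor setting where the switch happens. The cofactor value and the raw intensities below are illustrative assumptions, not recommendations for any particular instrument.

```python
import math


def arcsinh_transform(value, cofactor):
    """arcsinh(value / cofactor): near-linear for |value| << cofactor, near-logarithmic beyond."""
    return math.asinh(value / cofactor)


COFACTOR = 150.0  # illustrative value only

for raw in (-1500.0, -15.0, 0.0, 15.0, 1500.0, 15000.0, 150000.0):
    print(f"{raw:>10.0f} -> {arcsinh_transform(raw, COFACTOR):7.3f}")

# Close to zero the output tracks value / cofactor (linear region).
near_zero = arcsinh_transform(15.0, COFACTOR)
# Far from zero, a tenfold increase adds about ln(10) (logarithmic region).
decade_step = arcsinh_transform(150000.0, COFACTOR) - arcsinh_transform(15000.0, COFACTOR)
print(f"near zero: {near_zero:.4f} vs linear {15.0 / COFACTOR:.4f}")
print(f"one decade: {decade_step:.4f} vs ln(10) {math.log(10):.4f}")
```

Negative values are handled symmetrically, so there is no problem with events below zero, but everything about how wide the negative looks is controlled by that single cofactor.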

Panels: biexponential (first) and arcsinh (second)

This looks a lot more like CyTOF data. In this visualization, we have sharp cut-offs at the axes. We can expect this to affect the tSNE output.

First, we see a lot more blue in the expression overlay of CD14. This "cold" indicates low expression, which reflects that the data are now piled up on the axis. Visually, that's nice because we get better separation in the color scale.

Again the CD14+ monocytes are broken into multiple blobs. Why is this? What this is telling us is that tSNE is "seeing" something that is different between these blobs. In this case, it's one of the autofluorescence channels that I've included in the analysis.

Here, the colored overlay indicates autofluorescence. Is there actually a difference in expression, though?

No.

What's happened is that all the parameters have been scaled with the same transformation. For autofluorescence, this wasn't an appropriate scale, and the cells are all bifurcating equally and artefactually. Be extra cautious if you're including AF or scatter channels in your dimensionality reduction analysis as these may require quite different scales. As I mentioned above, ideally you want to go through and tailor the transformation for each parameter.
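If you're doing this in code rather than in FlowJo, tailoring per parameter can be as simple as a lookup table. The sketch below is only one way to do it, again using arcsinh as a stand-in transform. The channel names, the cofactors and the idea that the AF channel wants a larger cofactor are hypothetical values for illustration; the right numbers have to come from looking at your own data.

```python
import math

# Hypothetical channel names and cofactors, purely for illustration.
CHANNEL_COFACTORS = {"CD14": 1000.0, "CD4": 1000.0, "AF": 5000.0}
DEFAULT_COFACTOR = 1000.0


def transform_event(event, cofactors=CHANNEL_COFACTORS, default=DEFAULT_COFACTOR):
    """Scale each channel of one event with its own cofactor, else the default."""
    return {
        channel: math.asinh(value / cofactors.get(channel, default))
        for channel, value in event.items()
    }


# One made-up event: raw intensities per channel.
event = {"CD14": 25000.0, "CD4": -200.0, "AF": 3000.0}

tailored = transform_event(event)             # per-channel cofactors
shared = transform_event(event, cofactors={})  # every channel gets the default

print("tailored:", tailored)
print("shared:  ", shared)
```

The markers that already had a sensible scale come out the same either way. Only the AF value moves, and with the tailored cofactor it sits much closer to zero instead of being stretched out as if it carried real signal.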

Sometimes groups working with high dimensional flow datasets will hand off the analysis to a mathematician or bioinformatician. Unless you're lucky, these people may not have much experience with flow cytometry data. They might do things like scaling and centering all the parameters to have equal variance. If you looked just at histograms, you might see something like this:
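To see why equal-variance scaling is risky, here's a small Python sketch with simulated values. One marker is genuinely bimodal and the other has no expression at all, just spread in the negative. The population positions, spreads and event counts are invented for illustration and are meant to represent data already on some transformed scale.

```python
import random
import statistics


def zscore(values):
    """Centre to mean 0 and scale to standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


rng = random.Random(1)
# Hypothetical marker with real expression: a negative and a positive population.
bimodal = [rng.gauss(0.0, 0.3) for _ in range(5000)] + [rng.gauss(4.0, 0.3) for _ in range(5000)]
# Hypothetical marker with no expression: only spread around zero.
noise_only = [rng.gauss(0.0, 0.3) for _ in range(10000)]

raw_sd_bimodal = statistics.pstdev(bimodal)
raw_sd_noise = statistics.pstdev(noise_only)
scaled_sd_bimodal = statistics.pstdev(zscore(bimodal))
scaled_sd_noise = statistics.pstdev(zscore(noise_only))

print(f"before: bimodal sd={raw_sd_bimodal:.2f}, noise-only sd={raw_sd_noise:.2f}")
print(f"after:  bimodal sd={scaled_sd_bimodal:.2f}, noise-only sd={scaled_sd_noise:.2f}")
# How much each marker was stretched (>1) or squeezed (<1) by the scaling.
print(f"stretch: bimodal x{1 / raw_sd_bimodal:.2f}, noise-only x{1 / raw_sd_noise:.2f}")
```

Before scaling, the real marker varies far more than the empty one. Afterwards they have identical variance, which means the empty negative has been stretched while the real signal has been squeezed, and the algorithm now gives both equal weight.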

The histograms look like reasonable bimodal expressions, a bit poorly separated. The UMAP plots on the right are a bit compressed, but who knows?

Actually, this is all just spread in the negative of Aurora data that isn't correctly scaled in FlowJo. These samples, from mouse salivary gland, have very few CD19+ B cells, and express essentially no TIGIT.

Looking at biplots, this is a bit easier to see.

Putting the data on a more appropriate scale gives us this big change in appearance:

Now we see well separated islands on the UMAP, which correspond nicely with the colored FlowSOM clusters.

As a final word of caution, on some cytometers you need to export the data in a specific, non-default format in order to use them for dimensionality reduction in R or programs like FlowJo or FCS Express. The Bio-Rad ZE5 has this quirk.

If this all seems really complicated, you may want to check out this simplified data analysis pipeline in R. The scaling is all applied automatically using a biexponential transformation appropriate to the cytometer the data were recorded on. At the moment, the script supports BD machines (not Discover S8 yet), Cytek Aurora and Sony ID7000.

Colibri Cytometry
