Data Analysis: EmbedSOM's party trick

Tired of waiting for tSNE to run on large data sets? What if we could run any dimensionality reduction approach, but in a fraction of the time? Well, with EmbedSOM from Mirek Kratochvil, you can. Let's look at how this works.

The basic approach of EmbedSOM is to create an embedding (dimensionality reduction) based on a self-organizing map (SOM) of the data. We saw two different versions of this in the post on dimensionality reductions: the standard SOM and the improved growing quadtree (GQT) SOM. In both cases, the embedding is generated using landmarks (randomly sampled cells) from the data. The landmarks fix positions in the dimensionality reduction, and the other cells are mapped around them (please correct me if this simplified explanation is wrong). What we can do, however, is use landmarks from any dimensionality reduction algorithm we want. This speeds up the embedding process massively because the slow part (in, say, tSNE) only needs about 1000 cells to generate the landmarks; after that, the rest of the cells can be mapped onto this tSNE very, very quickly using EmbedSOM.
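To make the landmark idea concrete, here is a toy sketch in base R. It is not EmbedSOM's actual algorithm: the data are simulated, PCA stands in for the slow reduction, and `project_cell` is a hypothetical helper that simply places each cell at the distance-weighted average of its nearest landmarks.

```r
# Toy landmark projection in base R (an illustration, not EmbedSOM's algorithm).
set.seed(42)

# Simulated data: 500 "cells" x 5 "markers", two shifted populations.
toy.data <- rbind(
  matrix(rnorm(250 * 5, mean = 0), ncol = 5),
  matrix(rnorm(250 * 5, mean = 4), ncol = 5)
)

# 1) Pick a small random set of landmark cells.
n.landmarks <- 50
landmark.idx <- sample(nrow(toy.data), n.landmarks)
landmarks.hd <- toy.data[landmark.idx, , drop = FALSE]

# 2) Run the "slow" reduction on the landmarks only (PCA stands in for tSNE).
landmarks.2d <- prcomp(landmarks.hd)$x[, 1:2]

# 3) Place each cell at the distance-weighted average of its k nearest landmarks.
project_cell <- function(cell, k = 5) {
  d <- sqrt(colSums((t(landmarks.hd) - cell)^2))  # distance to every landmark
  nn <- order(d)[seq_len(k)]                      # indices of the k nearest
  w <- 1 / (d[nn] + 1e-9)                         # closer landmarks weigh more
  colSums(landmarks.2d[nn, , drop = FALSE] * w) / sum(w)
}
toy.embedding <- t(apply(toy.data, 1, project_cell))
```

The expensive step only ever sees 50 cells; everything else is a cheap nearest-landmark lookup, which is the reason the trick scales.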

How does this work practically? You'll need to be using R. Let's run through some examples.

First, here's a reminder of how to create a standard embedding with EmbedSOM:

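# train a 24 x 24 self-organizing map on the data
flow.som <- EmbedSOM::SOM( input.data, xdim = 24, ydim = 24,
                           batch = TRUE, parallel = TRUE, threads = 0 )

# project every cell into 2D using the trained map
embed.som <- EmbedSOM::EmbedSOM( data = input.data, map = flow.som,
                                 parallel = TRUE )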

First, we create a flowSOM map, then use that to create the embedding. Here's the result, with flowSOM-generated clusters colored. As a reminder, these cluster names have been generated automatically based on marker expression (detailed here).

An EmbedSOM

Alternatively, here's the GQT version:

# train a growing quadtree SOM with roughly 1000 landmark codes
gqt.map <- EmbedSOM::GQTSOM( input.data, target_codes = 1000,
                             radius = c(10, 0.1), rlen = 15, parallel = TRUE )

embed.gqt <- EmbedSOM::EmbedSOM( input.data, map = gqt.map,
                                 parallel = TRUE, threads = 0 )

A GQT-EmbedSOM

The separation seems a bit better.

Now, let's look at how to use other dimensionality reductions as landmarks. For tSNE and UMAP, there are built-in functions in the EmbedSOM package to do this:

# run tSNE on 2000 randomly sampled cells to create the landmark map
tsne.map <- EmbedSOM::RandomMap( input.data, 2000,
                       coords = EmbedSOM::tSNECoords( perplexity = 30,
                                 check_duplicates = FALSE,
                                 pca = FALSE,
                                 max_iter = 750, stop_lying_iter = 75,
                                 eta = 2000, exaggeration_factor = 4,
                                 num_threads = 0 ) )

# map all cells onto the tSNE landmarks
embed.tsne <- EmbedSOM::EmbedSOM( input.data, map = tsne.map,
                                  parallel = TRUE, threads = 0 )

What we're doing here is running a tSNE on 2000 cells, then using this as a map to set up the embedding. As you can see, we can specify all the same parameters as we would normally for running the tSNE.

And the result looks a lot more like a tSNE! Only, this completed in 5.5 seconds on my laptop, whereas the tSNE took 5.5 minutes.

An Embed-SNE

Unlike the basic EmbedSOM plots above, the cells are much more compressed into separate islands like they are in a tSNE. This makes sense because we've used the tSNE of 2000 cells to decide where to put the other cells based on similarity. Unlike a real tSNE (below), there are cells trailing between islands that presumably haven't mapped well to any landmarks.

A real tSNE

Note that we shouldn't expect the clusters to be in similar positions on the graph because we haven't started with the same input (2000 cells versus all cells), and the initial steps are randomish.
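The timings quoted above are from my laptop, so yours will differ. If you want to compare the two routes on your own data, a minimal base R sketch is below. `time_seconds` is a hypothetical helper name, and `Sys.sleep` is only a stand-in for the real call.

```r
# Elapsed wall-clock seconds for any expression (base R only).
time_seconds <- function(expr) {
  # expr is evaluated lazily, inside system.time(), so its run time is captured
  unname(system.time(expr)["elapsed"])
}

# Stand-in workload; swap in the tSNE or EmbedSOM call you want to time.
elapsed <- time_seconds(Sys.sleep(0.2))
```

Wrapping the landmark step and the embedding step separately shows where the time actually goes.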

Similarly, we can use UMAP to initiate the embedding.

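# run UMAP (default settings) on 2000 randomly sampled cells
umap.map <- EmbedSOM::RandomMap( input.data, 2000,
                                 coords = EmbedSOM::UMAPCoords() )

# map all cells onto the UMAP landmarks
embed.map <- EmbedSOM::EmbedSOM( input.data, map = umap.map,
                                 parallel = TRUE, threads = 0 )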

This took about 10 seconds to calculate instead of about 5 minutes.

An EmbedMAP

Reminder, this is what a real UMAP looks like:

Only tSNE and UMAP are supported directly with built-in functions. What if we want to do something else? All we need to do is replicate the structure of the flowSOM map. This is a list containing two elements: 1) the high dimensional data for each cell, and 2) the dimensionality reduced data for the same cells.

Let's have a look at this with PHATE:

library(phateR)

# select 2000 cells to serve as landmarks
landmark.data <- input.data[sample(nrow(input.data), 2000), ]

# run PHATE on these 2000 cells
# (date.seed is assumed to be an integer seed defined earlier in your script)
phate.landmarks <- phate(landmark.data, n.jobs = -1, seed = date.seed)

# create a list with both things together
phate.map <- list()
phate.map$codes <- landmark.data              # high-dimensional landmark data
phate.map$grid <- phate.landmarks$embedding   # 2D PHATE coordinates

# create the embedding using the PHATE landmarks
embed.phate <- EmbedSOM::EmbedSOM( data = input.data, map = phate.map,
                                   parallel = TRUE, threads = 0 )

An EmbedPHATE

A real PHATE

Both plots have the cells seeming to radiate out in branches from a central point, which is the idea behind PHATE. The branching structure is clearer in the real PHATE, not surprisingly. Notably, PHATE is slow to calculate the KNN, taking almost 7 minutes. The EmbedSOM version? 25 seconds.
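The PHATE recipe generalizes to any algorithm that returns 2D coordinates. Here is a hedged sketch of a reusable wrapper. `make_landmark_map` is a hypothetical helper (it is not part of the EmbedSOM package), the data are simulated, and PCA stands in for the slow algorithm so the example runs with base R alone.

```r
# Build a landmark map (a list with $codes and $grid) from any 2D reduction.
# make_landmark_map() is a hypothetical helper, not an EmbedSOM function.
make_landmark_map <- function(data, reduce_fun, n.landmarks = 2000) {
  data <- as.matrix(data)
  n.landmarks <- min(n.landmarks, nrow(data))   # cannot sample more than we have
  idx <- sample(nrow(data), n.landmarks)
  codes <- data[idx, , drop = FALSE]            # high-dimensional landmarks
  grid <- as.matrix(reduce_fun(codes))          # reduced coordinates, same rows
  stopifnot(nrow(grid) == nrow(codes), ncol(grid) >= 2)
  list(codes = codes, grid = grid[, 1:2, drop = FALSE])
}

# Example: PCA stands in for the slow algorithm on simulated data.
set.seed(1)
demo.data <- matrix(rnorm(300 * 6), ncol = 6)
pca.map <- make_landmark_map(demo.data, function(x) prcomp(x)$x,
                             n.landmarks = 100)
```

The resulting list should be usable the same way as `phate.map` above, that is, passed as the `map` argument to `EmbedSOM()`. For PHATE, `reduce_fun` would be something like `function(x) phate(x)$embedding`.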

DenSNE gave a really nice representation of the data, compressing areas of cells with low diversity in phenotypes and dedicating more space to cells with greater divergence.

DenSNE

DenSNE didn't run in parallel, however, and so it took almost 37 minutes to run on this data. With EmbedSOM's cheap trick, we can run this in 19 seconds.

An Embed-DenSNE

Unfortunately, I think we've lost most of the density preservation that DenSNE provides. This would probably work better with more landmarks, at the expense of speed.

Finally, here's an EmbedSOM embedding based on PaCMAP landmarks:

Where might we use this trick? I think that any time we have lots of data and we want to do an exploratory analysis, this would be a great starting point. We can actually get a reasonable visualization of the data in a much shorter calculation time, and we can adapt the visualization to our favorite algorithm, allowing us to preserve different features of the data (local structure, global structure, perhaps density). This trick makes it less intimidating to work with large datasets.

Colibri Cytometry
