The peculiarities of Phenograph

Phenograph is one of the more popular clustering algorithms for flow cytometry. Let's look today at a couple of odd things that can happen with it so you can be on the watch for this behavior when running your analysis.

TLDR: Use the Rphenograph implementation if you're using R, fix a random seed for reproducibility and use at least 100,000 cells for input if you can.

To explore these Phenograph oddities, today we'll be using example data from OMIP-102, pulling out the CD4+CD3+ T cell population for clustering and optSNE-like tSNE visualization.

Cell number

A neat feature of Phenograph is its ability to cluster your cells without you telling it how many clusters to look for. This is nice because, frankly, the alternative of searching for clusters based on a number in your head is not a great way to go about it. So, a ready-made solution that analyzes your data and tells you what's going on sounds great.

There are a couple of issues with this. First, Phenograph starts with a k-nearest neighbors network, and if each cell doesn't have many neighboring cells, distinguishing between groups becomes more difficult. In short, the number of clusters you get back depends on how many cells you put in. Not great.

Running Rphenograph on increasing cell numbers from the same data

From this plot, we can see that the curve is likely going to level out, so you can probably solve this in practice by always running lots of cells.

For the plot above, I've taken the first 500, 1000, 5000, 10000, 50000 or 100000 cells from the data, and have run Rphenograph on the same markers. The same random seed is set before each run. Example of the code below for 500 cells:

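# packages: data.table for the ..clustering.cols syntax, igraph for membership()
library( data.table )
library( igraph )
library( Rphenograph )

# downsample to the first 500 cells
input.500 <- omip102.cd4.data[ 1:500, ]
# run phenograph, setting the same seed before each run
set.seed( 42 )
rpheno.500 <- Rphenograph( as.matrix( input.500[ , ..clustering.cols ] ),
	k = 30 )
# the second list element holds the communities; store one label per cell
input.500$Cluster <- factor( membership( rpheno.500[[ 2 ]] ) )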

What's going on here? If we look at the tSNE plots, we can see that not only does the total number of clusters change, but the cells change which clusters they belong to. The big cluster on the upper left changes from one to two, and the bottom blob splits and then fuses again.

The tSNE here is run on all the cells (same markers) independently of the Phenograph algorithm, so we can map each cell to the same position on the tSNE with each clustering run. Apologies for the colors, but color schemes are kind of difficult for lots of divisions.

Setting a random seed

Phenograph is supposed to be deterministic, meaning it should always produce the same results from the same input data and settings. In practice, setting different random seeds produces different results, which is odd. I can't find any mention of a need to fix a random seed in the documentation for the R packages, although there are some discussions of this online. It's possible that Phenograph takes care of this internally, but I'd suggest setting the random seed yourself to make sure you can repeat your analyses later and get the same results.

Here's a plot of all the cells that have changed cluster affiliation between two runs of Rphenograph using different fixed seeds.

That's 2108 out of 5000 cells, or over 40%. The same thing happens when more cells are used.
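
Cluster labels are arbitrary, so the labels from one run have to be matched to the other before counting cells that moved. The following is a minimal sketch of one way to do that in base R, not necessarily how the plot above was made. The helper count.changed.cells is hypothetical, and it assumes a simple best-overlap match where each cluster in the first run is mapped to the cluster it shares the most cells with in the second run.

```r
# Hypothetical helper: count cells whose cluster changes between two runs.
# Cluster labels are arbitrary, so each cluster in run A is first mapped to
# the run B cluster it overlaps most (a simple match, not forced one-to-one).
count.changed.cells <- function( clusters.a, clusters.b ) {
  stopifnot( length( clusters.a ) == length( clusters.b ) )
  overlap <- table( clusters.a, clusters.b )
  # for each run A cluster, the label of the best-matching run B cluster
  best.match <- colnames( overlap )[ apply( overlap, 1, which.max ) ]
  names( best.match ) <- rownames( overlap )
  mapped.a <- best.match[ as.character( clusters.a ) ]
  sum( mapped.a != as.character( clusters.b ) )
}

# toy example: same grouping with swapped labels, except for the last cell
run.1 <- factor( c( 1, 1, 1, 2, 2, 2 ) )
run.2 <- factor( c( 2, 2, 2, 1, 1, 2 ) )
n.changed <- count.changed.cells( run.1, run.2 )
```

In the toy example only the last cell counts as changed, even though every label differs between the two runs. Because the match is not one-to-one, a cluster that splits in two will only count the cells in the smaller half as changed.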

FastPG

FastPG is a massive speed improvement over Rphenograph in R. It's a little more complicated to install, but the speed-up is pretty insane.

If you're using FastPG in R, there's another issue you should be aware of, which is a matter of reproducibility. Thanks here to Peter Leary for pointing out that FastPG didn't always give him the clusters in the same order, despite setting the same random seed. Exploring this, it turns out to be a bit more than that.

Here's what FastPG gives as the number of clusters for the exact same input across thirty runs.

Running FastPG 30 times gives variable clustering

So, we have somewhere between 16 and 26 clusters, apparently. We'd probably get a wider range with more runs. This likely has to do with the random seed being generated in C++ and not passed from R, but that's beyond me. The Phenograph algorithm is supposed to be deterministic, so in that respect, it's important to note that FastPG is explicitly "Phenograph-like".

Here's the same thing for Rphenograph:

Running Rphenograph 30 times gives reproducible results
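
The repeated-run check for both implementations can be sketched as below. This is a sketch under assumptions: count.clusters.per.run and toy.cluster.fn are hypothetical names, and kmeans is used purely as a stand-in so the example runs without extra packages. For a real test, swap in a small wrapper around Rphenograph or FastPG that returns one cluster label per cell.

```r
# Hypothetical helper: run a clustering function repeatedly on the same input
# and record how many clusters each run returns.
count.clusters.per.run <- function( cluster.fn, data, n.runs = 30, seed = 42 ) {
  vapply( seq_len( n.runs ), function( i ) {
    set.seed( seed )  # same seed before every run
    length( unique( cluster.fn( data ) ) )
  }, integer( 1 ) )
}

# stand-in clustering (kmeans) so the sketch runs in base R;
# replace with a wrapper that returns one label per cell
toy.cluster.fn <- function( data ) kmeans( data, centers = 3 )$cluster

set.seed( 1 )
toy.data <- matrix( rnorm( 600 ), ncol = 2 )
n.clusters <- count.clusters.per.run( toy.cluster.fn, toy.data, n.runs = 5 )
table( n.clusters )
```

An implementation that respects the R seed gives a single value in the table. A spread of values, as with FastPG above, means something other than the R seed is driving the result.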

Looking at the tSNE, here are two examples of FastPG doing its thing:

Yep, pretty different results.

One factor that I thought might play into this is how FastPG calculates the knn graph. It uses a very fast implementation with RcppHNSW::hnsw_knn() that gives an approximation with the default parameters used. What does this approximation mean? With knn, we're trying to rank cells (or points) by how similar they are, creating networks of neighbor relationships. With the approach FastPG uses, sometimes a given cell isn't the closest cell to itself, which makes no sense.

In testing this, however, it seems not to be the case. We can calculate the knn separately using any number of approaches, some of which allow more control over the algorithm. One option is to use the uwot package to generate a UMAP and return the knn from that. This cheat allows you to generate a UMAP embedding at the same time, it's reasonably fast, and you can control both the knn method and distance metrics (for example, using cosine distance instead of Euclidean, if you want).

umap.result <- uwot::umap( as.matrix( input.50k[ , ..clustering.cols ] ),
	n_neighbors = 30, 
	# optionally set metric = "cosine" or "euclidean" (default)
	n_epochs = 500,
	n_threads = n.threads,
	n_sgd_threads = n.threads,
	seed = 42,
	ret_nn = TRUE )

ind <- umap.result$nn$euclidean$idx

# Parallel Jaccard metric
links <- FastPG::rcpp_parallel_jce( ind )
links <- FastPG::dedup_links( links )

# Parallel Louvain clustering
clustering.result <- FastPG::parallel_louvain( links )
clusters <- factor( clustering.result$communities )
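
With a neighbor index matrix in hand, the "cell isn't its own closest neighbor" problem mentioned above can be checked directly. Below is a minimal base R sketch on toy data. The helpers exact.knn.idx and self.neighbor.fraction are hypothetical, and the check assumes the index matrix lists each cell itself in the first column, which is worth verifying for whichever knn method you use.

```r
# Exact knn on a small toy matrix using base R only (too slow for real data).
# Returns an n x k index matrix; with exact distances, column 1 is the cell itself.
exact.knn.idx <- function( x, k ) {
  d <- as.matrix( dist( x ) )
  t( apply( d, 1, function( d.row ) order( d.row )[ seq_len( k ) ] ) )
}

# Hypothetical helper: fraction of cells listed as their own first neighbor.
self.neighbor.fraction <- function( idx ) {
  mean( idx[ , 1 ] == seq_len( nrow( idx ) ) )
}

set.seed( 42 )
toy.data <- matrix( rnorm( 200 ), ncol = 4 )
toy.idx <- exact.knn.idx( toy.data, k = 5 )
self.frac <- self.neighbor.fraction( toy.idx )
```

For an exact knn without duplicated points, the fraction is 1. Running the same helper on an approximate index matrix would show how often a cell fails to come back as its own nearest neighbor.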

We still get variable numbers of clusters out. There don't seem to be any options to control the Louvain clustering from R in FastPG.

Running FastPG clustering, but using uwot to calculate the knn

And the clustering boundaries are still quite variable, even starting from the exact same knn calculation.

Run 1 of FastPG clustering using Euclidean distance knn calculation from uwot

Run 2 of FastPG clustering using Euclidean distance knn calculation from uwot

Switching to cosine distance in uwot doesn't help either.

Run 1 of FastPG clustering using cosine distance knn calculation from uwot

Run 2 of FastPG clustering using cosine distance knn calculation from uwot

The final thing I'm going to point out about Phenograph is that it tends to split groups of cells that are biologically similar, or at least it does this more than FlowSOM with consensus clustering. The big blob on the left-hand side of the tSNE is the naive conventional CD4 T cells. tSNE agrees pretty well with my opinion as an immunologist that these are a single group without big distinctions. Phenograph breaks this up into multiple clusters.

If you are going to use Phenograph, one way that it can be helpful is to get a sense of roughly how many clusters there might be in your data and then use this as a target for FlowSOM clustering. Or you can just go straight to FlowSOM.

The implementations of Phenograph used here are:

JinmiaoChenLab/Rphenograph: Rphenograph: R implementation of the PhenoGraph algorithm

sararselitsky/FastPG: Fast phenograph, CyTOF

The original publications are here:

FastPG: Fast clustering of millions of single cells | bioRxiv

Data-Driven Phenotypic Dissection of AML Reveals Progenitor-like Cells that Correlate with Prognosis: Cell

You might also be interested in: Cluster stability in the analysis of mass cytometry data - Melchiotti - 2017 - Cytometry Part A - Wiley Online Library

Colibri Cytometry
