HVG selection

Learn how to select Highly Variable Genes (HVGs) in single-cell data, set corrected variance thresholds, and use custom blocklists to refine your analysis while ensuring meaningful cell heterogeneity.

Selecting HVGs

In single-cell data, as in most other count data, the mean expression of a feature (gene) is correlated with its variance. This effect is called the mean-variance trend. The corrected variance of each gene is obtained by removing this trend, so that a gene's variance is no longer driven mainly by its mean expression.
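As a rough illustration of the idea above, the sketch below fits a polynomial to log-variance against log-mean and treats the ratio of observed to fitted variance as the corrected variance. The function name `corrected_variance`, the polynomial fit, and the toy data are all assumptions made for this example; the trend-fitting method the tool actually uses is not described here and may differ.

```python
import numpy as np

def corrected_variance(counts, degree=2):
    """Return per-gene mean, variance and trend-corrected variance.

    counts: 2-D array of shape (cells, genes).
    NOTE: the polynomial trend is an assumption for illustration only.
    """
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean(axis=0)
    var = counts.var(axis=0, ddof=1)

    # Logs are only defined for genes with positive mean and variance.
    ok = (mean > 0) & (var > 0)
    log_mean, log_var = np.log10(mean[ok]), np.log10(var[ok])

    # Fit the mean-variance trend in log-log space.
    trend = np.polyval(np.polyfit(log_mean, log_var, deg=degree), log_mean)

    # Corrected variance = observed variance / variance expected from the trend.
    corrected = np.full(mean.shape, np.nan)
    corrected[ok] = 10.0 ** (log_var - trend)
    return mean, var, corrected

# Toy data: 500 cells x 50 genes of Poisson counts, where gene 10 is
# switched off in half of the cells and so varies more than its mean predicts.
rng = np.random.default_rng(0)
gene_means = np.linspace(0.5, 20.0, 50)
counts = rng.poisson(gene_means, size=(500, 50))
counts[:250, 10] = 0
counts[250:, 10] = rng.poisson(9.0, size=250)

mean, var, corrected = corrected_variance(counts)
top_gene = int(np.nanargmax(corrected))  # index of the most variable gene after correction
print(top_gene)
```

In this toy setup the gene that is expressed in only half of the cells stands out after correction, even though its raw variance is lower than that of genes with a higher mean.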

Parameters

Each point in this plot is a gene. The corrected variance is on the Y-axis (more info in the next step) and the mean expression of the gene is on the X-axis. Both axes are on a log scale.

The shaded region in the plot indicates excluded genes. The colour of each point shows the number of cells in which that gene is present.

💡 You can hover over individual data points on the plot to see the name of the gene.

The mean expression of a gene indicates how highly it is expressed across the cells in which it is present. In practice, ubiquitously expressed genes such as GAPDH, MALAT1, and some ribosomal genes often also have a high mean expression, and it can be useful to exclude them from the HVG list.

To exclude genes with a high mean expression, drag the right handle of the slider below to the left.

The corrected variance parameter sets a lower and an upper threshold; genes whose corrected variance falls between these bounds are regarded as HVGs (provided they also meet the other constraints).
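The way these constraints combine can be sketched as a simple boolean filter. This is a minimal sketch, not the tool's implementation: the function name `select_hvgs`, the parameter names, the inclusive bounds, and the toy gene values are all assumptions made for illustration.

```python
import numpy as np

def select_hvgs(mean_expr, corrected_var, n_cells_present, *,
                max_mean, min_corrected_var, max_corrected_var, min_cells):
    """Boolean mask of genes passing every threshold (bounds inclusive by assumption)."""
    mean_expr = np.asarray(mean_expr, dtype=float)
    corrected_var = np.asarray(corrected_var, dtype=float)
    n_cells_present = np.asarray(n_cells_present)
    return (
        (mean_expr <= max_mean)                   # drop very highly expressed genes
        & (corrected_var >= min_corrected_var)    # lower corrected-variance bound
        & (corrected_var <= max_corrected_var)    # upper corrected-variance bound
        & (n_cells_present >= min_cells)          # drop genes seen in too few cells
    )

# Hypothetical per-gene summary values.
genes = np.array(["geneA", "geneB", "geneC", "geneD"])
mean_expr = [0.4, 35.0, 1.2, 0.9]
corrected_var = [2.5, 3.0, 0.8, 4.0]
n_cells_present = [300, 9800, 1200, 12]

mask = select_hvgs(mean_expr, corrected_var, n_cells_present,
                   max_mean=10.0, min_corrected_var=1.5,
                   max_corrected_var=50.0, min_cells=50)
hvgs = genes[mask].tolist()
print(hvgs)
```

Here `geneB` fails the mean-expression cut, `geneC` falls below the lower corrected-variance bound, and `geneD` is present in too few cells, so only `geneA` is kept.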

❗Critical parameter
It is essential to exclude genes that are present in very few cells from the HVG list; otherwise, we risk including genes with noisy expression.

How to best set this value?

A good rule of thumb is to set this value to about half the size of the smallest expected cluster in the data. For example, if you have a dataset with 10,000 cells and your rarest cell population of interest is present at 1% abundance, i.e. 100 cells, you should set this value to 50. Higher values, for example above 100, risk dropping all the genes specific to that population from the HVG list, in which case that population will not be distinctly captured in later analysis.
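The arithmetic behind this rule of thumb is small enough to write down. The helper name `suggested_min_cells` is hypothetical and only restates the rule above; it is not a function of the tool.

```python
def suggested_min_cells(n_cells, rarest_fraction):
    """Half the expected size of the rarest population of interest, at least 1."""
    smallest_cluster = n_cells * rarest_fraction
    return max(1, round(smallest_cluster / 2))

# The example from the text: 10,000 cells, rarest population at 1% abundance.
min_cells = suggested_min_cells(10_000, 0.01)
print(min_cells)  # 50
```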

Find genes manually

You can search for and label specific genes on the HVG plot. The label will show the gene symbol and indicate whether the gene is an HVG.

💡 Click and drag the labels on the plot to reposition them.

Block list
🦄 Unique feature

You can manually block genes from being selected as HVGs. This override lets you remove genes that might contribute to uninteresting sources of variation, such as cell cycle effects.
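Conceptually, blocking is a filter applied to the candidate HVGs. The sketch below assumes exact, case-sensitive symbol matching; the function name `apply_blocklist` and the short gene lists are illustrative only.

```python
def apply_blocklist(candidate_hvgs, blocklist):
    """Return the candidate HVGs that are not on the blocklist, keeping their order."""
    blocked = set(blocklist)
    return [gene for gene in candidate_hvgs if gene not in blocked]

candidates = ["CD3E", "MKI67", "MS4A1", "TOP2A"]
cell_cycle_blocklist = ["MKI67", "TOP2A"]  # proliferation markers, used here as an example
hvgs = apply_blocklist(candidates, cell_cycle_blocklist)
print(hvgs)
```

Blocked genes are only dropped from the HVG list; nothing in this step touches the underlying expression data.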

Search for genes to block

By default, we blocklist several histone, ribosomal, mitochondrial, and HLA genes. We often also find it helpful to block genes from sex chromosomes, which might otherwise cause cells to segregate by sample of origin. This default list was designed mainly for human and mouse datasets, so if you have data from other species, you can provide a custom list of genes.

Custom Blocklist

Here you can provide your own list of genes to be blocked. In a custom list, gene symbols can be separated by commas, spaces, or new lines.
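If you prepare the list in a script first, the accepted separators can be handled with a single regular expression. This is a sketch under assumptions: `parse_gene_list` is a hypothetical helper, and dropping repeated symbols is a choice made here, not documented behaviour of the tool.

```python
import re

def parse_gene_list(text):
    """Split on commas and/or whitespace; drop empty entries and repeated symbols."""
    genes, seen = [], set()
    for token in re.split(r"[,\s]+", text.strip()):
        if token and token not in seen:
            seen.add(token)
            genes.append(token)
    return genes

raw = "MT-CO1, MT-ND1 RPS6\nRPL13A,MT-CO1"
genes = parse_gene_list(raw)
print(len(genes), genes)
```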

❗️ When you use a custom blocklist, the default blocklist is ignored, so its genes are no longer blocked!

💡 Tips

- To use the default blocklist together with your own genes, copy the default blocklist from the Search for gene groups to block option into a file or into the textbox in the Custom blocklist option, then add your own genes to it. You can mix the accepted separators, but check that the number of genes in the final list adds up correctly.
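The count check in the tip above can be done in a few lines if you assemble the list outside the tool. The helper `merge_blocklists` is hypothetical, and `default_genes` is a short stand-in for the list you copy from the tool, not the actual default blocklist.

```python
def merge_blocklists(default_genes, custom_genes):
    """Combine two gene lists, keeping first occurrences, and report how many repeats were dropped."""
    merged = list(dict.fromkeys([*default_genes, *custom_genes]))
    n_duplicates = len(default_genes) + len(custom_genes) - len(merged)
    return merged, n_duplicates

default_genes = ["MT-CO1", "RPS6", "HLA-A"]   # stand-in for the copied default list
custom_genes = ["XIST", "RPS6"]               # your additions

merged, n_duplicates = merge_blocklists(default_genes, custom_genes)
text_for_textbox = ", ".join(merged)  # comma-separated, one of the accepted formats
print(len(merged), n_duplicates)
print(text_for_textbox)
```

If the final count is lower than the sum of the two lists, the difference should equal the number of repeated symbols; any other mismatch suggests a separator problem.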

Providing a custom list of HVGs

💡 When working with genes from different species, e.g. mapping mouse genes to human genes, it's important to identify the proper orthologous genes (genes that share ancestry and typically serve similar functions across species).

Matching genes by symbol alone is not always reliable, because genes with identical symbols in different species are not necessarily equivalent, and true orthologs can have different symbols. Instead, use established databases to find true orthologs between species.
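A small sketch of the difference between symbol matching and a table lookup follows. The three pairs in `MOUSE_TO_HUMAN` are well-known one-to-one orthologs used purely as an illustration; a real table would come from an orthology database export. The helper `map_orthologs` and the placeholder symbol `MouseOnlyGene1` are hypothetical, and one-to-many orthologs are ignored here.

```python
# Tiny illustrative mouse -> human ortholog table (not a complete resource).
MOUSE_TO_HUMAN = {"Trp53": "TP53", "Cd8a": "CD8A", "Ptprc": "PTPRC"}

def map_orthologs(genes, ortholog_table):
    """Return ({source gene: ortholog}, [genes without an entry in the table])."""
    mapped, unmapped = {}, []
    for gene in genes:
        if gene in ortholog_table:
            mapped[gene] = ortholog_table[gene]
        else:
            unmapped.append(gene)
    return mapped, unmapped

mouse_hvgs = ["Trp53", "Cd8a", "MouseOnlyGene1"]  # last entry is a made-up placeholder

# Naive approach: upper-case the mouse symbol and hope it matches.
naive = [gene.upper() for gene in mouse_hvgs]

mapped, unmapped = map_orthologs(mouse_hvgs, MOUSE_TO_HUMAN)
print(naive)
print(mapped, unmapped)
```

Upper-casing turns `Trp53` into `TRP53`, which is not the human symbol `TP53`, whereas the table lookup returns the correct gene and makes unmapped genes explicit.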

💡 Best practices for selecting HVGs

How many genes should you select as highly variable genes?

There is no hard and fast rule about how many genes to select as HVGs. However, there are some good general guidelines to follow:

- Your results should be reasonably robust to the number of HVGs selected. For example, if you chose 500 HVGs, you should get broadly similar results when selecting anywhere between 300 and 700 HVGs. In other words, your results should not be 'over-fitted' to one particular HVG count.

- Generally speaking, selecting a higher number of HVGs can improve the identification of nested cell types and cell states and, overall, reveal deeper heterogeneity in the data. The trade-off is that you also risk adding noise, so you might start seeing less well-defined clusters and small spurious ones.

- Anecdotally, between 500 and 2,000 HVGs are usually sufficient to capture broad heterogeneity in the data. Deeper, nested heterogeneity can be explored further by subsetting the data [link to tutorial coming soon].
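One way to put a number on "broadly similar results" from the first guideline is to compare the cluster labels obtained at different HVG counts. The sketch below uses the plain Rand index (the fraction of cell pairs on which two clusterings agree); this metric is a choice made for illustration, not something the tool prescribes, and the label vectors are hypothetical.

```python
import numpy as np

def rand_index(labels_a, labels_b):
    """Fraction of cell pairs on which two clusterings agree (1.0 = identical partitions).

    Builds n x n comparison matrices, so use it on small datasets or a
    random subsample of cells.
    """
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    if a.shape != b.shape:
        raise ValueError("label vectors must have the same length")
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    upper = np.triu_indices(a.size, k=1)  # count each unordered pair once
    return float(np.mean(same_a[upper] == same_b[upper]))

# Hypothetical cluster labels for the same six cells at three HVG counts.
labels_by_n_hvgs = {
    300: [0, 0, 0, 1, 1, 1],
    500: [1, 1, 1, 0, 0, 0],   # same partition, cluster ids swapped
    700: [0, 0, 1, 1, 1, 1],   # one cell assigned differently
}
reference = labels_by_n_hvgs[500]
agreement = {n: rand_index(reference, labels) for n, labels in labels_by_n_hvgs.items()}
print(agreement)
```

Values close to 1 across the range of HVG counts suggest the clustering does not hinge on one particular setting. The index ignores cluster ids, so relabelled but otherwise identical clusterings score 1.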

Are genes that are not selected as HVGs removed from the data?

No. These genes are simply not included in the HVG list, which means that they will not influence the clustering and embedding (UMAP/tSNE) of the data. However, they can always be visualized over the final results and can often be detected as markers for clusters in later steps.

Are genes present in the block list removed entirely from the data?

No. They are only excluded from the HVG list; they remain in the data.

Yi Su

Bioinformatician