Data Integration and Batch Correction

Learn about managing batch effects in single-cell data analysis. This guide covers best practices for integration, normalisation strategies, and using Harmony for effective batch correction.


Written by

Parashar Dhapola

Batch effects are one of the main challenges in analysing single-cell data. They are changes in expression levels that lead to unwanted grouping of cells. They can result from technical and protocol variation, from sample processing, and sometimes from biological variation between donors that is not of interest. Pre-processing steps such as normalization can reduce some of the differences between cells, such as sequencing depth. However, as single-cell technologies and data become more ubiquitous and accessible, integration will be crucial for analysing data across large numbers of complex datasets.

Here, we will provide a guide on best practices for data integration and an example of steps you can apply using the platform today.


Best practices when integrating multiple batches

1) Run the first analysis without any batch correction.

2) On the explore page, see how the batches separate, then run a differential gene expression analysis between the batches. Do the differentially expressed genes come from specific pathways? If yes, there might be a confounding biological difference between these batches.

3) You can then run a second analysis, this time blocklisting from the HVGs the genes that were differentially expressed between the batches (genes with a score > 0.75, or a lower cutoff if you have many batches).

4) Now evaluate the results on the explore page. Do you see a difference? If not, it is likely that the batch effect is deeply confounded and removing individual genes will not help.

5) Then run the analysis a third time, with Harmony. Choose the covariates that were most associated with the batch effects you observed on the explore page, and try to use as few as possible. Make sure that your choice of covariates does not confound your interpretation. For example, if you want to conclude that treatment made no difference, do not use treatment as a correction covariate for Harmony.

6) From here on, you can test different covariates and interpret your results accordingly.
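
The gene selection in step 3 can be sketched in Python, assuming the differential expression results have been saved as a mapping from gene name to score. The function name, the input format, and the example scores are hypothetical; only the 0.75 cutoff comes from the steps above.

```python
def genes_to_blocklist(de_scores, threshold=0.75):
    """Return genes whose batch-vs-batch DE score exceeds the threshold.

    de_scores: dict mapping gene name -> differential expression score
    (a hypothetical format; adapt it to however you save your results).
    """
    selected = [gene for gene, score in de_scores.items() if score > threshold]
    # Sort by descending score so the strongest batch markers come first.
    return sorted(selected, key=lambda gene: de_scores[gene], reverse=True)


# Made-up scores for illustration only.
example_scores = {"GENE_A": 0.92, "GENE_B": 0.81, "GENE_C": 0.40, "GENE_D": 0.75}
blocklist = genes_to_blocklist(example_scores)

# Print one gene per line so the list is easy to copy.
print("\n".join(blocklist))
```

With many batches, calling the function with a lower `threshold` selects more genes, in line with the note in step 3.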


Example: Batch correction steps on Nygen Analytics


Example steps:

Step 1. Run the first analysis without any batch correction and check for any batch effects.

Step 2. You can first check for technical batching if you have added metadata such as lane, experiment, assay, or any sample processing batches. Sometimes variation between donors or in biological sex creates a batch effect that is not of interest in your experiment.

Step 3. For this example, we find that data from different technologies (Seq-Well and 10x) are separated in the analysis.

Step 4. The first strategy is to try and block the genes that are causing this separation. First, we will run a differential gene expression analysis for the two batches we would like to integrate.

Step 5. Once the differential expression analysis has completed, we will start a new analysis.

Step 6. From the analysis pages, go to the Selecting HVGs step. We will use the custom blocklist to block the genes that were highly differentially expressed between the batches. 💡See more on using blocklists here

Step 7. Add the genes into the textbox, and run the analysis again.

Step 8. Once the second analysis is complete, check whether the batch effect has been corrected. If it has not been resolved, as in our example here, we can run a third analysis, this time incorporating Harmony.

Step 9. Start a new analysis and from the Additional parameters analysis page, select the covariate you would like to correct for. For this example, we will choose to correct for 'assay'. You may select one or more covariates to run a batch correction on the data. It is best to limit the number of covariates chosen to the biggest contributors to the batch effect.

Step 10. Once the analysis has completed, you can check for batch effects again and run the correction for other covariates if required.


Normalization method employed by the platform

Does the platform use a dedicated depth normalization strategy?

Yes and no. Behind the scenes, we use the tool Scarf to analyze single-cell datasets. Unlike many other tools, which use, for example, GLM-based or quantile-normalization-based strategies to account for varying depths, Scarf doesn't employ a dedicated depth correction step.

However, Scarf does perform a two-step normalization. The first step works as in Scanpy: the normalized value for a gene in a cell is the raw count of that gene in that cell divided by the total counts from that cell. The same step is then applied a second time, right before calculating PCA.

💡Read more on the Scarf paper here

Why do we perform a second normalization?

The second normalization step scales all the counts again so that the sum of normalized counts for HVGs is the same in all cells. Since this step depends on which genes are marked as HVGs, the values are calculated dynamically for each analysis run.

For example, suppose GAPDH was deeply sequenced in one batch of cells (batch 1) and shallowly sequenced in another (batch 2). After the first normalization, GAPDH takes up a larger share of each cell's total counts in batch 1, so the normalized counts of all other genes will generally be lower in batch 1 than in batch 2. Since GAPDH is unlikely to be among the HVGs, the second normalization, which rescales over the HVGs only, removes this difference.
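
The GAPDH example can be worked through with a small sketch. This is not Scarf's actual implementation. It is a simplified pure-Python illustration with made-up counts, the function names are hypothetical, and rescaling the HVGs to sum to 1 is an assumption (any constant shared by all cells would serve the same purpose). The two cells have identical counts for every gene except GAPDH.

```python
def normalize_by_total(counts):
    """First step: divide each gene's count by the cell's total count."""
    total = sum(counts.values())
    return {gene: value / total for gene, value in counts.items()}


def renormalize_on_hvgs(values, hvgs):
    """Second step: keep only the HVGs and rescale them to sum to 1 per cell."""
    hvg_total = sum(values[gene] for gene in hvgs)
    return {gene: values[gene] / hvg_total for gene in hvgs}


# Made-up raw counts: GAPDH is deeply sequenced in the batch 1 cell only.
cell_batch1 = {"GAPDH": 800, "HVG_A": 60, "HVG_B": 140}
cell_batch2 = {"GAPDH": 200, "HVG_A": 60, "HVG_B": 140}
hvgs = ["HVG_A", "HVG_B"]

step1_b1 = normalize_by_total(cell_batch1)
step1_b2 = normalize_by_total(cell_batch2)
# After step 1, HVG_A is lower in batch 1 because GAPDH inflates the total.
print(step1_b1["HVG_A"], step1_b2["HVG_A"])

step2_b1 = renormalize_on_hvgs(step1_b1, hvgs)
step2_b2 = renormalize_on_hvgs(step1_b2, hvgs)
# After step 2, both cells give approximately the same value for HVG_A.
print(step2_b1["HVG_A"], step2_b2["HVG_A"])
```

Because GAPDH is excluded from the HVG total, its sequencing depth no longer influences the values that go into PCA.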

Can I do a better HVG selection myself?

Absolutely. We have recently reworked the HVG selection step completely, giving you the ability to provide a fully custom list of genes as HVGs.

💡Learn more about selecting HVGs [link to Selecting HVGs page]

Does the double normalization remove the need for batch correction?

Not always. You might still have genes that are present in only one batch, and these will contribute to the batch effect. So it is not just the depth of sequencing that matters but also the diversity of genes captured in each batch. Hence, you might still need Harmony for batch correction.
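
One way to look for such genes is sketched below, assuming you can list the genes detected in each batch. The function name and the gene sets are hypothetical and made up for illustration.

```python
def batch_specific_genes(detected_by_batch):
    """Map each batch to the genes detected in that batch and in no other.

    detected_by_batch: dict mapping batch name -> set of detected gene names.
    """
    result = {}
    for batch, genes in detected_by_batch.items():
        # Collect every gene seen in any of the other batches.
        others = set()
        for other_batch, other_genes in detected_by_batch.items():
            if other_batch != batch:
                others |= other_genes
        result[batch] = genes - others
    return result


# Made-up detection sets for two assays.
detected = {
    "10x": {"GENE_A", "GENE_B", "GENE_C"},
    "Seq-Well": {"GENE_B", "GENE_C", "GENE_D"},
}
unique = batch_specific_genes(detected)
print(unique)
```

Genes that show up here are candidates to inspect before deciding between a blocklist and Harmony.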


Batch correction (Horizontal data integration) through Harmony

When you merge different samples together or have added cell annotations in your previous analysis, you will see the option to select covariates. By default, batch correction is not applied to your data.

You may select one or more covariates to run a batch correction on the data.

What does batch correction do?

Harmony is an optimization algorithm that operates on the PCA-reduced data to minimize the distances between cells that are attributable to the chosen covariates. Hence, when batch correction works, the resulting clusters will generally not separate by the chosen covariates.

💡 Read more on the Harmony paper here
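
A simple way to check whether clusters still separate by a covariate is to look at the batch composition of each cluster. The sketch below is not part of Harmony or the platform. It assumes you have one cluster label and one batch label per cell, and the function name and labels are hypothetical.

```python
from collections import Counter, defaultdict


def batch_fractions_per_cluster(cluster_labels, batch_labels):
    """Return, for each cluster, the fraction of its cells from each batch."""
    counts = defaultdict(Counter)
    for cluster, batch in zip(cluster_labels, batch_labels):
        counts[cluster][batch] += 1
    fractions = {}
    for cluster, batch_counts in counts.items():
        n_cells = sum(batch_counts.values())
        fractions[cluster] = {b: n / n_cells for b, n in batch_counts.items()}
    return fractions


# Made-up labels for eight cells: cluster 0 is mixed, cluster 1 is all 10x.
clusters = [0, 0, 0, 0, 1, 1, 1, 1]
batches = ["10x", "Seq-Well", "10x", "Seq-Well", "10x", "10x", "10x", "10x"]
mixing = batch_fractions_per_cluster(clusters, batches)
print(mixing)
```

A cluster dominated by one batch is not necessarily a failure of the correction, since a cell type may genuinely be present in only one batch, so treat the fractions as a prompt for closer inspection.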

Is there any risk or disadvantage of batch correction?

If your biological signal is related to the batch information, then you risk losing the underlying biological information.
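
Before correcting for a covariate, it can help to check whether it is confounded with the biology you care about. The sketch below flags batches whose cells all share a single condition. The function name and the labels are hypothetical, and this is only a rough check under the assumption that you have one batch label and one condition label per cell.

```python
from collections import defaultdict


def single_condition_batches(batch_labels, condition_labels):
    """Return a dict of batch -> condition for batches with only one condition."""
    conditions_per_batch = defaultdict(set)
    for batch, condition in zip(batch_labels, condition_labels):
        conditions_per_batch[batch].add(condition)
    return {
        batch: next(iter(conditions))
        for batch, conditions in conditions_per_batch.items()
        if len(conditions) == 1
    }


batches = ["run1", "run1", "run2", "run2"]

# Fully confounded design: each run contains only one treatment group.
confounded = single_condition_batches(
    batches, ["control", "control", "treated", "treated"]
)

# Balanced design: both treatment groups appear in every run.
balanced = single_condition_batches(
    batches, ["control", "treated", "control", "treated"]
)

print(confounded)
print(balanced)
```

In the confounded design, correcting for the run would also remove the treatment difference, which is the risk described above. In the balanced design, the result is empty and the run can be corrected with less risk to the treatment signal.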