Correcting RNA-seq dataset for known batch effect

Question

I'm analyzing an RNA-seq dataset in which a human cell line has been exposed to multiple chemical compounds at multiple doses. During QC I noticed a batch effect due to the different plates on which the cells were treated (not a strong one, but I would like to account for it). I tried both ComBat and limma's removeBatchEffect to see whether either method was better at removing the batch effect, but as you can see from the PCAs on the control samples at each normalization step (raw data, vst, ComBat and limma), both methods seem to increase the batch separation.

PCAs on control samples at the different normalization steps to highlight batch effect:

My feeling is that I may have made a mistake when specifying the arguments of the functions, but I arrived at this code after looking at similar questions on StackOverflow. The code I used to produce the normalized datasets on which I ran the PCAs is:

raw data: raw <- counts(dds, normalized = TRUE)

vst: vst_counts <- vst(dds, blind = TRUE)

ComBat: com <- ComBat(dat = assay(vst_counts), batch = hash$Plate, mod = model.matrix(~ 1, data = hash))

Limma: lim <- removeBatchEffect(assay(vst_counts), batch = hash$Plate, design = model.matrix(~ hash$group))

Raw data and vst data are obtained from DESeq. The hash object is my metadata file containing information about the plates (batch) and the treatment conditions (group).
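To make the intent of the two correction calls concrete, here is a minimal base R sketch of the general idea behind a linear-model batch correction that protects the treatment effect: fit batch and group together, then subtract only the fitted batch term. This is an illustration on simulated data, not the internals of either package; the objects `meta`, `mat` and `plate_shift` are hypothetical stand-ins for `hash` and `assay(vst_counts)`. One thing worth noting about the calls above is that the ComBat call uses an intercept-only `mod`, so `group` is not passed to it, whereas the removeBatchEffect call does receive `group` through `design`.

```r
# Simulated stand-in data: 200 genes, 12 samples, 2 plates x 2 groups (balanced)
set.seed(1)
n_genes <- 200
meta <- data.frame(
  Plate = factor(rep(c("P1", "P2"), each = 6)),
  group = factor(rep(c("ctrl", "treated"), times = 6))
)
mat <- matrix(rnorm(n_genes * nrow(meta)), nrow = n_genes)

# Add a gene-specific shift to all samples on plate P2 (the simulated batch effect)
plate_shift <- rnorm(n_genes, sd = 2)
mat[, meta$Plate == "P2"] <- mat[, meta$Plate == "P2"] + plate_shift

# Design for the effect to keep (group) and sum-to-zero coding for the batch
design <- model.matrix(~ group, data = meta)
batch_design <- model.matrix(~ Plate, data = meta,
                             contrasts.arg = list(Plate = "contr.sum"))[, -1, drop = FALSE]

# Fit both terms jointly, one linear model per gene (samples in rows for lm.fit)
fit <- lm.fit(cbind(design, batch_design), t(mat))
beta_batch <- fit$coefficients[-seq_len(ncol(design)), , drop = FALSE]

# Subtract only the fitted batch component; the group effect is left in place
corrected <- mat - t(batch_design %*% beta_batch)

# Per-gene difference between plate means, before and after correction
plate_gap <- function(m) {
  rowMeans(m[, meta$Plate == "P1"]) - rowMeans(m[, meta$Plate == "P2"])
}
summary(abs(plate_gap(mat)))
summary(abs(plate_gap(corrected)))
```

In this balanced toy design the plate gap after correction is numerically zero. If a real corrected matrix shows a larger plate separation than the input, comparing such a per-gene summary before and after correction may help tell a correction problem from a plotting problem.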

The code for running the PCA (for a single dataset) was:

ggplot(df_lim) + geom_point(aes(x=PC1, y=PC2, color = Plate),size=3) +
xlab(paste("PC1 (",summary(pca_lim)$importance[2,1]*100,"%)")) +
ylab(paste("PC2 (",summary(pca_lim)$importance[2,2]*100,"%)")) +theme_bw() + 
coord_fixed()
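The plotting code above relies on `pca_lim` and `df_lim`, whose construction is not shown in the post. A minimal sketch of one common way to build them with base R's `prcomp` follows; this is an assumption about how they might be created, and the small random matrix and metadata here are hypothetical stand-ins for the real `lim` and `hash` objects.

```r
set.seed(2)
# Hypothetical stand-in for a corrected expression matrix (genes x samples)
lim <- matrix(rnorm(100 * 8), nrow = 100,
              dimnames = list(paste0("gene", 1:100), paste0("s", 1:8)))
# Hypothetical metadata, with sample names as row names
hash <- data.frame(Plate = factor(rep(c("P1", "P2"), each = 4)),
                   row.names = colnames(lim))

# prcomp expects samples in rows, so the matrix is transposed
pca_lim <- prcomp(t(lim))

# Look up the plate by sample name rather than by position
df_lim <- data.frame(pca_lim$x[, 1:2],
                     Plate = hash[rownames(pca_lim$x), "Plate"])

# Percent variance explained by PC1 and PC2, rounded for axis labels
pct <- round(summary(pca_lim)$importance[2, 1:2] * 100, 1)
```

Looking up `Plate` by sample name means the colours in the plot do not depend on the metadata rows being in the same order as the matrix columns.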

Any help in addressing the issue here is highly appreciated.

Batch-file tag is all about starting programs and copying files on Windows. — , Apr 29 '20 at 08:07
It was meant to be just "batch-effect" but apparently that tag does not exist so it turned it into "Batch-file" by itself. I have edited that. Thanks for pointing it out. — Bithorax, Apr 29 '20 at 08:13
might be more suitable for https://bioinformatics.stackexchange.com/ — StupidWolf, Apr 29 '20 at 11:17
Hi @StupidWolf. Thanks for your comment. Yes I'm aware of that. I was not expecting to see batch correction with vst. What I would not expect to see is a greater separation of the batches when running either combat or limma on the vst normalized data. — Bithorax, Apr 29 '20 at 12:46
It's a bit weird. the code looks ok, at least for limma. Do you have a small subset of dataset to share? Otherwise it's hard to know what went wrong — StupidWolf, Apr 29 '20 at 13:28
Most likely you have checked this, but the other possibility is that the plate column and the group column are swapped in the plotting, because I cannot tell how the PCA is plotted. This would make sense because you would then actually have regressed out the group effect, leaving behind the plate effect — StupidWolf, Apr 29 '20 at 13:33
Thanks for your support. Unfortunately, I'm not in a position to share the data. However, I don't think the columns were swapped (see the code, I have updated the post). Are you suggesting that the design in the removeBatchEffect function should be: design=model.matrix(~hash$Plate)? — Bithorax, Apr 29 '20 at 13:54
No, you got it correct. What I meant is that, from what you see in the PCA, the effect that remains is the plate. But it should have been regressed out by the batch correction. So is it plotted wrongly? — StupidWolf, Apr 29 '20 at 14:04
I see your point. I can better investigate whether something went wrong while plotting even though I don't think so. Thanks again for your support. — Bithorax, Apr 29 '20 at 14:15
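One way to follow up on the plotting concern raised in the comments is to check that the metadata rows are in the same order as the columns of the expression matrix, since both correction functions and the plot match samples by position. The sketch below is only a suggestion on hypothetical toy objects (`mat`, `meta`), and it assumes that sample names are available as the column names of the matrix and the row names of the metadata.

```r
# Toy expression matrix with named sample columns (hypothetical)
mat <- matrix(0, nrow = 3, ncol = 4,
              dimnames = list(NULL, c("s1", "s2", "s3", "s4")))

# Toy metadata whose rows are deliberately in a different order
meta <- data.frame(Plate = c("P1", "P1", "P2", "P2"),
                   row.names = c("s3", "s1", "s4", "s2"))

# FALSE here means positional matching would assign plates to the wrong samples
same_order <- identical(colnames(mat), rownames(meta))

# Reorder the metadata by sample name so it lines up with the matrix columns
meta_aligned <- meta[colnames(mat), , drop = FALSE]
stopifnot(identical(rownames(meta_aligned), colnames(mat)))
```

If the equivalent check on the real objects already passes, sample ordering can be ruled out as the cause.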
