I am using "samtools calmd" to add MD tag back to BAM file. The size of original BAM is around 50Gb (whole genome sequence by using pacbio HIFI reads). The issue that I encountered is that the speed of "calmd" is incredibly slow! The jobs have already run 12 hours, and only 600MB BAM with MD tag are generated. In this way, 50GB BAM will take 30days to be finished!
Here is the code I used to add MD tag (very normal):
rule addMDTag:
input:
rules.pbmm2_alignment.output
output:
strBAMDir + "/pbmm2/v37/{wcReadsType}/Tmp/rawReads{readsIndex}.MD.bam"
params:
ref = strRef
threads:
16
log:
strBAMDir + "/pbmm2/v37/{wcReadsType}/Log/rawReads{readsIndex}.MD.log"
benchmark:
strBAMDir + "/pbmm2/v37/{wcReadsType}/Benchmark/rawReads{readsIndex}.MD.benchmark.txt"
shell:
"samtools calmd -@ {threads} {input} {params.ref} -bAr > {output}"
The version of samtools I used is v1.10.
BTW, I use 16 cores to run calmd, however, it looks like the samtools is still using 1 core to run it:
top - 11:44:53 up 47 days, 20:35, 1 user, load average: 2.00, 2.01, 2.00
Tasks: 1723 total, 3 running, 1720 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.8%us, 0.3%sy, 0.0%ni, 96.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 529329180k total, 232414724k used, 296914456k free, 84016k buffers
Swap: 12582908k total, 74884k used, 12508024k free, 227912476k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
93137 lix33 20 0 954m 151m 2180 R 100.2 0.0 659:04.13 samtools
May I know how to make calmd be much faster? Or is there any other tool that can do the same job more efficiently?
Thanks so much