
This bash script is meant to be part of a pipeline that processes zipped .vcf files containing genomes from multiple patients (which means the files are huge even when zipped, around 3-5 GB).

My problem is that I keep running out of memory when running this script. It is being run on a GCP high-memory VM.

I am hoping there is a way to optimize the memory usage so that the script doesn't fail. I have looked into it but found nothing.

#!/bin/bash

for filename in ./*.vcf.gz; do
    [ -e "$filename" ] || continue 
    name=${filename##*/}
    base=${name%.vcf.gz}
    bcftools query -l "$filename" >> ${base}_list.txt
    for line in `cat ${base}_list.txt`; do 
        bcftools view -s "$line" "$filename" -o ${line}.vcf.gz
        gzip ${line}.vcf 
    done
done
NHellmann
  • Can you point out at which line you are running out of memory? Is it `bcftools query`/`view`, `gzip`, or ```for line in `cat ...` ```? – Socowi Jan 21 '21 at 22:05
  • Unrelated to the actual problem: `[ -e "$filename" ]` is most likely useless. `*.vcf.gz` lists only files and directories that exist, and `-e` checks if `$filename` exists. If you want to ensure that `$filename` is a file and not a directory, then use `-f`. – Socowi Jan 21 '21 at 22:28
  • @Socowi "`*.vcf.gz` lists only files and directories that exist" => not if the pattern doesn't match any file (unless `shopt -s nullglob`). In that case, you will get `filename=*.vcf.gz`, and this is the reason for the `-e` test IMO. If `filename` should point to a file, then the `-f` test would be better, though. – xhienne Jan 22 '21 at 00:26

2 Answers


If you run out of memory when using `bcftools query`/`view` or `gzip`, look for options in the manual that might reduce the memory footprint. In the case of `gzip`, you might also switch to an alternative implementation. You could even consider switching the compression algorithm altogether (zstd is pretty good).
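For example, `bcftools view` can write bgzip-compressed output directly with `-O z`, which streams the data and removes the need for a separate `gzip` step, and swapping in `zstd` is just a matter of piping. A minimal sketch (the sample name `SAMPLE1` and file names are placeholders):

# let bcftools compress on the fly; -O z writes bgzipped VCF, so no gzip step is needed
bcftools view -s SAMPLE1 input.vcf.gz -O z -o SAMPLE1.vcf.gz

# or stream through zstd instead of gzip (assumes the zstd CLI is installed)
bcftools view -s SAMPLE1 input.vcf.gz | zstd > SAMPLE1.vcf.zst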

However, I have a feeling that the problem could be ``for line in `cat ${base}_list.txt` ``. The whole file `..._list.txt` is loaded into memory before the loop even starts. Also, reading lines that way has all kinds of problems, like splitting lines at whitespace, expanding globs like `*`, and so on. Use this instead (note the output name: `bcftools view` writes plain VCF by default, so write `$line.vcf` and let `gzip` produce `$line.vcf.gz`):

while read -r line; do
    # bcftools view writes uncompressed VCF by default; write "$line.vcf", then compress it
    bcftools view -s "$line" "$filename" -o "$line.vcf"
    gzip "$line.vcf"
done < "${base}_list.txt"

By the way: are you sure you want `bcftools query -l "$filename" >> ${base}_list.txt` to append? The file `${base}_list.txt` will keep growing each time the script is executed. Consider overwriting the file using `>` instead of `>>`.
However, in that case you might not need the file at all, as you could use this instead:

bcftools query -l "$filename" |
while read -r line; do
    bcftools view -s "$line" "$filename" -o "$line.vcf"
    gzip "$line.vcf"
done
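Note that because the `while` loop is now on the right-hand side of a pipe, bash runs it in a subshell. That is harmless here, since nothing set inside the loop is needed afterwards, but it is worth keeping in mind if you later add counters or other state inside the loop.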
Socowi

You can try using `split` on each file (into constant-size chunks) and then gzip each of the splits.

https://man7.org/linux/man-pages/man1/split.1.html
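A minimal sketch of that idea, assuming GNU `split` and an already-decompressed `input.vcf` (the 1G chunk size and the `chunk_` prefix are arbitrary illustrations):

# split into fixed-size 1 GB pieces named chunk_aa, chunk_ab, ...
split -b 1G input.vcf chunk_
# compress each piece separately so no single gzip invocation sees the whole file
for part in chunk_*; do
    gzip "$part"
done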