2

I am trying to utilise the vcftools package to calculate weir and cockerham's fst. I would like to loop over two pairs of populations in the first instance and then loop these populations across all variants from the 1000 Genomes project: each chromosome contains a separate vcf file. For example, for pop1 vs pop2, for pop3 vs pop4 calculate fst for chromosomes 1-10. Each population file, for example, LWKfile contains a list of individuals that belong to this population.

I have attempted:

for population in LWK_GBR YRI_FIN; do

firstpop=$(echo $population | cut -d '_' -f1)
secondpop=$(echo $population | cut -d '_' -f2)

for filename in *.vcf.gz; do

vcftools --gzvcf ${filename} \
--weir-fst-pop /outdir/${firstpop}file \
--weir-fst-pop /outdir/${secondpop}file \
--out /out/${population}_${filename}

done

done  

However this does not loop through all the files and seems to get stuck on chromosome 10. Is there a more efficient way to perform this in bash as I am concerned the loop within loop will be too slow.

Timur Shtatland
  • 12,024
  • 2
  • 30
  • 47
DeanR
  • 21
  • 1
  • First, quote the variables. If a `filename` contains spaces `vcftools` will break. Write `"$filename"` instead of `$filename`. Apart from that, I don't see a problem. Especially the loops seem fine. If you keep getting errors maybe there's a problem with `vcftools` or the files. – Socowi Dec 31 '20 at 13:14

1 Answers1

0

However this does not loop through all the files and seems to get stuck on chromosome 10.

I am concerned the loop within loop will be too slow.

Are you sure that it is the for filename in *.vcf.gz which is too slow to loop over all files?

Try to put an echo before vcftools to see if it remain stuck or not.

You need to be sure on what takes too much time in order to be able to make the right choice.

For example if it's vcftools maybe you don't need to wait the end of this command and think about doing some asynchronous treatments.

If there is too much file for one loop, you should also consider make some parallel treatments.

Also, you seems to repeat the loop over all .vcf.gz files twice. It will be probably a little bit more quick to reverse your two loops.

Here is an example with parallel and async treatments using bash:

#!/bin/bash

MAX_PARALLEL_PIDS=4 # adjust regarding your own machin capacity (cpu available, etc... it could be dynamically calculated)

declare -a POPS
declare -a PIDS

POPS=("LWK_GBR" "YRI_FIN")

# your heavy treatment in a function
process() {
  pop="${1}"
  filename="${2}"
  firstpop="${pop%%_*}" # no need to call an external program here
  secondpop="${pop#*_}" # same here

  vcftools --gzvcf "${filename}" \
     --weir-fst-pop "/outdir/${firstpop}file" \
     --weir-fst-pop "/outdir/${secondpop}file" \
     --out "/out/${pop}_${filename}"
}

# a function which is usefull to wait all process when your "thread pool" reached its limits
wait_for_pids() {
  for pid in "${PIDS[@]}"; do
    [[ $pid =~ ^[0-9]+ ]] && wait $pid
  done
  unset PIDS
}

i=0
for filename in *.vcf.gz; do
 if [[ $i -ge $MAX_PARALLEL_PIDS ]]; then
   i=0
   wait_for_pids
 fi

 for population in "${POPS[@]}"; do
   process "${population}" "${filename}" & # You won't wait for the end here
   PIDS[$i]=$!
   (( i++ ))
 done
done

# at the end wait for the remaining processes
wait_for_pids

N.B: Putting aside the variables inside a [[ conditions, you should pay attention on quoting your variables that can contain some spaces, especially the files names for example. It will break otherwise.

Idriss Neumann
  • 3,760
  • 2
  • 23
  • 32