0

when I run this code:

sorted_chromosomes=($(printf "%s\n" "${chromosomes[@]}" | sort -u))


echo "chromosomes: ${chromosomes[@]}"
echo "sorted_chromosomes: ${sorted_chromosomes[@]}"


for chrom in "${sorted_chromosomes[@]}" ;

do
    echo "proxy: Running PLINK LD analysis for chromosome ${chrom}." > /dev/stderr

    local out_prefix="${SNAPTMP}/SNAP.${chrom}.proxy"

    plink --bfile "${PLINK_REF_PANEL}/${chrom}" \
        --r2 --ld-window 1500 --ld-window-r2 ${PLINK_MIN_R2} \
        --keep "${PLINK_REF_PANEL_KEEP}" \
        --out "${out_prefix}" \
        --ld-snp-list "${SNAPTMP}/SNAP.input.proxy" \
        > /dev/null #full path to plink wasl added

    retVal=$?
    if [ $retVal -eq 0 ]; then
        perl -pi -e "s/[ \t]+/ /g;" "${out_prefix}.ld"
        perl -pi -e "s/^[ \t]+//g;" "${out_prefix}.ld"
        perl -pi -e "s/[ \t]+$//g;" "${out_prefix}.ld"
    else
        echo "Error: PLINK analysis for chromosome ${chrom} failed." > /dev/stderr
    fi
done

My code works as expected and I get this output :

chromosomes: 1 6
sorted_chromosomes: 1 6
proxy: Running PLINK LD analysis for chromosome 1.
proxy: Running PLINK LD analysis for chromosome 6.

I am trying to parallelize the LD analysis with plink on each chromosome to make my script more efficient. I am trying to use the parallel command, which my server does indeed have installed.

[-----.-------@hydra1 SNAPPY]$ parallel --version
GNU parallel 20191022
Copyright (C) 2007-2019 Ole Tange and Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.

Web site: http://www.gnu.org/software/parallel

When using programs that use GNU Parallel to process data for publication
please cite as described in 'parallel --citation'.

for reference, I have 10 CPU's available:

[-----.-------@hydra1 SNAPPY]$ nproc
10

Here is the code I am using to try and parallelize my for loop:

sorted_chromosomes=($(printf "%s\n" "${chromosomes[@]}" | sort -u))


echo "chromosomes: ${chromosomes[@]}"
echo "sorted_chromosomes: ${sorted_chromosomes[@]}"


execute_plink(){

    echo "proxy: Running PLINK LD analysis for chromosome ${chrom}." > /dev/stderr
    local chrom="$1" 
    local out_prefix="${SNAPTMP}/SNAP.${chrom}.proxy"

    plink --bfile "${PLINK_REF_PANEL}/${chrom}" \
        --r2 --ld-window 1500 --ld-window-r2 ${PLINK_MIN_R2} \
        --keep "${PLINK_REF_PANEL_KEEP}" \
        --out "${out_prefix}" \
        --ld-snp-list "${SNAPTMP}/SNAP.input.proxy" \
        > /dev/null #full

    retVal=$?
    if [ $retVal -eq 0 ]; then
        perl -pi -e "s/[ \t]+/ /g;" "${out_prefix}.ld"
        perl -pi -e "s/^[ \t]+//g;" "${out_prefix}.ld"
        perl -pi -e "s/[ \t]+$//g;" "${out_prefix}.ld"
    else
        echo "Error: PLINK analysis for chromosome ${chrom} failed." > /dev/stderr
    fi
}

export -f execute_plink

# Run PLINK jobs for each chromosome simultaneously using GNU Parallel
parallel execute_plink ::: "${sorted_chromosomes[@]}"

this code now gives me this output:

chromosomes: 1 6
sorted_chromosomes: 1 6
Error: PLINK analysis for chromosome 1 failed.
Error: PLINK analysis for chromosome 6 failed.

I am really not sure what is going wrong here.

Martin Prikryl
  • 188,800
  • 56
  • 490
  • 992
pooch
  • 1
  • 1
  • Some general notes, not specific to parallel. `array=( $(anything) )` is buggy; avoid in favor of `readarray -t array < <(anything)` for the reasons given in [BashPitfalls #50](https://mywiki.wooledge.org/BashPitfalls#hosts.3D.28_.24.28aws_.2BICY.29_.29). – Charles Duffy Jun 26 '23 at 16:13
  • 1
    You can pass multiple `-e` arguments to one copy of `perl`; no reason to rewrite your file three times instead of performing all three edits in just one pass. – Charles Duffy Jun 26 '23 at 16:14
  • `foo; rc=$?; if [ $rc -eq 0 ]; then ...` is better written as `if foo; then ...` -- see [Why is testing `$?` to see if a command succeeded or not an antipattern?](https://stackoverflow.com/questions/36313216/why-is-testing-to-see-if-a-command-succeeded-or-not-an-anti-pattern) – Charles Duffy Jun 26 '23 at 16:16
  • None of the above is directly on-point for your question, though. Personally, I'd dodge the issue by not using GNU parallel at all and shifting to `xargs -P`; much simpler, far less going on behind your back, and consequently the things that can go wrong are more up-front and within your direct control – Charles Duffy Jun 26 '23 at 16:17
  • (and btw, think about using `>&2` instead of `/dev/stderr`; unless the version of bash you're using is one that does the conversion internally, the former is considerably more efficient insofar as it avoids an `open()` syscall). – Charles Duffy Jun 26 '23 at 16:19
  • ...oh -- and as always, `set -x` is your friend to get trace logging to stderr. In particular that'll let you compare the plink command line against a known-good one. – Charles Duffy Jun 26 '23 at 16:21
  • In the parallel situation you created a new function. Does this work fine when called in a `for-loop` like `for chrom in "${sorted_chromosomes[@]}" ; do execute_plink "${chrom}";done`? This way you should notice that `execute_plink()` is using `${chrom}` in the first `echo` before assigning it a value. – Walter A Jun 26 '23 at 20:49
  • You want to rule out that `plink` can not run parallel. Since you are testing with the chroms 1 and 6, try adding `sleep $1` as the first statement in `execute_plink()`. – Walter A Jun 26 '23 at 21:08
  • Maybe multiple `plink`s cannot run in parallel? Try `parallel -j1` – Ole Tange Jul 11 '23 at 08:04
  • You use '> /dev/null'. Could it be that `plink` tells you something, that you do not see? – Ole Tange Jul 11 '23 at 08:10

0 Answers0