
I am processing a long list of data from a file using bash. The file has over 300,000 lines, so using GNU parallel could cut processing time significantly.

In addition to the main data file, I am using a second, smaller file that contains data that will be used by each iteration of my code. This file contains approximately 60,000 lines, with each line containing two columns. My current strategy is to read each line of the smaller file and copy the data from the columns into two separate arrays. These two arrays are then used in each iteration of the code.

However, I cannot get GNU parallel to treat my arrays as actual arrays, despite following the code in "bash how to pass array as an argument to a function" and trying numerous permutations of it.

A simplified version of my code is below. So far, it only returns a bunch of blank lines. I would very much appreciate it if someone could explain exactly how to pass arrays to parallel.

SCAFF_LENGTH_FILE="${HOME}/ReferenceSequences/P.miniata.Scaffold.lengths.txt"
INPUT_VCF="${HOME}/data/HaplotypeCalling/variants_allOvarySamples.filtered.vcf"
declare -a array_scaffName
declare -a array_scaffLength
z=0
while read -a data LINE; do
    array_scaffName[$z]=${data[0]}
    array_scaffLength[$z]=${data[1]}
    z=$(( $z + 1 ))
done < ${SCAFF_LENGTH_FILE}

WORKING_DIR="${HOME}/*filepath*/codeTest"
TEMP_FILE_DIR="${WORKING_DIR}/TEMP_FILES"
cd $WORKING_DIR
function exon_parse {
    FILE_NUMBER=$1
    TEMP_FILE_DIR=$2
    INPUT_VCF=$3

    scaffName=$4[@]
    scaffName_array=("${!scaffName}")

    scaffLength=$5[@]
    scaffLength_array=("${!scaffLength}")

    echo ${scaffName_array[4]}
    echo ${scaffLength_array[4]}

    }
export -f exon_parse

seq 5 | parallel exon_parse {} $TEMP_FILE_DIR ${INPUT_VCF} array_scaffName array_scaffLength

NB: I use seq 5 because my main data file has been broken down into smaller sub-files to aid processing. I would ultimately like to develop a nested GNU parallel script that selects each sub-file in parallel, and then uses code like:

cat fileName | parallel 'processes' {} other_inputs

to process the lines of data within each sub-file in parallel.

gwilymh

1 Answer


Bash cannot export arrays to child processes, so the shells started by parallel never see them. The most obvious solution is to move the arrays inside the function:

INPUT_VCF="${HOME}/data/HaplotypeCalling/variants_allOvarySamples.filtered.vcf"

WORKING_DIR="${HOME}/*filepath*/codeTest"
TEMP_FILE_DIR="${WORKING_DIR}/TEMP_FILES"
cd $WORKING_DIR
function exon_parse {
    FILE_NUMBER=$1
    TEMP_FILE_DIR=$2
    INPUT_VCF=$3
    SCAFF_LENGTH_FILE="${HOME}/ReferenceSequences/P.miniata.Scaffold.lengths.txt"
    declare -a array_scaffName
    declare -a array_scaffLength
    z=0
    while read -r -a data; do
        array_scaffName[$z]=${data[0]}
        array_scaffLength[$z]=${data[1]}
        z=$(( z + 1 ))
    done < "${SCAFF_LENGTH_FILE}"

    # The arrays are built inside the function, so index them directly;
    # no indirection through argument names is needed.
    echo "${array_scaffName[4]}"
    echo "${array_scaffLength[4]}"

    }
export -f exon_parse

# The array names are no longer passed as arguments.
seq 5 | parallel exon_parse {} "$TEMP_FILE_DIR" "${INPUT_VCF}"

But you can import the array, too:

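# Build a function whose body re-declares the given arrays.
import_array () {
  local func=$1; shift;
  export $func='() {
    '"$(for arr in $@; do
          # Turn "declare -a" into "declare -ag" so the array is global
          # when the importer function runs.
          declare -p $arr|sed '1s/declare -./&g/'
        done)"'
  }'
}

declare -a indexed='([0]="one" [1]="two")'

import_array my_importer indexed

# Each job calls my_importer first, then reads the array.
parallel --env my_importer \
  'my_importer; echo "{}" "${indexed[{}]}"' ::: "${!indexed[@]}"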
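If importing a function through `--env` does not behave as expected on your setup, a simpler variant of the same idea may be worth trying: serialise the arrays into an ordinary exported string with `declare -p` and `eval` that string inside the exported function. The following is only a minimal sketch. The variable name `ARRAY_DEFS` and the sample data are assumptions made up for illustration, and a plain child `bash -c` stands in for a parallel job so that the example runs without parallel installed.

```shell
#!/usr/bin/env bash
# Hypothetical sample data standing in for the scaffold-length file.
array_scaffName=(scaffold_1 scaffold_2 scaffold_3)
array_scaffLength=(1000 2500 400)

# declare -p prints definitions that bash can re-evaluate. Exporting the
# resulting string makes it visible to every child shell, which plain
# arrays are not. ARRAY_DEFS is an illustrative name, not a bash built-in.
ARRAY_DEFS=$(declare -p array_scaffName array_scaffLength)
export ARRAY_DEFS

exon_parse() {
    # Rebuild both arrays inside the child shell; because declare runs
    # inside a function, the arrays are local to this call.
    eval "$ARRAY_DEFS"
    local i=$1
    echo "${array_scaffName[$i]} ${array_scaffLength[$i]}"
}
export -f exon_parse

# With GNU parallel installed, the jobs would be started with:
#   seq 0 2 | parallel -k exon_parse {}
# A plain child bash shows the same inheritance without needing parallel:
bash -c 'exon_parse 1'    # prints: scaffold_2 2500
```

One caveat: the whole serialised string travels through the environment, and operating systems cap the size of the environment (and often of a single variable). With roughly 60,000 entries the string may be too large, in which case loading the file inside the function is the safer route.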
Ole Tange
  • Thank you Ole. I have updated my code as follows: `import_array () { local func=$1; shift; export $func='() { '"$(for arr in $@; do declare -p $arr|sed '1s/declare -./&x/g' done)" } ' } declare -a array1='( "apples" "oranges" "pears" )' import_array my_importer array1 seq 0 2 | parallel --no-notice -j 1 --env my_importer 'my_importer && i={}; echo "$my_importer" ; echo "${i} ${array1[@]}" ' ` – gwilymh Jun 12 '15 at 20:56
  • I get the following output: `() { declare -ax array1='([0]="apples" [1]="oranges" [2]="pears")' } () { declare -ax array1='([0]="apples" [1]="oranges" [2]="pears")' } 0 () { declare -ax array1='([0]="apples" [1]="oranges" [2]="pears")' } 1 () { declare -ax array1='([0]="apples" [1]="oranges" [2]="pears")' } 2` It seems to me that the array is being parsed into the import_array function, but still cannot be accessed for some reason. Any suggestions anyone? – gwilymh Jun 12 '15 at 20:58