I have two files:

temp_bandstructure.dat has the following format

# spin    band          kx          ky          kz          E(MF)          E(QP)        Delta E kn  E(MF)5dp
#                        (Cartesian coordinates)             (eV)           (eV)           (eV)     (eV)
     1      22     0.00000     0.00000     0.00000   -3.021665798   -4.022414204   -1.000748406 1   -3.02167
     1      22     0.00850     0.00000     0.00000   -3.026245712   -4.027334803   -1.001089091 2   -3.02625
     1      22     0.01699     0.00000     0.00000   -3.039924052   -4.061680485   -1.021756433 3   -3.03992
9000 more rows

mf_pband.dat has 46 header rows followed by the following

  1     0.00000   -55.55593   0.998   0.000 ...20 more columns
9000 more rows

I have a nested for loop that compares columns 1 and 3 of every row in mf_pband.dat against columns 9 and 10 of every row in temp_bandstructure.dat. If the first pair is equal and the second pair agrees to within 0.00001, the script appends the entire row of mf_pband.dat to a cache file.

I wrote a working for loop that gets the job done, but at a very slow pace:

kmax=207
bandmin=$(cat bandstructure.dat | awk 'NR==3''{ print$2 }')
bandmax=$(tac bandstructure.dat | awk 'NR==1''{ print$2 }')
nband=$(($bandmax-$bandmin+1))
nheader=46


for ((i=3;i<=$(($kmax*$nband+2)); i++)); do
    kn=$(awk -v i=$i 'NR==i''{ print$9 }'  temp_bandstructure.dat)
    emf=$(awk -v i=$i 'NR==i''{ print$10 }'  temp_bandstructure.dat)
    
    for ((j=$(($nheader+1));j<=$(($kmax*$nband+$nheader)); j++)); do
        kn_mf_pband=$(awk -v j=$j 'NR==j''{ print$1 }'  mf_pband.dat)
        emf_mf_pband=$(awk -v j=$j 'NR==j''{ print$3 }'  mf_pband.dat)
        if [ "$kn" = "$kn_mf_pband" ] && (( $(echo "$emf - $emf_mf_pband <= 0.00001" |bc -l) )) && (( $(echo "$emf_mf_pband - $emf <= 0.00001" |bc -l) ))
        then
            awk -v j=$j 'NR==j' mf_pband.dat >> temp_copying_cache.dat
            echo $i $j $kn $kn_mf_pband $emf $emf_mf_pband
            break
        fi
    done
done

Now I'm trying to send the inner for loop to background tasks so that I can run many iterations in parallel. The modified code gives no errors, but shows no progress either:

task(){
    kn_mf_pband=$(awk -v j=$j 'NR==j''{ print$1 }'  mf_pband.dat)
    emf_mf_pband=$(awk -v j=$j 'NR==j''{ print$3 }'  mf_pband.dat)
    if [ "$kn" = "$kn_mf_pband" ] && (( $(echo "$emf - $emf_mf_pband <= 0.00001" |bc -l) )) && (( $(echo "$emf_mf_pband - $emf <= 0.00001" |bc -l) ))
    then
        awk -v j=$j 'NR==j' mf_pband.dat >> temp_copying_cache.dat
        echo $i $j $kn $kn_mf_pband $emf $emf_mf_pband
    fi
}


for ((i=3;i<=$(($kmax*$nband+2)); i++)); do
    kn=$(awk -v i=$i 'NR==i''{ print$9 }'  temp_bandstructure.dat)
    emf=$(awk -v i=$i 'NR==i''{ print$10 }'  temp_bandstructure.dat)
    
    for j in {$(($nheader+1))..$(($kmax*$nband+$nheader))}; do
        ((i=i%20)); ((i++==0)) && wait
        task "$j" &
    done
done
wait

Can anyone tell me why the tasks are not running and, more importantly, how I can get them to run properly?

Jacek
  • Try to fix the errors pointed out by https://www.shellcheck.net/ first. Especially the brace expansion in `for j in {$(($nheader+1))..$(($kmax*$nband+$nheader))}` which does not expand because of the variables. Anyways, this is so slow because you read the same file over and over again. `for i ...; do awk -v i=$i 'NR==i'; done` has quadratic time complexity. You could try to rewrite the script in `awk` only to make it tremendously faster. – Socowi Jul 06 '21 at 14:03
  • @Socowi, can you give a brief example with awk that I can emulate? – Jacek Jul 06 '21 at 14:14
  • Reading the whole file from the beginning to just find a numbered line is silly. Much more efficient to read line-by-line; see [BashFAQ #1](https://mywiki.wooledge.org/BashFAQ/001) -- if you can't just do all the logic in native awk. – Charles Duffy Jul 06 '21 at 14:36
  • @Jacek I added an answer with a skeleton of such an `awk` script. – Socowi Jul 06 '21 at 14:37
  • BTW, note that `cat somefile | awk ...` is slower than `awk ... – Charles Duffy Jul 06 '21 at 14:42

1 Answer

The problem is in

for j in {$(($nheader+1))..$(($kmax*$nband+$nheader))}; do
    ((i=i%20)); ((i++==0)) && wait
    task "$j" &
done

Here, the brace expansion {$(($nheader+1))..$(($kmax*$nband+$nheader))} does not expand to a list of numbers, but to the literal string {47..1234} (actual number for 1234 depends on your file contents).
Then you start task '{47..1234}' & which does nothing, because in task you try to extract values with awk -v j='{47..1234}' 'NR==j', but NR is never {47..1234}.
To fix this, use seq or for ((...; ...; ...)). See How do I iterate over a range of numbers defined by variables in Bash?.
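
As a minimal sketch of the difference, the snippet below runs the brace form next to the two suggested replacements. The values of `kmax` and `nband` are made up for illustration, and `start`, `end`, `literal`, `numbers` and `seq_numbers` are hypothetical helper names that do not appear in the question.

```bash
#!/usr/bin/env bash
# Illustrative values only; the real script derives nband from the data file.
nheader=46
kmax=2
nband=3

start=$((nheader + 1))
end=$((kmax * nband + nheader))

# Brace expansion happens before variable expansion, so the loop body
# runs exactly once with the literal word {47..52}.
literal=$(for j in {$start..$end}; do echo "$j"; done)
echo "brace form: $literal"

# A C-style loop evaluates its bounds arithmetically at run time.
numbers=""
for ((j = start; j <= end; j++)); do
    numbers+="$j "
done
echo "C-style loop: $numbers"

# seq is an alternative when a plain word list is needed.
seq_numbers=$(seq "$start" "$end" | tr '\n' ' ')
echo "seq: $seq_numbers"
```

Only the last two forms pass real line numbers to `task`, so `NR==j` inside it can actually match.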

Anyways, your script is slow because you read the same file over and over again (and because you are starting up a lot of processes). for i ...; do awk -v i=$i 'NR==i'; done has quadratic time complexity. You could try to rewrite the script in awk only to make it tremendously faster. First, read one of the files into an array and keep it in memory, then process the other file.

Here is a skeleton of such an awk script. The idiom FNR==NR is only true when processing the first file.

awk -v bandmax="$(tail -n1 bandstructure.dat | awk '{print $2}')" -v nheader=46 '
  # columns 1 and 3 are the ones the question compares in mf_pband.dat
  FNR==NR && NR>nheader { kn_mf_pband[NR-nheader]=$1; em_mf_pband[NR-nheader]=$3 }
  FNR==NR { next }
  # because of the `next` the following rules are only processed for the 2nd file
  FNR==3 { bandmin=$2 }
  FNR>2 {
    # here you can use for loops to iterate over the stored values in
    # kn_mf_pband[...] and em_mf_pband[...]
  }
' mf_pband.dat bandstructure.dat
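
One way the skeleton could be filled in is sketched below. It is not a tested drop-in replacement: it writes two tiny stand-in files (with 2 header rows instead of 46) so that it can be run as-is, and the names `count`, `energy`, `row`, `tol`, `workdir` and `matches` are hypothetical. It assumes that column 1 of mf_pband.dat and column 9 of temp_bandstructure.dat hold the same k-point index written the same way, and it keeps only the first match per row, like the `break` in the original loop.

```bash
#!/usr/bin/env bash
# Build small stand-in files in a scratch directory (toy data, not the real files).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > temp_bandstructure.dat <<'EOF'
# spin band kx ky kz E(MF) E(QP) DeltaE kn E(MF)5dp
# (units)
1 22 0.00000 0.00000 0.00000 -3.021665798 -4.022414204 -1.000748406 1 -3.02167
1 22 0.00850 0.00000 0.00000 -3.026245712 -4.027334803 -1.001089091 2 -3.02625
EOF

cat > mf_pband.dat <<'EOF'
# header 1
# header 2
1 0.00000 -55.55593 0.998 0.000
1 0.00000 -3.02167 0.500 0.100
2 0.00850 -3.02625 0.400 0.200
EOF

# nheader=2 matches the toy file above; the real file would need nheader=46.
matches=$(awk -v nheader=2 -v tol=0.00001 '
  # First file (mf_pband.dat): keep every data row in memory, grouped by k-point index.
  FNR == NR {
    if (FNR > nheader) {
      n = ++count[$1]
      energy[$1, n] = $3
      row[$1, n] = $0
    }
    next
  }
  # Second file (temp_bandstructure.dat): skip its two header lines.
  FNR > 2 {
    kn = $9; emf = $10
    for (n = 1; n <= count[kn]; n++) {
      diff = emf - energy[kn, n]
      if (diff < 0) diff = -diff
      # Print the first stored row whose energy is within the tolerance.
      if (diff <= tol) { print row[kn, n]; break }
    }
  }
' mf_pband.dat temp_bandstructure.dat)

echo "$matches"
```

Each file is read once, and each row of temp_bandstructure.dat is only compared against the stored rows with the same k-point index. Because awk compares floating-point numbers, a difference of exactly 0.00001 may land on either side of `tol`; a slightly larger tolerance may be needed if such borderline cases matter.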
Socowi
  • I gave AWK a try but am facing some issues. Could you take a look at the modified question? Thanks a lot Socowi! – Jacek Jul 07 '21 at 07:37
  • @Jacek Yes, I can help you. But please don't change your question to something completely different midway. Please open a new question and tag it as `awk`, then [roll back](https://stackoverflow.com/posts/68271823/revisions) your question here to its original so that this answer stays valid. – Socowi Jul 07 '21 at 09:18
  • Thanks for the advice Socowi. Here's the link to the new question [link](https://stackoverflow.com/q/68286106/10747564) – Jacek Jul 07 '21 at 12:33