AWK comparing two columns in each of two files that have headers

Question

I have two files:

temp_bandstructure.dat has the following format

# spin    band          kx          ky          kz          E(MF)          E(QP)        Delta E kn  E(MF)5dp
#                        (Cartesian coordinates)             (eV)           (eV)           (eV)     (eV)
     1      22     0.00000     0.00000     0.00000   -3.021665798   -4.022414204   -1.000748406 1   -3.02167
     1      22     0.00850     0.00000     0.00000   -3.026245712   -4.027334803   -1.001089091 2   -3.02625
     1      22     0.01699     0.00000     0.00000   -3.039924052   -4.061680485   -1.021756433 3   -3.03992
     1      22     0.00000     0.00000     0.00000   -3.021665798   -4.022414204   -1.000748406 4   -3.02167
     1      29     0.00000     0.00000     0.00000   -1.344238286   -2.629257334   -1.285019048 1   -1.34424

mf_pband.dat has 46 header rows and more data rows than temp_bandstructure.dat. The extra rows are not useful and should not make their way into the final output.

#header row
#header row
  3     0.02000    -3.03993   0.984   0.000   0.010   0.011   0.000   0.000   0.010   0.011   0.000   0.000   0.005   0.014   0.000   0.000   0.005   0.014   0.000   0.001   0.000   0.021   0.000   0.003   0.000   0.000   0.000   0.000   0.000   0.001   0.000   0.021   0.000   0.003   0.000   0.000   0.000   0.000   0.000   0.001   0.000   0.426   0.000   0.001   0.000   0.426   0.000
  2     0.01000    -3.02624   0.982   0.000   0.009   0.011   0.000   0.000   0.009   0.011   0.000   0.000   0.005   0.014   0.000   0.000   0.005   0.014   0.000   0.001   0.000   0.021   0.000   0.003   0.000   0.000   0.000   0.000   0.000   0.001   0.000   0.021   0.000   0.003   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.427   0.000   0.000   0.000   0.427   0.000
  4     0.00000    -3.02167   0.982   0.000   0.009   0.011   0.000   0.000   0.009   0.011   0.000   0.000   0.005   0.014   0.000   0.000   0.005   0.014   0.000   0.001   0.000   0.021   0.000   0.003   0.000   0.000   0.000   0.000   0.000   0.001   0.000   0.021   0.000   0.003   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.427   0.000   0.000   0.000   0.427   0.000
  1     0.00000    -3.02167   0.982   0.000   0.009   0.011   0.000   0.000   0.009   0.011   0.000   0.000   0.005   0.014   0.000   0.000   0.005   0.014   0.000   0.001   0.000   0.021   0.000   0.003   0.000   0.000   0.000   0.000   0.000   0.001   0.000   0.021   0.000   0.003   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.427   0.000   0.000   0.000   0.427   0.000
  1     0.00000    -1.34424   0.994   0.000   0.000   0.046   0.000   0.000   0.000   0.046   0.000   0.000   0.004   0.263   0.000   0.000   0.004   0.263   0.000   0.000   0.000   0.000   0.000   0.018   0.000   0.000   0.002   0.149   0.000   0.000   0.000   0.000   0.000   0.018   0.000   0.000   0.002   0.149   0.000   0.000   0.000   0.002   0.013   0.000   0.000   0.002   0.013
  1     0.00000   -55.55593   0.998   0.000   0.001   0.000   0.000   0.000   0.001   0.000   0.000   0.000   0.003   0.000   0.000   0.000   0.003   0.000   0.000   0.490   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.002   0.492   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.002   0.002   0.000   0.000   0.000   0.002   0.000   0.000   0.000

I have a nested for loop that compares columns 1 and 3 of every row in mf_pband.dat against columns 9 and 10 of every row in temp_bandstructure.dat. If column 1 equals column 9 and column 3 matches column 10 to within 0.00001, the script prints the entire row of mf_pband.dat to a cache file. For example, the script should be able to match rows 4, 2, 1, 3, 5 of mf_pband.dat with rows 1, 2, 3, 4, 5 of temp_bandstructure.dat, giving the output

     1      22     0.00000     0.00000     0.00000   -3.021665798   -4.022414204   -1.000748406 1   -3.02167  1     0.00000    -3.02167   0.982   0.000   0.009   0.011   0.000   0.000   0.009   0.011   0.000   0.000   0.005   0.014   0.000   0.000   0.005   0.014   0.000   0.001   0.000   0.021   0.000   0.003   0.000   0.000   0.000   0.000   0.000   0.001   0.000   0.021   0.000   0.003   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.427   0.000   0.000   0.000   0.427   0.000
     1      22     0.00850     0.00000     0.00000   -3.026245712   -4.027334803   -1.001089091 2   -3.02625  2     0.01000    -3.02624   0.982   0.000   0.009   0.011   0.000   0.000   0.009   0.011   0.000   0.000   0.005   0.014   0.000   0.000   0.005   0.014   0.000   0.001   0.000   0.021   0.000   0.003   0.000   0.000   0.000   0.000   0.000   0.001   0.000   0.021   0.000   0.003   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.427   0.000   0.000   0.000   0.427   0.000
     1      22     0.01699     0.00000     0.00000   -3.039924052   -4.061680485   -1.021756433 3   -3.03992  3     0.02000    -3.03993   0.984   0.000   0.010   0.011   0.000   0.000   0.010   0.011   0.000   0.000   0.005   0.014   0.000   0.000   0.005   0.014   0.000   0.001   0.000   0.021   0.000   0.003   0.000   0.000   0.000   0.000   0.000   0.001   0.000   0.021   0.000   0.003   0.000   0.000   0.000   0.000   0.000   0.001   0.000   0.426   0.000   0.001   0.000   0.426   0.000
     1      22     0.00000     0.00000     0.00000   -3.021665798   -4.022414204   -1.000748406 4   -3.02167  4     0.00000    -3.02167   0.982   0.000   0.009   0.011   0.000   0.000   0.009   0.011   0.000   0.000   0.005   0.014   0.000   0.000   0.005   0.014   0.000   0.001   0.000   0.021   0.000   0.003   0.000   0.000   0.000   0.000   0.000   0.001   0.000   0.021   0.000   0.003   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.427   0.000   0.000   0.000   0.427   0.000
     1      29     0.00000     0.00000     0.00000   -1.344238286   -2.629257334   -1.285019048 1   -1.34424  1     0.00000    -1.34424   0.994   0.000   0.000   0.046   0.000   0.000   0.000   0.046   0.000   0.000   0.004   0.263   0.000   0.000   0.004   0.263   0.000   0.000   0.000   0.000   0.000   0.018   0.000   0.000   0.002   0.149   0.000   0.000   0.000   0.000   0.000   0.018   0.000   0.000   0.002   0.149   0.000   0.000   0.000   0.002   0.013   0.000   0.000   0.002   0.013

The extra row 6 of mf_pband.dat does not make it into the final output because it has no match.

I wrote a working for loop that gets the job done, but at a very slow pace:

kmax=207
bandmin=$(cat bandstructure.dat | awk 'NR==3''{ print$2 }')
bandmax=$(tac bandstructure.dat | awk 'NR==1''{ print$2 }')
nband=$(($bandmax-$bandmin+1))
nheader=46


for ((i=3;i<=$(($kmax*$nband+2)); i++)); do
    kn=$(awk -v i=$i 'NR==i''{ print$9 }'  temp_bandstructure.dat)
    emf=$(awk -v i=$i 'NR==i''{ print$10 }'  temp_bandstructure.dat)
    
    for ((j=$(($nheader+1));j<=$(($kmax*$nband+$nheader)); j++)); do
        kn_mf_pband=$(awk -v j=$j 'NR==j''{ print$1 }'  mf_pband.dat)
        emf_mf_pband=$(awk -v j=$j 'NR==j''{ print$3 }'  mf_pband.dat)
        if [ "$kn" = "$kn_mf_pband" ] && (( $(echo "$emf - $emf_mf_pband <= 0.00001" |bc -l) )) && (( $(echo "$emf_mf_pband - $emf <= 0.00001" |bc -l) ))
        then
            awk -v j=$j 'NR==j' mf_pband.dat >> temp_copying_cache.dat
            echo $i $j $kn $kn_mf_pband $emf $emf_mf_pband
            break
        fi
    done
done

Now I'm trying to use AWK arrays to speed up the process. Drawing my inspiration from Socowi and here, I managed to write the following to replace the for loops. However, I am unfamiliar with how to reference the arrays with the correct syntax.

awk -v nheader=$nheader 'NR==FNR && NR>nheader { a[NR-nheader]=$1; b[NR-nheader]=$3; c[NR-nheader]=$0 next }
     FNR>2 { d[NR-2]=$9; e[Nr-2]=$10 }(a == d) && (abs(b - e) <= 0.00001){ print $0, c[$1] }' mf_pband.dat temp_bandstructure.dat > temp_copying_cache.dat

Can anyone tell me how the correct syntax should be?

Update:

Developing on @EdMorton's solution, I have managed the following code, which uses NR as the array indices to overcome the issue of repeated values in $9. However, something is not right and the code currently is not producing any output.

awk -v nheader=$nheader '
    /^#/ { next }
    NR==FNR { rec[NR]=$0; k[NR]=$9; val[NR]=$10; next }
    ($1 == k[NR]) && (abs(val[NR] - $3) <= 0.0001) { print rec[NR], $0 }
    function abs(x) { return (x<0 ? -x : x) }
' temp_bandstructure.dat mf_pband.dat > temp_copying_cache.dat

Do all the headers in the two files actually start with `#` or did you just do that for the example? — dawg, Jul 07 '21 at 13:56
In the actual problem, the two files describe an identical system. There are 9000+ rows in each of the two files, but their order is jumbled up. I need this script to do the hard work of rearranging the rows in the second file and concatenating them with the first file so that they align according to $9, $10 (file 1) and $1, $3 (file 2). There should not be any line that fails the comparison. — Jacek, Jul 07 '21 at 14:11
In a simplified description of the actual problem, file 2 contains information that file 1 does not. The two files are aligned by the indices in columns $9, $10 (file 1) and $1, $3 (file 2). By correctly aligning the rows and concatenating them, we will be able to analyze what is in file 2 on top of file 1. For example, file 1 describes the location of a population of people, file 2 describes what objects each person owns. By combining these two files correctly, we can see what objects are at what place. (Just an example; the actual system deals with atoms.) — Jacek, Jul 07 '21 at 14:25
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/234631/discussion-between-jacek-and-ed-morton). — Jacek, Jul 07 '21 at 17:11

Socowi · Accepted Answer · 2021-07-08T08:29:13.787

Your code ...

awk -v nheader=$nheader '
    /^#/ { next }
    NR==FNR { rec[NR]=$0; k[NR]=$9; val[NR]=$10; next }
    ($1 == k[NR]) && (abs(val[NR] - $3) <= 0.0001) { print rec[NR], $0 }
    function abs(x) { return (x<0 ? -x : x) }
' temp_bandstructure.dat mf_pband.dat > temp_copying_cache.dat

... does not print anything because you assumed NR to be the line number in the current file, when it actually is the total number of lines processed so far across all files. After the first file is processed, NR keeps on incrementing.

Assume your first file has 99 rows; then you initialize k[] and val[] for indices from 1 to 99. But then in the second file, you access k[] and val[] at the uninitialized indices 100, 101, … . An uninitialized array element compares as 0 in a numeric context (and as the empty string otherwise), so your checks $1 == k[NR] and abs(val[NR] - $3) <= 0.0001 fail, because $1 and $3 are never 0 in your file mf_pband.dat.
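The difference between NR and FNR, and the unset array indices it leads to, can be seen on a toy input. This is only a minimal sketch: the file names first.txt and second.txt and their contents are made up for illustration and stand in for the two data files.

```shell
# Throwaway input files; the names and contents are hypothetical.
tmp=$(mktemp -d)
printf 'a\nb\n'    > "$tmp/first.txt"
printf 'x\ny\nz\n' > "$tmp/second.txt"

# NR keeps counting across files, FNR restarts at 1 for every file.
counters=$(awk '{ print NR, FNR, $0 }' "$tmp/first.txt" "$tmp/second.txt")
echo "$counters"

# k[] is filled at indices 1..2 only, so k[NR] is unset for the second file.
lookups=$(awk '
    NR==FNR { k[NR] = $1; next }
    { print NR, ((NR in k) ? "set" : "unset") }
' "$tmp/first.txt" "$tmp/second.txt")
echo "$lookups"

rm -rf "$tmp"
```

The first command prints NR running from 1 to 5 while FNR goes 1, 2 and then 1, 2, 3. The second prints "unset" for NR 3, 4 and 5, which is the situation the script above runs into for every line of its second file.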

You could use FNR to access the line number in the current file, but then your script still wouldn't do what you want. You would only compare line 1 from the first file to line 1 from the second file and so on, when you actually wanted …

compares column 1 and 3 of every row in mf_pband.dat against column 9 and 10 of every row in temp_bandstructure.dat

Maybe the following works for you. This exploits the fact that the numbers in mf_pband.dat $3 and temp_bandstructure.dat $10 are given to five decimal places, i.e. with a resolution of 0.00001, which is also the allowed delta.

awk -v d=0.00001 -v CONVFMT=%.5f '
  /^#/ { next }
  NR==FNR { a[$1,$3]=a[$1,$3+d]=a[$1,$3-d]=$0; next }
  ($9 SUBSEP $10 in a) { print $0, a[$9,$10] }
' mf_pband.dat temp_bandstructure.dat

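The CONVFMT=%.5f ensures that when $3+d and $3-d are converted to strings for use as array indices, the results always have 5 decimal places, the precision of d and of the numbers in your file. Without that, the default CONVFMT (%.6g) keeps only six significant digits, so the result of -55.55593+0.00001 would have been converted to the string -55.5559, which can never equal a five-decimal value read from the other file.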
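The effect of CONVFMT on computed array indices can be checked in isolation. The following is a small sketch rather than part of the solution: the variable names are arbitrary, and the single input row is cut down to the three leading columns of one mf_pband.dat row.

```shell
# Number-to-string conversion with the default CONVFMT (%.6g):
# only six significant digits survive.
default_key=$(echo '-55.55593' | awk '{ d = 0.00001; key = ($1 + d) ""; print key }')

# The same conversion with CONVFMT=%.5f keeps five decimal places.
fixed_key=$(echo '-55.55593' | awk -v CONVFMT=%.5f '{ d = 0.00001; key = ($1 + d) ""; print key }')

echo "$default_key"   # -55.5559
echo "$fixed_key"     # -55.55592

# With %.5f, the neighbouring key $3+d is stored as "-3.03992", so a lookup
# with the value read as text from the other file succeeds.
matched=$(printf '3 0.02000 -3.03993\n' | awk -v d=0.00001 -v CONVFMT=%.5f '
    { a[$1,$3] = a[$1,$3+d] = a[$1,$3-d] = $0 }
    END { print ((("3" SUBSEP "-3.03992") in a) ? "match" : "no match") }
')
echo "$matched"       # match
```

Fields such as $3 are used as indices with their original text, so only the computed values $3+d and $3-d go through CONVFMT.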

this `a[$1,$3]=a[$1,$3+d]=a[$1,$3-d]=$0` is a very smart move! I was having problem with the array indices as you described. This solves the problem nicely. — Jacek, Jul 08 '21 at 03:52
