I have two files:
temp_bandstructure.dat
has the following format
# spin band kx ky kz E(MF) E(QP) Delta E kn E(MF)5dp
# (Cartesian coordinates) (eV) (eV) (eV) (eV)
1 22 0.00000 0.00000 0.00000 -3.021665798 -4.022414204 -1.000748406 1 -3.02167
1 22 0.00850 0.00000 0.00000 -3.026245712 -4.027334803 -1.001089091 2 -3.02625
1 22 0.01699 0.00000 0.00000 -3.039924052 -4.061680485 -1.021756433 3 -3.03992
1 22 0.00000 0.00000 0.00000 -3.021665798 -4.022414204 -1.000748406 4 -3.02167
1 29 0.00000 0.00000 0.00000 -1.344238286 -2.629257334 -1.285019048 1 -1.34424
mf_pband.dat
has 46 header rows and more data rows than temp_bandstructure.dat. The extra data are not useful and should not make their way into the final output.
#header row
#header row
3 0.02000 -3.03993 0.984 0.000 0.010 0.011 0.000 0.000 0.010 0.011 0.000 0.000 0.005 0.014 0.000 0.000 0.005 0.014 0.000 0.001 0.000 0.021 0.000 0.003 0.000 0.000 0.000 0.000 0.000 0.001 0.000 0.021 0.000 0.003 0.000 0.000 0.000 0.000 0.000 0.001 0.000 0.426 0.000 0.001 0.000 0.426 0.000
2 0.01000 -3.02624 0.982 0.000 0.009 0.011 0.000 0.000 0.009 0.011 0.000 0.000 0.005 0.014 0.000 0.000 0.005 0.014 0.000 0.001 0.000 0.021 0.000 0.003 0.000 0.000 0.000 0.000 0.000 0.001 0.000 0.021 0.000 0.003 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.427 0.000 0.000 0.000 0.427 0.000
4 0.00000 -3.02167 0.982 0.000 0.009 0.011 0.000 0.000 0.009 0.011 0.000 0.000 0.005 0.014 0.000 0.000 0.005 0.014 0.000 0.001 0.000 0.021 0.000 0.003 0.000 0.000 0.000 0.000 0.000 0.001 0.000 0.021 0.000 0.003 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.427 0.000 0.000 0.000 0.427 0.000
1 0.00000 -3.02167 0.982 0.000 0.009 0.011 0.000 0.000 0.009 0.011 0.000 0.000 0.005 0.014 0.000 0.000 0.005 0.014 0.000 0.001 0.000 0.021 0.000 0.003 0.000 0.000 0.000 0.000 0.000 0.001 0.000 0.021 0.000 0.003 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.427 0.000 0.000 0.000 0.427 0.000
1 0.00000 -1.34424 0.994 0.000 0.000 0.046 0.000 0.000 0.000 0.046 0.000 0.000 0.004 0.263 0.000 0.000 0.004 0.263 0.000 0.000 0.000 0.000 0.000 0.018 0.000 0.000 0.002 0.149 0.000 0.000 0.000 0.000 0.000 0.018 0.000 0.000 0.002 0.149 0.000 0.000 0.000 0.002 0.013 0.000 0.000 0.002 0.013
1 0.00000 -55.55593 0.998 0.000 0.001 0.000 0.000 0.000 0.001 0.000 0.000 0.000 0.003 0.000 0.000 0.000 0.003 0.000 0.000 0.490 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.002 0.492 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.002 0.002 0.000 0.000 0.000 0.002 0.000 0.000 0.000
I have a nested for loop that compares columns 1 and 3 of every row in mf_pband.dat against columns 9 and 10 of every row in temp_bandstructure.dat. Column 1 must equal column 9 exactly (the k-point index kn), and column 3 must agree with column 10 (the mean-field energy) to within 0.00001. When both conditions hold, the script prints the entire row of mf_pband.dat to a cache file.
For example, counting data rows only, the script should be able to match rows 4, 2, 1, 3, 5 of mf_pband.dat with rows 1, 2, 3, 4, 5 of temp_bandstructure.dat, giving the output
1 22 0.00000 0.00000 0.00000 -3.021665798 -4.022414204 -1.000748406 1 -3.02167 1 0.00000 -3.02167 0.982 0.000 0.009 0.011 0.000 0.000 0.009 0.011 0.000 0.000 0.005 0.014 0.000 0.000 0.005 0.014 0.000 0.001 0.000 0.021 0.000 0.003 0.000 0.000 0.000 0.000 0.000 0.001 0.000 0.021 0.000 0.003 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.427 0.000 0.000 0.000 0.427 0.000
1 22 0.00850 0.00000 0.00000 -3.026245712 -4.027334803 -1.001089091 2 -3.02625 2 0.01000 -3.02624 0.982 0.000 0.009 0.011 0.000 0.000 0.009 0.011 0.000 0.000 0.005 0.014 0.000 0.000 0.005 0.014 0.000 0.001 0.000 0.021 0.000 0.003 0.000 0.000 0.000 0.000 0.000 0.001 0.000 0.021 0.000 0.003 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.427 0.000 0.000 0.000 0.427 0.000
1 22 0.01699 0.00000 0.00000 -3.039924052 -4.061680485 -1.021756433 3 -3.03992 3 0.02000 -3.03993 0.984 0.000 0.010 0.011 0.000 0.000 0.010 0.011 0.000 0.000 0.005 0.014 0.000 0.000 0.005 0.014 0.000 0.001 0.000 0.021 0.000 0.003 0.000 0.000 0.000 0.000 0.000 0.001 0.000 0.021 0.000 0.003 0.000 0.000 0.000 0.000 0.000 0.001 0.000 0.426 0.000 0.001 0.000 0.426 0.000
1 22 0.00000 0.00000 0.00000 -3.021665798 -4.022414204 -1.000748406 4 -3.02167 4 0.00000 -3.02167 0.982 0.000 0.009 0.011 0.000 0.000 0.009 0.011 0.000 0.000 0.005 0.014 0.000 0.000 0.005 0.014 0.000 0.001 0.000 0.021 0.000 0.003 0.000 0.000 0.000 0.000 0.000 0.001 0.000 0.021 0.000 0.003 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.427 0.000 0.000 0.000 0.427 0.000
1 29 0.00000 0.00000 0.00000 -1.344238286 -2.629257334 -1.285019048 1 -1.34424 1 0.00000 -1.34424 0.994 0.000 0.000 0.046 0.000 0.000 0.000 0.046 0.000 0.000 0.004 0.263 0.000 0.000 0.004 0.263 0.000 0.000 0.000 0.000 0.000 0.018 0.000 0.000 0.002 0.149 0.000 0.000 0.000 0.000 0.000 0.018 0.000 0.000 0.002 0.149 0.000 0.000 0.000 0.002 0.013 0.000 0.000 0.002 0.013
The extra row 6 of mf_pband.dat does not make it into the final output because it has no match.
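The matching rule described above (equal kn, energies within 0.00001) can become sensitive to floating-point rounding once the comparison moves from bc to awk, because pairs such as -3.02625 and -3.02624 differ by exactly the tolerance. Below is a minimal sketch of one way to make the test robust, assuming both energies are printed to five decimal places as in the samples. The helper names `units` and `close_enough` and the output file `tolerance_demo.txt` are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers: compare two energies in integer units of 1e-5 eV,
# so that a difference of exactly 0.00001 is not lost to binary rounding.
awk '
function units(x) { return sprintf("%.0f", x * 100000) + 0 }
function close_enough(a, b,    d) {
    d = units(a) - units(b)
    return (d < 0 ? -d : d) <= 1
}
BEGIN {
    print close_enough(-3.02625, -3.02624)   # differ by 0.00001
    print close_enough(-3.02167, -3.02167)   # identical
    print close_enough(-3.02167, -1.34424)   # different bands
}' > tolerance_demo.txt
cat tolerance_demo.txt
```

The three lines printed are 1, 1 and 0. Adding a small slack to the tolerance (for example 1e-9) is another option if rescaling to integers is not convenient.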
I wrote a working for loop that gets the job done, but at a very slow pace:
kmax=207
bandmin=$(cat bandstructure.dat | awk 'NR==3''{ print$2 }')
bandmax=$(tac bandstructure.dat | awk 'NR==1''{ print$2 }')
nband=$(($bandmax-$bandmin+1))
nheader=46
for ((i=3;i<=$(($kmax*$nband+2)); i++)); do
    kn=$(awk -v i=$i 'NR==i''{ print$9 }' temp_bandstructure.dat)
    emf=$(awk -v i=$i 'NR==i''{ print$10 }' temp_bandstructure.dat)
    for ((j=$(($nheader+1));j<=$(($kmax*$nband+$nheader)); j++)); do
        kn_mf_pband=$(awk -v j=$j 'NR==j''{ print$1 }' mf_pband.dat)
        emf_mf_pband=$(awk -v j=$j 'NR==j''{ print$3 }' mf_pband.dat)
        if [ "$kn" = "$kn_mf_pband" ] && (( $(echo "$emf - $emf_mf_pband <= 0.00001" |bc -l) )) && (( $(echo "$emf_mf_pband - $emf <= 0.00001" |bc -l) ))
        then
            awk -v j=$j 'NR==j' mf_pband.dat >> temp_copying_cache.dat
            echo $i $j $kn $kn_mf_pband $emf $emf_mf_pband
            break
        fi
    done
done
Now I'm trying to use AWK arrays to speed up the process. Drawing my inspiration from Socowi and here, I wrote the following to replace the for loops. However, I am not sure of the correct syntax for referencing the arrays.
awk -v nheader=$nheader 'NR==FNR && NR>nheader { a[NR-nheader]=$1; b[NR-nheader]=$3; c[NR-nheader]=$0 next }
FNR>2 { d[NR-2]=$9; e[Nr-2]=$10 }(a == d) && (abs(b - e) <= 0.00001){ print $0, c[$1] }' mf_pband.dat temp_bandstructure.dat > temp_copying_cache.dat
Can anyone tell me what the correct syntax should be?
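For reference, here is a rough sketch of what the array-based version could look like, not a definitive answer. It reads mf_pband.dat first and groups its data rows by kn, then scans temp_bandstructure.dat and prints each row followed by the first stored row whose energy agrees within the tolerance. The array names `cnt`, `emf` and `row`, the directory `awk_join_demo`, and the 1e-9 slack on the tolerance are assumptions of this sketch. The toy mf_pband.dat rows are shortened to four columns, and nheader is 2 here instead of 46.

```shell
#!/bin/sh
# Toy stand-ins for the two files, written to a scratch directory.
mkdir -p awk_join_demo

cat > awk_join_demo/temp_bandstructure.dat <<'EOF'
# spin band kx ky kz E(MF) E(QP) Delta E kn E(MF)5dp
# (Cartesian coordinates) (eV) (eV) (eV) (eV)
1 22 0.00000 0.00000 0.00000 -3.021665798 -4.022414204 -1.000748406 1 -3.02167
1 22 0.00850 0.00000 0.00000 -3.026245712 -4.027334803 -1.001089091 2 -3.02625
1 22 0.01699 0.00000 0.00000 -3.039924052 -4.061680485 -1.021756433 3 -3.03992
1 22 0.00000 0.00000 0.00000 -3.021665798 -4.022414204 -1.000748406 4 -3.02167
1 29 0.00000 0.00000 0.00000 -1.344238286 -2.629257334 -1.285019048 1 -1.34424
EOF

cat > awk_join_demo/mf_pband.dat <<'EOF'
#header row
#header row
3 0.02000 -3.03993 0.984
2 0.01000 -3.02624 0.982
4 0.00000 -3.02167 0.982
1 0.00000 -3.02167 0.982
1 0.00000 -1.34424 0.994
1 0.00000 -55.55593 0.998
EOF

nheader=2   # would be 46 for the real mf_pband.dat

awk -v nheader="$nheader" -v tol=0.00001 '
function abs(x) { return x < 0 ? -x : x }

# Pass 1 (mf_pband.dat): group the data rows by kn, which is column 1.
NR == FNR {
    if (FNR > nheader) {
        n = ++cnt[$1]
        emf[$1, n] = $3      # energy of the n-th row seen for this kn
        row[$1, n] = $0      # the full row, kept for printing later
    }
    next
}

# Pass 2 (temp_bandstructure.dat): skip the comment lines.
/^#/ { next }

# For each remaining row, try only the stored rows with the same kn
# and print the first one whose energy is within the tolerance.
{
    for (i = 1; i <= cnt[$9]; i++)
        if (abs($10 - emf[$9, i]) <= tol + 1e-9) {
            print $0, row[$9, i]
            break
        }
}
' awk_join_demo/mf_pband.dat awk_join_demo/temp_bandstructure.dat \
    > awk_join_demo/temp_copying_cache.dat

cat awk_join_demo/temp_copying_cache.dat
```

Because each file is read only once, the cost no longer grows with the product of the two row counts. The output follows the row order of temp_bandstructure.dat, and the unmatched -55.55593 row is never printed.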
Update:
Building on @EdMorton's solution, I arrived at the following code, which uses NR as the array index to work around the repeated values in $9. However, something is not right: the code currently produces no output.
awk -v nheader=$nheader '
/^#/ { next }
NR==FNR { rec[NR]=$0; k[NR]=$9; val[NR]=$10; next }
($1 == k[NR]) && (abs(val[NR] - $3) <= 0.0001) { print rec[NR], $0 }
function abs(x) { return (x<0 ? -x : x) }
' temp_bandstructure.dat mf_pband.dat > temp_copying_cache.dat
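One likely reason the script above prints nothing: NR keeps counting across input files, so by the time mf_pband.dat is being read, NR is already past the last line of temp_bandstructure.dat, and k[NR] and val[NR] were never set for those indices. A small illustration with two hypothetical two-line files (the directory and file names are made up):

```shell
#!/bin/sh
# Show how NR and FNR evolve when awk reads two files in a row.
mkdir -p nr_fnr_demo
printf 'a1\na2\n' > nr_fnr_demo/first.txt
printf 'b1\nb2\n' > nr_fnr_demo/second.txt

# NR is the running record count; FNR restarts at 1 for each file.
awk '{ print FILENAME, NR, FNR }' \
    nr_fnr_demo/first.txt nr_fnr_demo/second.txt > nr_fnr_demo/out.txt

cat nr_fnr_demo/out.txt
```

The second file is reported with NR values 3 and 4 while FNR goes back to 1 and 2. Indexing by FNR would not pair the rows either, since the two files list their rows in a different order; a lookup keyed on kn is one way around that.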