I am parsing information from an 18GB file, summarize_eigenvectors.out, which has the following structure:
Special analysis for state 3293 3.56009
c v weight ik kx ky kz
1 1 0.00000 1 0.00000 0.00000 0.00000
1 1 0.00000 2 0.00000 0.04167 0.00000
1 2 0.00000 1 0.00000 0.00000 0.00000
1 2 0.00000 2 0.00000 0.04167 0.00000
2 1 0.00000 1 0.00000 0.00000 0.00000
2 1 0.00000 2 0.00000 0.04167 0.00000
2 2 0.00000 1 0.00000 0.00000 0.00000
2 2 0.00000 2 0.00000 0.04167 0.00000
Special analysis for state 3294 3.56013
c v weight ik kx ky kz
1 1 0.00000 1 0.00000 0.00000 0.00000
1 1 0.00000 2 0.00000 0.04167 0.00000
1 2 0.00000 1 0.00000 0.00000 0.00000
1 2 0.00000 2 0.00000 0.04167 0.00000
2 1 0.00000 1 0.00000 0.00000 0.00000
2 1 0.00000 2 0.00000 0.04167 0.00000
2 2 0.00000 1 0.00000 0.00000 0.00000
2 2 0.00000 2 0.00000 0.04167 0.00000
In the real system the indices go up to
12 12 0.00000 1152 0.00000 0.00000 0.00000
so each section contains 12 × 12 × 1152 = 165888 data lines plus the column-header line (hence the -A 165889 in the script below).
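For orientation, the section headers can be listed straight off the big file with, e.g.:

grep -n "Special analysis for state" summarize_eigenvectors.out | head -3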
I am using egrep to split each section of the big file into smaller files. An additional file, summarize_eigenvectors_range.in, contains the following:
1870 #total number of excitons to analyze
0.35600872E+01
0.35601277E+01
0.35603700E+01
....
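The energies are in Fortran E-notation; the conversion I use to build the fixed-point strings for the directory names works like this (note the leading spaces produced by the %10.5f format, which is why the script strips whitespace afterwards):

echo "0.35600872E+01" | awk -F"E" 'BEGIN{OFMT="%10.5f"} {print $1 * (10 ^ $2)}'
# prints "   3.56009"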
The main script is as follows:
#!/bin/bash
# create the output directory if it does not exist yet, then work inside it
mkdir -p summarize_eigenvectors
cd summarize_eigenvectors
# the first line of the range file holds the total number of excitons
number=$(awk 'NR==1 { print $1 }' ../summarize_eigenvectors_range.in)
line=$(( number + 1 ))
i=2
#start_id=$(grep -m 1 "Special analysis for state" ../summarize_eigenvectors.out | awk '{ print $5 }')
start_id=4137
echo "start_id = $start_id"
while [ $i -le $line ]
do
    # energy of the i-th exciton, in Fortran E-notation
    exciton_n=$(awk -v i="$i" 'NR==i { print $1 }' ../summarize_eigenvectors_range.in)
    # convert the E-notation energy to a fixed-point string for the directory name
    nstring=$(echo "$exciton_n" | awk -F"E" 'BEGIN{OFMT="%10.5f"} {print $1 * (10 ^ $2)}')
    nid=$(( start_id + i - 2 ))
    name=$(echo "${nid}_${nstring}" | sed -e 's/[[:space:]]//g')
    echo "$name"
    mkdir "$name"
    cd "$name"
    mkdir sorted
    # each section is 12*12*1152 = 165888 data lines plus one column-header line
    grep -E -A 165889 "Special analysis for state +$nid " ../../summarize_eigenvectors.out > "$name".txt
    for c in $(seq 1 12); do
        for v in $(seq 1 12); do
            echo " c v weight ik kx ky kz" > "$name"-"$c"_"$v".txt
            awk -v c="$c" -v v="$v" '$1 == c && $2 == v' "$name".txt >> "$name"-"$c"_"$v".txt
            sort -k 3 -g -r "$name"-"$c"_"$v".txt > ./sorted/"$name"-"$c"_"$v"-sorted.txt
        done
    done
    cd ..
    i=$(( i + 1 ))
done
This operation takes about 30 seconds per section, and I have thousands of such sections. Is there a better way of doing this so the script runs faster? I am thinking about using awk, but I don't know how to combine a search for a fixed string with a variable, and I don't know whether awk would perform better.
Any insight on where the performance bottleneck is, and any recommendations on how to improve the code?
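For what it's worth, here is the kind of thing I am imagining (untested sketches; the section_ file name and the grab/out variable names are placeholders of mine). To pull out one specific state, combining the fixed string with a shell variable, and to stop reading as soon as that section ends:

awk -v nid="$nid" '
    $1 == "Special" && $5 == nid { grab = 1 }       # header of the wanted section
    grab && $1 == "Special" && $5 != nid { exit }   # next section starts: stop reading
    grab                                            # print header and data lines
' summarize_eigenvectors.out > "$name".txt

And, more radically, a single pass that splits every section out in one read of the 18GB file:

awk '
    /Special analysis for state/ {
        if (out) close(out)             # close the previous section file
        out = "section_" $5 ".txt"      # e.g. section_3293.txt
    }
    out { print > out }
' summarize_eigenvectors.out

Would something along these lines be the right direction?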
Sample output: a few thousand files. One type contains everything in a "Special analysis" section, with the following content:
Special analysis for state {nid} x.xxxxx
c v weight ik kx ky kz
1 1 0.00000 1 0.00000 0.00000 0.00000
....
12 12 0.00000 1152 0.00000 0.00000 0.00000
Another type divides the above file by (c, v) pair into c1v1, c1v2, etc. The c1v1 file looks like this:
c v weight ik kx ky kz
1 1 0.00000 1 0.00000 0.00000 0.00000
1 1 0.00000 2 0.00000 0.00000 0.00000
....
1 1 0.00000 1152 0.00000 0.00000 0.00000
and the c1v2 file looks like this:
c v weight ik kx ky kz
1 2 0.00000 1 0.00000 0.00000 0.00000
1 2 0.00000 2 0.00000 0.00000 0.00000
....
1 2 0.00000 1152 0.00000 0.00000 0.00000
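One more detail I noticed while writing this up: the sort at the end of the inner loop runs over the header line as well as the data. If that ever misplaces the header, a variant that sorts only the data rows might be (untested):

for f in "$name"-*_*.txt; do
    { head -n 1 "$f"; tail -n +2 "$f" | sort -k 3 -g -r; } > "sorted/$(basename "$f" .txt)-sorted.txt"
done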