
I am parsing information from an 18GB file, summarize_eigenvectors.out, which has the following structure:

 Special analysis for state  3293   3.56009
    c    v    weight        ik        kx        ky        kz
    1    1   0.00000         1   0.00000   0.00000   0.00000
    1    1   0.00000         2   0.00000   0.04167   0.00000
    1    2   0.00000         1   0.00000   0.00000   0.00000
    1    2   0.00000         2   0.00000   0.04167   0.00000
    2    1   0.00000         1   0.00000   0.00000   0.00000
    2    1   0.00000         2   0.00000   0.04167   0.00000
    2    2   0.00000         1   0.00000   0.00000   0.00000
    2    2   0.00000         2   0.00000   0.04167   0.00000

Special analysis for state  3294   3.56013
    c    v    weight        ik        kx        ky        kz
    1    1   0.00000         1   0.00000   0.00000   0.00000
    1    1   0.00000         2   0.00000   0.04167   0.00000
    1    2   0.00000         1   0.00000   0.00000   0.00000
    1    2   0.00000         2   0.00000   0.04167   0.00000
    2    1   0.00000         1   0.00000   0.00000   0.00000
    2    1   0.00000         2   0.00000   0.04167   0.00000
    2    2   0.00000         1   0.00000   0.00000   0.00000
    2    2   0.00000         2   0.00000   0.04167   0.00000

In the real system the indices go up to

    12  12   0.00000      1152   0.00000   0.00000   0.00000

so in the real system each section spans 1 marker line + 1 header line + 12*12*1152 = 165,890 lines in total (hence the -A 165889 in the script below). I am using egrep to split each section of the big file into smaller files. An additional file, summarize_eigenvectors_range.in, contains the following:

1870        #total number of excitons to analyze
0.35600872E+01
0.35601277E+01
0.35603700E+01
....

The main script is as below:

#!/bin/bash

P=`pwd`

#if [ -d summarize_eigenvectors ]; then
#   rm -r summarize_eigenvectors
#   mkdir summarize_eigenvectors
#   cd summarize_eigenvectors
#else
#   mkdir summarize_eigenvectors
    cd summarize_eigenvectors
#fi

number=$(awk 'NR==1''{ print$1 }' ../summarize_eigenvectors_range.in)
line=$(( $number + 1 ))
i=2
#start_id=$(grep -m 1 "Special analysis for state" ../summarize_eigenvectors.out | awk '{ print$5 }')
start_id=4137
echo start_id = $start_id

while [ $i -le $line ]
do
    exciton_n=$(awk -v i="$i" 'NR==i''{ print$1 }' ../summarize_eigenvectors_range.in)
    nstring=$(echo $exciton_n | awk -F"E" 'BEGIN{OFMT="%10.5f"} {print $1 * (10 ^ $2)}')
    nid=$(( $start_id + $i - 2 ))
    name=`echo "$nid"_"$nstring" | sed -e 's/[[:space:]]//g'`
    echo "$name"
    mkdir "$name"
    cd "$name"
    mkdir sorted
    egrep -A 165889 "Special analysis for state.*$nid" ../../summarize_eigenvectors.out > $name.txt
    for c in $(seq 1 12); do
        for v in $(seq 1 12); do
            echo -e "    c    v    weight        ik        kx        ky        kz" > "$name"-"$c"_"$v".txt
            awk -v c="$c" -v v="$v" '{ if ($1 == c && $2 == v)  print }' $name.txt >> "$name"-"$c"_"$v".txt
            cat "$name"-"$c"_"$v".txt | sort -k 3 -g -r > ./sorted/"$name"-"$c"_"$v"-sorted.txt
        done
    done
    cd ..
    i=$(( $i + 1 ))
done

This operation takes about 30 seconds per section, and I have thousands of such sections. Is there a better way of doing this so the script runs faster? I'm thinking about using awk, but I don't know how to combine a literal string and a shell variable in a single search pattern; I also don't know whether awk would perform better.

Any insight into where the performance bottleneck is, and any recommendations on how to improve the code?
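For illustration, the kind of awk call I have in mind for pulling out the marker line of a given state (I'm not sure this is the right syntax) is something like:

nid=3293
awk -v nid="$nid" '$0 ~ ("Special analysis for state[[:space:]]+" nid "[[:space:]]")' ../summarize_eigenvectors.out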

Sample output: a few thousand files. One type contains everything in a "Special analysis" section, with the following content:

 Special analysis for state  {nid}   x.xxxxx
    c    v    weight        ik        kx        ky        kz
    1    1   0.00000         1   0.00000   0.00000   0.00000
....
    12  12   0.00000      1152   0.00000   0.00000   0.00000

Another type divides the above file into per-(c, v) files: c1v1, c1v2, etc. The c1v1 file will look like the following:

    c    v    weight        ik        kx        ky        kz
    1    1   0.00000         1   0.00000   0.00000   0.00000
    1    1   0.00000         2   0.00000   0.00000   0.00000
....
    1    1   0.00000      1152   0.00000   0.00000   0.00000

The c1v2 file will look like the following:

    c    v    weight        ik        kx        ky        kz
    1    2   0.00000         1   0.00000   0.00000   0.00000
    1    2   0.00000         2   0.00000   0.00000   0.00000
....
    1    2   0.00000      1152   0.00000   0.00000   0.00000
Jacek
  • Please, post a testable sample with the expected output. – James Brown Nov 26 '20 at 14:11
  • Need more details, e.g., what is `$line`, a brief textual explanation of what you're doing (parsing chunks of 1154 lines into separate files?), perhaps a few more sample input lines for the first 2 sections, and the desired output corresponding to the sample input. Having said that, I'm guessing the whole thing could be done with `awk` and a single pass through the file (as opposed to the thousands of passes you're currently making through the file). – markp-fuso Nov 26 '20 at 14:14
  • @Jacek: I don't see where you set `line` in your script, but in any case, you create a child process for each iteration of your loop, and each of these child processes has to traverse the huge file sequentially from start to end. Perhaps you could think of redesigning your approach so that it needs only a single pass over the file. You just need a "buffer" holding the 1153 lines before and after the match for creating the different output files. – user1934428 Nov 26 '20 at 14:32
  • See also [Why is using a shell loop to process text considered bad practice?](https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice) – Sundeep Nov 26 '20 at 14:45
  • Get rid of the `...`s from your sample input and add the expected output given that input to provide something we can test a potential solution against. See [ask]. – Ed Morton Nov 26 '20 at 14:46
  • It's always puzzling to me when I see code where someone knows to use `$(...)` but only does it for some lines, e.g. `start_id=$(...)` but name=`...` just 5 lines later. – Ed Morton Nov 26 '20 at 15:14
  • I've edited the question to give the full picture. Hope this is clearer. – Jacek Nov 26 '20 at 17:40
  • You've still got a bunch of `...`s in your sample input and expected output which presumably don't exist in your real data, so you still haven't given us something we could test a potential solution against, which makes it hard to help with how to do whatever it is you're trying to do the right way. – Ed Morton Nov 27 '20 at 00:19
  • I've edited the input to show a minimal working model of the real input, reducing the looping indices from 1152 to 2 for ik and from 12 to 2 for c and v. – Jacek Nov 27 '20 at 01:57
  • Your updated question is much better, but it invalidates all the answers you already received. I would suggest that you roll back your edit, accept one of the answers here (or post an answer of your own and accept that, if you prefer), and post a new question linking back to this one. – tripleee Nov 27 '20 at 07:57

4 Answers


As others have noted, the real problem is that you are repeatedly traversing the entire 18GB input file, over and over. You are already using Awk, so converting to a single pass will not be particularly hard.

awk '/Special analysis for state / {
    if (out) close(out)                # finish the previous section, if any
    out = $5 "_some_other_identifier_taken_from_another_file-1_1.txt"
    n = 165890 }                       # lines per section: marker + header + 12*12*1152 data rows
n { n--; print > out }' the_18gb_file.out

This assumes that the analyses are not overlapping in the input file.

You are not revealing where "some other identifier" should come from, but hopefully it won't be very hard to integrate into this script.

Awk examines one line at a time and runs each condition/action pair on it; variables which have not been set are simply empty (which conveniently evaluates to false in a boolean context, and zero in a numeric context). When we see the marker for a new entry on the line we are processing now, we close any already-open output file (out is set from a previous iteration), build the file name for the new entry, and set the counter n to the number of lines to write to it. The second condition is true as long as we have not yet written that many lines; we then decrement n and write the current line to the file designated in the first block.
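As a sketch of how "some other identifier" might be grafted in, assuming the energies in summarize_eigenvectors_range.in (line 2 onward) are listed in the same order as the sections appear in the big file, and with a purely illustrative naming scheme:

awk 'NR == FNR { if (FNR > 1) energy[FNR - 1] = $1; next }   # first file: collect energies, skipping the count line
/Special analysis for state / {
    if (out) close(out)
    out = $5 "_" energy[++k] ".txt"    # e.g. 3293_0.35600872E+01.txt
    n = 165890 }
n { n--; print > out }' summarize_eigenvectors_range.in the_18gb_file.out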

tripleee
  • This is a common problem; unfortunately, few people think to google the obvious "my algorithm sucks but I wantonly blame my tool". – tripleee Nov 26 '20 at 15:55
  • Can you explain this code? I don't understand the if part. This seems to be a way forward. – Jacek Nov 27 '20 at 04:17
  • Added a brief explanation. If you need help grafting in "some other identifier", maybe accept this and ask a new question; feel free to ping me here with a comment if you do. – tripleee Nov 27 '20 at 05:47
  • Thanks for the explanation. But any idea how this compares, performance-wise, to the csplit attempt? – Jacek Nov 27 '20 at 07:06
  • Probably they are about the same, but this one is closer to your specification if, for example, there are some lines you don't want at the top of the file or between the entries we extract. – tripleee Nov 27 '20 at 07:51

If your goal is simply to split the file on a delimiter, then you should use csplit if it's available. Something like this:

csplit --quiet --digits=4 -z the_18gb_file.out "/Special analysis/" "{*}" 
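csplit names the pieces xx0000, xx0001, and so on. If the pieces then need meaningful names, a sketch along these lines (assuming every piece except the first begins with the marker line) could rename each one after its state id and energy:

for f in xx*; do
    # marker line: " Special analysis for state  NID   ENERGY"
    read -r _ _ _ _ nid energy < "$f"
    mv "$f" "${nid}_${energy}.txt"
done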
DavidW
  • This is great for generating the first type of files, but the subsequent operations will be troublesome. – Jacek Nov 26 '20 at 17:42

The poor performance is not due to egrep itself, but to the fact that the while-loop makes egrep traverse the entire file again on every iteration. If you want to split your file and then run grep over the pieces, look for a way to split the file without a while-loop, as proposed by DavidW.

Dominique

Thanks a lot for all your input. I've adopted DavidW's suggestion of using csplit for the initial step. The per-(c, v) filing done by the awk lines in the question still needs work, but I've managed to parallelize the process with the code attached below. The initial 14-hour job now finishes in a few minutes. Hope it is helpful for anyone with the same issue.

#!/bin/bash

P=`pwd`

if [ -d summarize_eigenvectors ]; then
    rm -r summarize_eigenvectors
    mkdir summarize_eigenvectors
    cd summarize_eigenvectors
else
    mkdir summarize_eigenvectors
    cd summarize_eigenvectors
fi

# split the big file into one piece per "Special analysis" section
csplit --quiet --digits=4 -z ../summarize_eigenvectors.out "/Special analysis/" "{*}"

# the piece before the first marker holds the k-point index
mv xx0000 k-point_index

task(){
    nid=$(awk 'NR==1''{ print$5 }' $file)
    energy=$(awk 'NR==1''{ print$6 }' $file)
    name=$(echo "$nid"_"$energy")
    echo $name
    mkdir $name
    mv $file ./$name/$name.txt
    cd $name

    for c in $(seq 1 12); do
        for v in $(seq 1 12); do
            echo -e "    c    v    weight        ik        kx        ky        kz" > "$name"-"$c"_"$v".txt
            awk -v c="$c" -v v="$v" '{ if ($1 == c && $2 == v)  print }' $name.txt >> "$name"-"$c"_"$v".txt
            cat "$name"-"$c"_"$v".txt | sort -k 3 -g -r > "$name"-"$c"_"$v"-sorted.txt &
        done
    done
    cd ..
}


for file in ./xx*; do
((i=i%120)); ((i++==0)) && wait    # throttle: after every 120 background tasks, wait for the batch to finish
task "$file" &
done
wait
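A quick sanity check along these lines (a sketch; it assumes the {nid}_{energy} directory names created above) can confirm that one directory was produced per exciton listed on line 1 of the range file:

expected=$(awk 'NR==1 { print $1 }' ../summarize_eigenvectors_range.in)
actual=$(find . -maxdepth 1 -type d -name '[0-9]*_*' | wc -l)
echo "expected $expected directories, found $actual"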
Jacek
  • The parallelization lines are adapted from https://unix.stackexchange.com/questions/103920/parallelize-a-bash-for-loop – Jacek Nov 27 '20 at 15:07
  • P=`pwd` seems completely superfluous; you don't appear to use this variable for anything. Anyway, the shell already knows which directory it's in (Bash has $PWD built in, and usually you don't need to know anyway, unless you want to convert relative paths to absolute ones). – tripleee Nov 30 '20 at 05:44
  • Also, lose the [useless `echo`.](http://www.iki.fi/era/unix/award.html#echo) More generally [quote your variables](https://stackoverflow.com/questions/10067266/when-to-wrap-quotes-around-a-shell-variable) and probably run this through http://shellcheck.net/ for any additional diagnostics. – tripleee Nov 30 '20 at 05:46