Optimizing grep -f piping commands

Question

I have two files.

file1 has some keys that have abc in the second column:

et1 abc
et2 abc
et55 abc

file2 has the column 1 values from file1 in its last column, plus some other numbers I need to add up:

1 2 3 4 5 et1
5 5 5 5 5 et100
3 3 3 3 3 et55
5 5 5 5 4 et1
6 6 6 6 3 et1

For the keys extracted from file1, I need to add up the corresponding column 5 values in file2 when the key matches. file2 itself is very large.

This command seems to be working but it is very slow:

 egrep -isr "abc" file1.tcl | awk '{print $1}' | grep -vwf /dev/stdin file2.tcl | awk '{tl+=$5} END {print tl}'

How would I go about optimizing the pipe? Also, what am I doing wrong with grep -f? Is it generally not recommended to do something like this?

Edit: Expected output is the sum of all column5 in file2 when the column6 key is present in file1

Edit2: Expected output: Since file1 has the keys et1, et2 and et55, adding up column 5 of file2 for the matching keys in rows 1, 3, 4 and 5 gives the expected output 5+3+4+3=15.

Please add your desired output for that sample input to your question. — Cyrus, Dec 17 '18 at 21:48
It's almost always redundant to pipe grep to awk, since awk has built-in regexp matching. — Barmar, Dec 17 '18 at 21:49
The `-r` option is not necessary when you're grepping a specific file, there's no directory to recurse into. — Barmar, Dec 17 '18 at 21:50
Expected output is the sum of all column5 in file2 when the column6 key is present in file1 — identical123456, Dec 17 '18 at 22:21
Don't just tell us, show us. [edit] your question to show the expected output. — Ed Morton, Dec 17 '18 at 23:53
I've revised my answer now that you've clarified the output. It's even simpler than before. — Barmar, Dec 18 '18 at 00:51

Barmar · Answer 1 · 2018-12-18T00:49:44.813

1

Use a single awk to read file1 into the keys of an array. Then when reading file2, add $5 to a total variable when $6 is in the array.

awk 'NR==FNR {if ($2 == "abc") a[$1] = 0; 
              next}
     $6 in a {total += $5}
     END { print total }
    ' file1.tcl file2.tcl
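
To check the approach against the sample data in the question, here is a minimal, self-contained sketch. The scratch directory and the array name `keys` are just illustrative choices, not part of the answer above, and it assumes file1.tcl is non-empty so that the `NR==FNR` test only fires on the first file.

```shell
# Recreate the sample data from the question in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'et1 abc' 'et2 abc' 'et55 abc' > "$workdir/file1.tcl"
printf '%s\n' '1 2 3 4 5 et1' '5 5 5 5 5 et100' '3 3 3 3 3 et55' \
              '5 5 5 5 4 et1' '6 6 6 6 3 et1' > "$workdir/file2.tcl"

# First file (NR==FNR): remember every key whose second column is "abc".
# Second file: add column 5 whenever column 6 is one of those keys.
total=$(awk 'NR==FNR { if ($2 == "abc") keys[$1] = 1; next }
             $6 in keys { total += $5 }
             END { print total + 0 }' "$workdir/file1.tcl" "$workdir/file2.tcl")

echo "$total"   # prints 15 for the sample data (5+3+4+3)
rm -rf "$workdir"
```

The `+ 0` in the `END` block only makes sure a `0` is printed, rather than an empty line, when nothing matches.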

edited Dec 18 '18 at 00:49

answered Dec 17 '18 at 21:54


@EdMorton Not needed since I initialized them all to 0. – Barmar Dec 18 '18 at 00:48
But it turns out he doesn't want per-key totals, just a grand total. – Barmar Dec 18 '18 at 00:50
Thanks. Possibly final question: How would I go about optimizing the grep/awk pipes in the original post while making minimal changes to the pipes. Is grep -f inherently slow ? – identical123456 Dec 18 '18 at 01:10
No, not particularly. But two processes are usually slower than one, unless `grep` is significantly faster than `awk`'s built-in matching. And `grep` has to search the entire line, while `awk` can just match the specific field. – Barmar Dec 18 '18 at 01:12
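
For the "minimal changes" question, a rough sketch of the original pipeline with only its flags adjusted might look like the following. It assumes the goal is to sum the rows that do match a key: `-v` inverts the match, so with the sample data the original command would sum only the et100 row. `-r` and `-s` are dropped because a single named file is searched, and `-F` is added so the keys are treated as fixed strings rather than regexes. The scratch directory is only there to make the example runnable.

```shell
# Recreate the sample data from the question in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'et1 abc' 'et2 abc' 'et55 abc' > "$workdir/file1.tcl"
printf '%s\n' '1 2 3 4 5 et1' '5 5 5 5 5 et100' '3 3 3 3 3 et55' \
              '5 5 5 5 4 et1' '6 6 6 6 3 et1' > "$workdir/file2.tcl"

# Same shape as the original pipe, minus -v/-r/-s and plus -F.
# -w keeps "et1" from matching inside "et100".
total=$(grep -i 'abc' "$workdir/file1.tcl" \
  | awk '{print $1}' \
  | grep -wFf /dev/stdin "$workdir/file2.tcl" \
  | awk '{tl += $5} END {print tl + 0}')

echo "$total"   # prints 15 for the sample data
rm -rf "$workdir"
```

Note that `grep -w` matches a key anywhere on the line, not just in column 6, so this is only equivalent to the single-awk version when the keys cannot show up in the other columns.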

RavinderSingh13 · Answer 2 · 2018-12-18T01:44:28.283

1

Could you please try the following, which reads file2.tcl first and uses fewer loops. Since your expected output is not clear, I haven't completely tested it.

awk 'FNR==NR{a[$NF]+=$(NF-1);next} $2=="abc"{print $1,a[$1]+0}'  file2.tcl file1.tcl
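
As a sketch of what this prints for the sample data in the question, and of how the same idea could be turned into the single grand total the asker clarified later: the second command is a hypothetical variation, not part of the answer above, and the array name `sum` and the scratch directory are illustrative.

```shell
# Recreate the sample data from the question in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'et1 abc' 'et2 abc' 'et55 abc' > "$workdir/file1.tcl"
printf '%s\n' '1 2 3 4 5 et1' '5 5 5 5 5 et100' '3 3 3 3 3 et55' \
              '5 5 5 5 4 et1' '6 6 6 6 3 et1' > "$workdir/file2.tcl"

# As in the answer: sum the next-to-last column per key while reading
# file2, then print one line per "abc" key from file1.
per_key=$(awk 'FNR==NR { sum[$NF] += $(NF-1); next }
               $2 == "abc" { print $1, sum[$1] + 0 }' \
              "$workdir/file2.tcl" "$workdir/file1.tcl")

# Variation: accumulate the per-key sums into one grand total instead.
grand=$(awk 'FNR==NR { sum[$NF] += $(NF-1); next }
             $2 == "abc" { total += sum[$1] }
             END { print total + 0 }' \
            "$workdir/file2.tcl" "$workdir/file1.tcl")

echo "$per_key"   # et1 12 / et2 0 / et55 3, one pair per line
echo "$grand"     # prints 15
rm -rf "$workdir"
```

Reading file2 first means a sum is kept for every key in file2, including ones such as et100 that file1 never asks for, which is the "unnecessary work" mentioned in the comments below.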

edited Dec 18 '18 at 01:44

answered Dec 17 '18 at 22:00


`a[$1]?a[$1]:0` -> `a[$1]+0` – karakfa Dec 17 '18 at 22:23
Although doing unnecessary work, this might be faster than the other way around. – karakfa Dec 17 '18 at 22:26
@karakfa, sure changed it now, thanks for letting me know. – RavinderSingh13 Dec 18 '18 at 01:44
