
I need to egrep a large CSV file with 2 million lines, and I want to cut the egrep time down to about 0.5 seconds. Is this possible at all? No, I don't want a database (sqlite3 or MySQL) at this time.

$ time wc foo.csv
2000000 22805420 334452932 foo.csv
real    0m3.396s
user    0m3.261s
sys     0m0.115s

I've been able to cut the run time down from 40 seconds to 1.75 seconds:

$ time egrep -i "storm|broadway|parkway center|chief financial" foo.csv|wc -l

108292

real    0m40.707s
user    0m40.137s
sys     0m0.309s

$ time LC_ALL=C egrep -i "storm|broadway|parkway center|chief financial" foo.csv|wc -l

108292

real    0m1.751s
user    0m1.590s
sys     0m0.140s

But I want the egrep real time to be less than half a second. Any tricks would be greatly appreciated. The file changes continuously, so I can't use any caching mechanism.

1 Answer

If you are just searching for keywords, you could use fgrep (or grep -F) instead of egrep:

LC_ALL=C grep -F -i -e storm -e broadway -e "parkway center" -e "chief financial"
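Plugged into the same pipeline as in the question (a sketch of the invocation only; the timing will depend on your machine), that would be:

$ time LC_ALL=C grep -F -i -e storm -e broadway -e "parkway center" -e "chief financial" foo.csv | wc -l

With -F each pattern is treated as a literal string, so no regular-expression engine is involved at all.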

The next thing to try would be factoring out -i, which is probably now the bottleneck. If you're sure that only the first letter might be capitalized, for example, you could do:

LC_ALL=C grep -F \
   -e{S,s}torm -e{B,b}roadway -e{P,p}"arkway "{C,c}enter -e{C,c}"hief "{F,f}inancial
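To double-check what grep actually receives, you can have the shell print the expanded arguments (this relies on bash/zsh brace expansion; the quoting keeps each two-word pattern as a single argument):

$ printf '%s\n' -e{S,s}torm -e{B,b}roadway -e{P,p}"arkway "{C,c}enter -e{C,c}"hief "{F,f}inancial
-eStorm
-estorm
-eBroadway
-ebroadway
-eParkway Center
-eParkway center
-eparkway Center
-eparkway center
-eChief Financial
-eChief financial
-echief Financial
-echief financial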