
I need to egrep a large CSV file with 2 million lines, and I want to cut the egrep time down to about 0.5 seconds. Is this possible at all? No, I don't want a database (sqlite3 or MySQL) at this time.

$ time wc foo.csv
2000000 22805420 334452932 foo.csv
real    0m3.396s
user    0m3.261s
sys     0m0.115s

I've been able to cut the run time down from 40 seconds to 1.75 seconds:

$ time egrep -i "storm|broadway|parkway center|chief financial" foo.csv|wc -l

108292

real    0m40.707s
user    0m40.137s
sys     0m0.309s

$ time LC_ALL=C egrep -i "storm|broadway|parkway center|chief financial" foo.csv|wc -l

108292

real    0m1.751s
user    0m1.590s
sys     0m0.140s

But I want the egrep real time to be less than half a second. Any tricks would be greatly appreciated. The file changes continuously, so I can't use any caching mechanism.

1 Answer

If you are just searching for keywords, you could use fgrep (or grep -F) instead of egrep:

LC_ALL=C grep -F -i -e storm -e broadway -e "parkway center" -e "chief financial"
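Plugged into the same pipeline as in the question (a sketch of the invocation only; the timing will depend on your machine), that would be:

$ time LC_ALL=C grep -F -i -e storm -e broadway -e "parkway center" -e "chief financial" foo.csv | wc -l

With -F each pattern is treated as a literal string, so no regular-expression engine is involved at all.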

The next thing to try would be factoring out -i, which is probably now the bottleneck. If you're sure that only the first letter might be capitalized, for example, you could do:

LC_ALL=C grep -F \
   -e{S,s}torm -e{B,b}roadway -e{P,p}"arkway "{C,c}enter -e{C,c}"hief "{F,f}inancial
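To double-check what grep actually receives, you can have the shell print the expanded arguments (this relies on bash/zsh brace expansion; the quoting keeps each two-word pattern as a single argument):

$ printf '%s\n' -e{S,s}torm -e{B,b}roadway -e{P,p}"arkway "{C,c}enter -e{C,c}"hief "{F,f}inancial
-eStorm
-estorm
-eBroadway
-ebroadway
-eParkway Center
-eParkway center
-eparkway Center
-eparkway center
-eChief Financial
-eChief financial
-echief Financial
-echief financial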