
I have huge log files, and I am trying to "filter" them according to their line prefixes. Using grep is really fast, but not fast enough; typical results:

$ time grep "E ::" app.log

real    0m11.159s
user    0m10.081s
sys     0m1.040s

I thought I might save grep some effort by telling it that E :: is actually a prefix, that is, it appears at the beginning of the line. I believed this would let grep skip searching for it along the long lines in my log file. However, it doesn't seem to do much:

$ time grep "^E ::" app.log

real    0m11.152s
user    0m10.229s
sys     0m0.884s

Grepping ^E is about 15% faster.

Do you have any idea why? Can you think of a faster way to filter these 9GB log files according to the first char in each line?

– Bach

3 Answers


You can try GNU parallel, e.g.

cat app.log | parallel --pipe grep '^E ::'

See the GNU parallel documentation for examples of how to tweak this (how many jobs to run, how big the chunks the input file is split into, etc.).
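As a rough sketch (flag values are assumptions to tune for your machine, not measured settings), -j sets the number of parallel jobs and --block the size of each chunk handed to a grep instance:

< app.log parallel --pipe --block 10M -j 4 --keep-order grep '^E ::'

--keep-order (-k) returns output in input order at the cost of some buffering; drop it if line order doesn't matter to you.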

– Adrian Frühwirth

Try this:

LC_ALL=C fgrep "E ::" app.log
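One caveat: fgrep matches fixed strings, so a leading ^ would be searched for literally rather than treated as an anchor. If you still want the start-of-line anchor, a minimal variant (assuming GNU grep) keeps plain grep and just forces the C locale:

LC_ALL=C grep "^E ::" app.log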
– Mark Setchell
  • Doesn't change much. However, if I try it with `^E`, it becomes 50% slower. – Bach Mar 11 '14 at 11:06
  • 1
    I had to check it, so for other people reading: [What does “LC_ALL=C” do?](http://unix.stackexchange.com/q/87745/40596) – fedorqui Mar 11 '14 at 12:16
  • @fedorqui +1 for your community spirit. I should have explained it myself. It disables NLS so that `grep` can make assumptions about the type of data that it is looking at (e.g. ASCII, single byte etc) and thereby hopefully go faster. – Mark Setchell Mar 11 '14 at 12:34

Try this:

[honeypot]# (time ls) 1> /dev/null 2> output
[honeypot]# cat output

real    0m0.020s
user    0m0.001s
sys    0m0.006s
– Adrian