137

I am really amazed by the functionality of GREP in shell, earlier I used to use substring method in java but now I use GREP for it and it executes in a matter of seconds, it is blazingly faster than java code that I used to write.(according to my experience I might be wrong though)

That being said I have not been able to figure out how it is happening? there is also not much available on the web.

Can anyone help me with this?

Max
  • 9,100
  • 25
  • 72
  • 109
  • 9
    It is open source so you can have a look for yourself. http://www.gnu.org/software/grep/devel.html – driis Sep 27 '12 at 20:47
  • @WilliamPursell When the execution time goes in the seconds, the JIT has probably warmed up and the mind-numbing difference is due to (1) grep being incredibly smart about what it does and (2) the Java code making a pretty bad algorithm choice for the specific problem grep focuses on. –  Sep 27 '12 at 20:51
  • 3
    How much time does your Java implementation spend starting up the JVM, and how much time does it spend actually executing your code? Or it might be a matter of the algorithm you used in your Java code; an O(N^2) algorithm is likely to be slow in any language. – Keith Thompson Sep 27 '12 at 20:51
  • 7
    Ridiculous Fish has a great writeup answering exactly your question: http://ridiculousfish.com/blog/posts/old-age-and-treachery.html – David Wolever Sep 27 '12 at 20:48

2 Answers2

205

Assuming your question regards GNU grep specifically. Here's a note from the author, Mike Haertel:

GNU grep is fast because it AVOIDS LOOKING AT EVERY INPUT BYTE.

GNU grep is fast because it EXECUTES VERY FEW INSTRUCTIONS FOR EACH BYTE that it does look at.

GNU grep uses the well-known Boyer-Moore algorithm, which looks first for the final letter of the target string, and uses a lookup table to tell it how far ahead it can skip in the input whenever it finds a non-matching character.

GNU grep also unrolls the inner loop of Boyer-Moore, and sets up the Boyer-Moore delta table entries in such a way that it doesn't need to do the loop exit test at every unrolled step. The result of this is that, in the limit, GNU grep averages fewer than 3 x86 instructions executed for each input byte it actually looks at (and it skips many bytes entirely).

GNU grep uses raw Unix input system calls and avoids copying data after reading it. Moreover, GNU grep AVOIDS BREAKING THE INPUT INTO LINES. Looking for newlines would slow grep down by a factor of several times, because to find the newlines it would have to look at every byte!

So instead of using line-oriented input, GNU grep reads raw data into a large buffer, searches the buffer using Boyer-Moore, and only when it finds a match does it go and look for the bounding newlines (Certain command line options like -n disable this optimization.)

This answer is a subset of the information taken from here.

unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
Steve
  • 51,466
  • 13
  • 89
  • 103
51

To add to Steve's excellent answer.

It may not be widely known but grep is almost always faster when grepping for a longer pattern-string than a short one, because in a longer pattern, Boyer-Moore can skip forward in longer strides to achieve even better sublinear speeds:

Example:

# after running these twice to ensure apples-to-apples comparison
# (everything is in the buffer cache) 

$ time grep -c 'tg=f_c' 20140910.log
28
0.168u 0.068s 0:00.26

$ time grep -c ' /cc/merchant.json tg=f_c' 20140910.log
28
0.100u 0.056s 0:00.17

The longer form is 35% faster!

How come? Boyer-Moore consructs a skip-forward table from the pattern-string, and whenever there's a mismatch, it picks the longest skip possible (from last char to first) before comparing a single char in the input to the char in the skip table.

Here's a video explaining Boyer Moore (Credit to kommradHomer)

Another common misconception (for GNU grep) is that fgrep is faster than grep. f in fgrep doesn't stand for 'fast', it stands for 'fixed' (see the man page), and since both are the same program, and both use Boyer-Moore, there's no difference in speed between them when searching for fixed-strings without regexp special chars. The only reason I use fgrep is when there's a regexp special char (like ., [], or *) I don't want it to be interpreted as such. And even then the more portable/standard form of grep -F is preferred over fgrep.

arielf
  • 5,802
  • 1
  • 36
  • 48
  • 3
    It's intuitive that longer patterns are faster. If the pattern was one byte then grep would have to check every byte. If the pattern is 4-bytes then it could makes 4-byte skips. If the pattern was as long as text then grep would only do one step. – noel Nov 04 '14 at 09:51
  • 15
    Yes, it is intuitive -- if you understand how Boyer-Moore works. – arielf Nov 04 '14 at 22:17
  • 3
    Even otherwise it's intuitive. It would be easier to find a long needle in a haystack than a shorter one – RajatJ Nov 02 '17 at 08:09
  • 1
    Invalid analogy. Grep needs to compare _all_ the characters of a string, to verify a full match, unlike the 'long needle' case where it is enough to find any tiny piece of metal in the haystack. The _skips_ and the starting from the _end_, are the critical and non-intuitive points. – arielf Nov 05 '17 at 22:15
  • 2
    The counter example to "being faster when longer" is cases where you have to do a lot of tests before you fail, and you can't move ahead anyway. Say the file `xs.txt` contains 100000000 'x's, and you do `grep yx xs.txt`, then it actually fails to find a match sooner than if you do `grep yxxxxxxxxxxxxxxxxxxx xs.txt`. The Boyer-Moore-Horspool improvement to Boyer-Moore improves on the skip-ahead in that case, but it's probably not going to be only three machine instructions in the general case. – lrn Dec 21 '17 at 10:41
  • 1
    the "good" video you have linked is unwatchable for anyone with functional ears . here is a better one https://www.youtube.com/watch?v=4Xyhb72LCX4 – kommradHomer Oct 13 '18 at 10:48
  • Thanks @kommradHomer - replaced video link and dropped the 'good' qualifier. – arielf Jan 23 '19 at 22:30
  • @arielf kudos for taking the time to do so on an old answer :) – kommradHomer Jan 24 '19 at 07:31
  • At my side, `grep -F string` is a bit faster than `grep string`, which is faster than `grep [s]tring` which still is faster than `fgrep string`. The latter looks puzzling, until you realize, that `/bin/fgrep` is a shell script just running `exec grep -F "$@"`, thus does one more `exec()` (Ubuntu 18.04 LTS) – Tino May 21 '19 at 17:21
  • 2
    @Tino thanks. Yes, it seems that the days of (GNU) `grep/fgrep/egrep` being all hardlinks to the same executable are gone. They (and other extensions like the `z*grep` `bz*grep` utils which decompress on the fly), are now small shell-wrappers around `grep`. Some interesting historical comments on the switch between a single executable & shell wrappers can be found in this commit: https://git.savannah.gnu.org/cgit/grep.git/commit/?id=b639643840ef506594b6c46e5b24d9980a33e78e – arielf May 28 '19 at 06:48