80

I'd like to know if there are any tips for making grep as fast as possible. I have a rather large base of text files to search through as quickly as possible. I've made them all lowercase so that I could get rid of the -i option; this makes the search much faster.

Also, I've found that the -F and -P modes are quicker than the default one. I use the former when the search string is not a regular expression (just plain text), and the latter when a regex is involved.
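For example (illustrative only; the patterns and file names are placeholders):

    # plain-text search: fixed strings, no regex engine overhead
    grep -F 'some literal phrase' *.txt

    # Perl-compatible regex search
    grep -P 'cat(alog|egory)' *.txt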

Does anyone have any experience in speeding up grep? Maybe compile it from scratch with some particular flag (I'm on Linux CentOS), organize the files in a certain fashion or maybe make the search parallel in some way?

– pistacchio
  • Is this always the same set of files? If you find yourself searching the same (large) set of files with `grep`, perhaps it's time to look for a solution to properly index them (the "best" solution will depend on what kind of files these are). – FatalError Jan 30 '12 at 15:54
  • Yes, it is the same set of files. Do you think that a full-text solution like Lucene would improve the performance? Generally it takes around 30-40 seconds to perform a search through 2500 files (each a literary book) for a total word count of around 250 million words. – pistacchio Jan 30 '12 at 16:02
  • Also, if a full-text solution is the right way to investigate, would you suggest any particular software? This is for a personal, non-profit experiment, so simple installation and free would be optimal. – pistacchio Jan 30 '12 at 16:05
  • `"...or maybe make the search parallel in some way?"` I'd be really excited to hear about this. `grep` should totally be able to operate in parallel, but I suspect the search may still be I/O bound. – Conrad.Dean Jan 30 '12 at 16:05
  • Lucene (and other search engines) is designed for this problem. Expect results in less than 1 sec, but the trade-off is the time spent learning how to set up and use the system. It's not as easy as grep ;-). Also consider something like `GNU parallel` with grep if you have multiple CPUs and/or machines to use. Good luck. – shellter Jan 30 '12 at 16:12
  • Have you tried using `ack-grep`? – meder omuraliev Jan 30 '12 at 16:27
  • It might be possible to optimize the regex itself, for instance by rewriting it to not need as much backtracking, or to not need the features of an NFA engine. Do you have specific regexes you are using, or do you just need a general speedup? – frankc Jan 30 '12 at 17:14
  • For your information, I'm experimenting with full-text search using Whoosh, a pure-Python solution (since the application that makes use of grep is in Python), and SQLite FTS4. – pistacchio Jan 30 '12 at 18:00
  • Use `ack-grep` or, better, Ag! http://geoff.greer.fm/2011/12/27/the-silver-searcher-better-than-ack/ – Nicholas Wilson Jan 21 '14 at 18:27
  • Also, `git grep` will be way faster (usually) than `grep` because it uses an index rather than hunting manually through all the files on each invocation. – Nicholas Wilson Jan 21 '14 at 18:27
  • Just `grep -l` for things you know are there at the tops of files. `grep -l` short-circuits as soon as it finds the match, so speed! ;) – kojiro Mar 26 '14 at 14:03

12 Answers

104

Try GNU parallel, which includes an example of how to use it with grep:

grep -r greps recursively through directories. On multicore CPUs GNU parallel can often speed this up.

find . -type f | parallel -k -j150% -n 1000 -m grep -H -n STRING {}

This will run 1.5 jobs per core and give 1000 arguments to grep.

For big files, it can split the input into several chunks with the --pipe and --block arguments:

 parallel --pipe --block 2M grep foo < bigfile

You could also run it on several different machines through SSH (ssh-agent needed to avoid passwords):

parallel --pipe --sshlogin server.example.com,server2.example.net grep foo < bigfile
– Chewie
  • How can you preserve `--color`? – redolent Nov 11 '13 at 22:22
  • Use `--color=always` to preserve the grep color (this is true whenever you are using grep in a pipe as well). – Jim Feb 21 '14 at 15:38
  • If `find` has the `-print0` predicate (most do), it would be preferable to use `find . -type f -print0 | parallel -0 -k …`. My instance of `man(1) parallel` actually says this. Also, I suspect with `globstar` you can make this even faster if you're after a particular file pattern: `shopt -s globstar; parallel -k -j150% -n 1000 -m fgrep -H -n STRING ::: **/*.c` – kojiro Mar 26 '14 at 13:27
  • @WilliamPursell it's a useful use of `cat` if you want `sudo` to access `bigfile`. – Jayen Mar 09 '15 at 07:00
  • Why do you set 1.5 jobs per core? Why not 1 job per core? – JohnGalt Apr 18 '16 at 10:21
  • @JohnGalt Often disk I/O will stall one of the processes. By starting a few more than there are cores, there will still be stuff to do for all the cores, even if a few of the jobs are waiting for data. Adjust the 150% to see what works best on your system. – Ole Tange Dec 15 '17 at 09:55
  • If you are searching the same files multiple times with different strings (or via xargs), put the search strings in a file first, then use `grep -f file.txt`. This saves grep from having to go through the same files for each search term. – user584583 Jan 17 '18 at 23:59
70

If you're searching very large files, then setting your locale can really help.

GNU grep goes a lot faster in the C locale than with UTF-8.

export LC_ALL=C
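A quick way to compare on your own data (big.txt is a placeholder file name; the exact speedup depends on the system, the pattern, and the file contents):

    time LC_ALL=en_US.UTF-8 grep -c 'foo' big.txt
    time LC_ALL=C grep -c 'foo' big.txt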
– daveb
  • Impressive: looks like this single line gives a 2X speedup. – Fedir RYKHTIK Jul 08 '13 at 13:41
  • Can someone explain why this is? – Robert E Mealey Dec 18 '14 at 21:12
  • "Simple byte comparison vs multiple-byte character comparison," says my boss... right right right – Robert E Mealey Dec 18 '14 at 21:19
  • So this isn't exactly safe, especially if you are pattern matching (as opposed to just string matching) or if the content of your file isn't ASCII. Still worth doing in some cases, but use caution. – Robert E Mealey Dec 18 '14 at 21:44
  • @RobertEMealey Did he say "Single" instead of "Simple"? – Elijah Lynn Jul 11 '17 at 01:33
  • Phenomenal improvement for me with LC_ALL=C. In my case I definitely did not need multi-byte searches nor regular expressions. I saw a 4X speed improvement, worth reworking my grep script from Windows to WSL bash and doing it from that environment to get that kind of improvement. My system is running on a Samsung 850 Pro. – Alz Jan 23 '18 at 22:16
12

Ripgrep claims to now be the fastest.

https://github.com/BurntSushi/ripgrep

It also includes parallelism by default:

 -j, --threads ARG
              The number of threads to use.  Defaults to the number of logical CPUs (capped at 6).  [default: 0]

From the README

It is built on top of Rust's regex engine. Rust's regex engine uses finite automata, SIMD and aggressive literal optimizations to make searching very fast.
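A minimal usage sketch (the pattern and thread count are placeholders):

    # recursive search of the current directory; rg parallelizes by default
    rg 'STRING' .

    # pin the number of worker threads explicitly
    rg -j 4 'STRING' .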

– rado
5

Apparently using --mmap can help on some systems (though newer GNU grep releases reportedly treat --mmap as a no-op):

http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html
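On systems where grep still honors the flag (older GNU grep builds), the invocation is simply (file name is a placeholder):

    grep --mmap 'pattern' large_file.txt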

– Sandro Pasquali
4

Not strictly a code improvement, but something I found helpful after running grep on 2+ million files.

I moved the operation onto a cheap SSD drive (120GB). At about $100, it's an affordable option if you are crunching lots of files regularly.

3

If you don't care about which files contain the string, you might want to separate reading and grepping into two jobs, since it might be costly to spawn grep many times – once for each small file.

  1. If you have one very large file:

    parallel -j100% --pipepart --block 100M -a <very large SEEKABLE file> grep <...>

  2. If you have many small compressed files (sorted by inode):

    ls -i | sort -n | cut -d' ' -f2 | fgrep \.gz | parallel -j80% --group "gzcat {}" | parallel -j50% --pipe --round-robin -u -N1000 grep <..>

I usually compress my files with lz4 for maximum throughput (a sketch of that variant follows this list).

  3. If you want just the filename with the match:

    ls -i | sort -n | cut -d' ' -f2 | fgrep \.gz | parallel -j100% --group "gzcat {} | grep -lq <..> && echo {}"
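For the lz4 variant mentioned above, the pipeline would look something like this (assuming the files end in .lz4 and the lz4 binary is installed; lz4 -dc decompresses to stdout):

    ls -i | sort -n | cut -d' ' -f2 | fgrep '.lz4' | parallel -j80% --group "lz4 -dc {}" | parallel -j50% --pipe --round-robin -u -N1000 grep <..>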

– Alex V
2

Building on the response by Sandro, I looked at the reference he provided and played around with BSD grep vs. GNU grep. My quick benchmark results showed that GNU grep is way, way faster.

So my recommendation for the original question ("fastest possible grep"): make sure you are using GNU grep rather than BSD grep (which is the default on macOS, for example).
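For example, one way to get GNU grep on macOS (assuming you use Homebrew; the formula installs the binaries with a "g" prefix):

    brew install grep        # installs GNU grep as "ggrep"
    ggrep --version          # should report "GNU grep"
    ggrep -F 'STRING' *.txt  # then use it exactly like grep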

– Chris
  • I'm seeing BSD grep run faster on my 13" MacBook Pro than on an 8GB, 6-core Linode while searching a 250 MB .sql dump file: 6 s vs. 25 s. – anthumchris Feb 25 '15 at 19:25
2

I personally use ag (the silver searcher) instead of grep, and it's way faster; you can also combine it with parallel and pipe blocks.

https://github.com/ggreer/the_silver_searcher
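A rough sketch of combining ag with GNU parallel on a single big file, along the lines of the parallel answer above (the pattern and file name are placeholders; ag reads from stdin when piped):

    parallel --pipe --block 2M ag 'foo' < bigfile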

Update: I now use https://github.com/BurntSushi/ripgrep which is faster than ag depending on your use case.

– Jinxmcg
  • I found a bug in this: sometimes it does not go deep in the tree, and I have cases where grep shows the result but ag does not. I can't compromise on accuracy for speed. – username_4567 May 25 '16 at 10:04
  • You should open an issue on their GitHub account and report it (I would do that but I can't replicate it), as I have not found any inaccuracies so far. For sure they will sort this out, and yes, you are right, I totally agree: accuracy first. – Jinxmcg May 25 '16 at 10:09
1

One thing I've found that makes grep faster for searching a single big file (especially with changing patterns) is to use split + grep + xargs with its parallel flag. For instance:

Suppose you have a file of IDs you want to search for, named my_ids.txt, and a big file named bigfile.txt.

Use split to split the file into parts:

# Use split to split the file into x number of files; consider your big file
# size and try to stay under 26 split files to keep the filenames from split
# simple (xa[a-z]). In my example I have 10 million rows in bigfile.
split -l 1000000 bigfile.txt
# Produces output files named xa[a-j]

# Now use the split files + xargs to iterate and launch parallel greps, appending the output
for id in $(cat my_ids.txt) ; do ls xa* | xargs -n 1 -P 20 grep "$id" >> matches.txt ; done
# Here you can tune your parallel greps with -P; in my case I am being greedy.
# Also be aware that there's no point in allocating more greps than split files.

In my case this cut what would have been a 17-hour job down to 1 hour 20 minutes. I'm sure there's some sort of bell curve here on efficiency, and obviously going over the available cores won't do you any good, but this was a much better solution than any of the above suggestions for my requirements as stated above. It has the added benefit over GNU parallel of using mostly native (Linux) tools.

0

cgrep, if it's available, can be orders of magnitude faster than grep.

– xhtml
0

MCE 1.508 includes a dual chunk-level {file, list} wrapper script supporting many C binaries: agrep, grep, egrep, fgrep, and tre-agrep.

https://metacpan.org/source/MARIOROY/MCE-1.509/bin/mce_grep

https://metacpan.org/release/MCE

One does not need to convert to lowercase when wanting -i to run fast. Simply pass --lang=C to mce_grep.
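A sketch of what that looks like, based on the description above (the exact invocation may differ; see the mce_grep source linked above):

    mce_grep --lang=C -i 'pattern' /path/to/files/*.txt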

Output order is preserved. The -n and -b output is also correct. Unfortunately, that is not the case for the GNU parallel approach mentioned on this page; I was really hoping for GNU Parallel to work here. In addition, mce_grep does not spawn a sub-shell (sh -c /path/to/grep) when calling the binary.

Another alternative is the MCE::Grep module included with MCE.

– Mario Roy
0

A slight deviation from the original topic: the indexed-search command-line utilities from the Google Code Search project (https://github.com/google/codesearch) are way faster than grep.

Once you compile it (the Go toolchain is needed), you can index a folder with:

# index current folder
cindex .

The index will be created under ~/.csearchindex

Now you can search:

# search folders previously indexed with cindex
csearch eggs

I'm still piping the results through grep to get colorized matches.
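For instance, something along these lines:

    csearch eggs | grep --color=always eggs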

– ccpizza