152
 grep -i -A 5 -B 5 'db_pd.Clients'  eightygigsfile.sql

This has been running for an hour on a fairly powerful Linux server which is otherwise not overloaded. Any alternative to grep? Anything about my syntax that can be improved (is egrep or fgrep better)?

The file is actually in a directory that is shared via a mount with another server, but the actual disk space is local, so that shouldn't make any difference, should it?

The grep is using up to 93% CPU.

zzapper
  • 9
    Depending on your locale, the `-i` switch may slow the process down, try without `-i` or with `LC_ALL=C grep ...`. Also, if you're only grepping for a fixed string, use `grep -F`. – Thor Dec 17 '12 at 11:19
  • 8
    As @dogbane mentioned, using the **LC_ALL=C** variable along with **fgrep** can speed up your search. I did some testing and was able to achieve a **1400%** performance increase, and wrote up a detailed article explaining why in my [speed up grep](http://www.inmotionhosting.com/support/website/how-to/speed-up-grep-searches-with-lc-all) post – JacobN Aug 23 '13 at 17:57
  • 1
    I'm curious - what file is 80GB in size? I'd like to think that when a file gets that big, there may be a better storage strategy (e.g. rotating log files, or categorizing hierarchically into different files and folders). Also, if the changes only occur in certain places of the file (e.g. at the end), then just store some grep results from the earlier section that doesn't change and instead of grepping the original file, grep the stored result file. – Sridhar Sarnobat Nov 14 '16 at 20:03
  • I settled on https://github.com/google/codesearch — both indexing and searching are lightning fast (written in Go). `cindex .` to index your current folder, then `csearch db_pd.Clients`. – ccpizza Oct 28 '17 at 02:06
  • 1
    If your file were indexed or sorted, this could be made **vastly** faster. Searching every line is O(n) by definition, whereas a sorted file can be seeked by bisecting it -- at which point you'd be talking under a second to search your 80gb (hence why a 80gb indexed database takes no time at all for a simple SELECT, whereas your grep takes... well, as long as it takes). – Charles Duffy Jan 18 '18 at 03:23
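
As a rough illustration of that last comment: on a sorted copy of the file, the look utility performs exactly this kind of bisection. Note that look only matches lines that begin with the given string, so it would not directly handle a mid-line substring like the one in the question; the sketch is illustrative only.

sort eightygigsfile.sql > sorted.sql     # one-time cost to sort the file
look 'db_pd.Clients' sorted.sql          # each prefix lookup is then a binary search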

8 Answers

198

Here are a few options:

1) Prefix your grep command with LC_ALL=C to use the C locale instead of UTF-8.

2) Use fgrep because you're searching for a fixed string, not a regular expression.

3) Remove the -i option, if you don't need it.

So your command becomes:

LC_ALL=C fgrep -A 5 -B 5 'db_pd.Clients' eightygigsfile.sql

It will also be faster if you copy your file to a RAM disk.
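
If you do try the RAM-disk route, a minimal sketch (assuming a tmpfs mount such as /dev/shm with enough free memory, which an 80 GB file will usually exceed, so this only pays off for files that fit in RAM):

cp eightygigsfile.sql /dev/shm/                                        # only if the tmpfs can hold the file
LC_ALL=C fgrep -A 5 -B 5 'db_pd.Clients' /dev/shm/eightygigsfile.sql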

dogbane
  • My grep has finally returned a result. I will try your grep suggestions and report the result. First, however, I must cut everything up to db_pd.Clients (an illegal MySQL table name): sigh! – zzapper Dec 17 '12 at 11:33
  • 9
    that was MUCH quicker, by an order of magnitude. Thanks! BTW I added -n to get the line numbers. Also maybe a -m to exit after the match – zzapper Dec 17 '12 at 12:55
  • 7
    Wow thanks so much @dogbane great tip! This led me down a research tunnel to find out [why LC_ALL=C speeds up grep](http://www.inmotionhosting.com/support/website/how-to/speed-up-grep-searches-with-lc-all) and it was a very enlightening experience! – JacobN Aug 23 '13 at 18:06
  • weird that nobody has mentioned the --mmap flag. – sw. Mar 27 '14 at 19:20
  • 11
    Some people (not me) like `grep -F` more than `fgrep` – Walter Tross Jun 18 '14 at 09:21
  • 2
    My understanding is that `LANG=C` (instead of `LC_ALL=C`) is enough, and is easier to type. – Walter Tross Jun 18 '14 at 11:46
  • 1
    @WalterTross what's the diff between `grep` and `fgrep` ? – Bob Jun 07 '16 at 15:38
  • 3
    @Adrian `fgrep` is another way to write `grep -F`, as `man fgrep` will tell you. Some versions of the `man` also say that the former is deprecated for the latter, but the shorter form is too convenient to die. – Walter Tross Jun 07 '16 at 16:20
  • 1
    Why doesnt `LC_ALL=C` help for `bzgrep` then? – Emma He Jun 07 '17 at 02:05
  • 1
    Doesn't seem to help `zgrep` either. Also what's a RAM disk, and how do I copy a file to it and run `zgrep` over it? Any pointers? – amit_saxena Sep 04 '18 at 13:24
  • I tested fgrep on a 1.5 GB file, and to my surprise it took twice as long as a normal grep. LANG=C did not seem to make a significant difference. Is this answer still valid, or has grep been optimised since this answer was written? – Onnonymous Dec 05 '18 at 10:34
  • Setting LC_ALL to C went from taking ~5 minutes to ~2 seconds for me piping from `cut` | `grep` (with Python regex) | `sed` (regex) | `sort` | `uniq -c`. That's crazy. – scorgn Oct 06 '21 at 18:23
  • @zzapper [`-n` will significantly slow down the search](https://stackoverflow.com/a/12630617/995714). However in this case if you only need to find the first line then it may be ok – phuclv Jun 11 '23 at 06:03
45

If you have a multicore CPU, I would really recommend GNU parallel. To grep a big file in parallel use:

< eightygigsfile.sql parallel --pipe grep -i -C 5 'db_pd.Clients'

Depending on your disks and CPUs it may be faster to read larger blocks:

< eightygigsfile.sql parallel --pipe --block 10M grep -i -C 5 'db_pd.Clients'

It's not entirely clear from your question, but other options for grep include (see the combined sketch after this list):

  • Dropping the -i flag.
  • Using the -F flag for a fixed string
  • Disabling NLS with LANG=C
  • Setting a max number of matches with the -m flag.
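
A hedged sketch combining those options with the parallel invocation above (assuming case no longer matters and the pattern is wanted as a fixed string):

< eightygigsfile.sql LANG=C parallel --pipe --block 10M grep -F -C 5 'db_pd.Clients'
# -m N could be added as well, but with --pipe it limits matches per block, not overall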
Steve
  • 3
    If it is an actual file, use `--pipepart` instead of `--pipe`. It is much faster. – Ole Tange Jul 05 '16 at 06:51
  • This usage does not support patterns containing spaces; in that case you need to quote the whole command, like this: parallel --pipe --block 10M "/usr/bin/grep -F -C5 -e 'Animal Care & Pets'" – zw963 Jun 14 '17 at 07:40
  • What does it mean the `<` character preceding the parallel command? – elcortegano Oct 21 '19 at 15:22
  • 1
    @elcortegano: That's what's called [I/O redirection](https://www.tldp.org/LDP/abs/html/io-redirection.html). Basically, it reads input from the following filename. Similar to `cat file.sql | parallel ...` but avoids a [UUOC](https://stackoverflow.com/questions/11710552/useless-use-of-cat). GNU parallel also has a way to read input from a file using `parallel ... :::: file.sql`. HTH. – Steve Oct 21 '19 at 21:26
  • What if I wanna grep the whole file directory? – Yan Yang Apr 09 '21 at 11:19
  • @YanYang, `find /path/to/dir -type f | parallel grep "foo"` – Steve Apr 09 '21 at 13:12
10

Some trivial improvements:

  • Remove the -i option, if you can; case-insensitive matching is quite slow.

  • Replace the . with \.

    A single dot is the regex symbol that matches any character, which is also slow.
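
Applied to the command in the question, those two changes would give, for example:

grep -A 5 -B 5 'db_pd\.Clients' eightygigsfile.sql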

BeniBela
  • instead of replacing `.` with `\.`, changing from a regex match to a fixed-string match with `-F` will be significantly better – phuclv Jun 11 '23 at 06:04
3

Two lines of attack:

  • Are you sure you need the -i, or is there a way to get rid of it?
  • Do you have more cores to play with? grep is single-threaded, so you might want to start several instances of it at different offsets, as sketched below.
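
A hedged sketch of that multi-process idea without GNU parallel: split the file into line-aligned chunks first (this costs one extra pass over the data plus temporary disk space), then run one grep per chunk in the background:

mkdir -p /tmp/sqlchunks                                    # hypothetical working directory
split -n l/4 eightygigsfile.sql /tmp/sqlchunks/part_       # 4 chunks, split on line boundaries
for f in /tmp/sqlchunks/part_*; do
    grep -A 5 -B 5 'db_pd.Clients' "$f" &
done
wait
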
Eugen Rieck
3

Try ripgrep

In my experience it is dramatically faster than grep.

For example, on a live test (an 11 GB mailbox archive):

rg (ripgrep)

time rg -c "^From " ~/Documents/archive.mbox
99176
rg -c "^From " ~/Documents/archive.mbox  
1.38s user 5.24s system 62% cpu 10.681 total

vs grep

time grep -c "^From " ~/Documents/archive.mbox
99176
grep --color=auto --exclude-dir={.bzr,CVS,.git,.hg,.svn,.idea,.tox} -c    
125.56s user 6.61s system 98% cpu 2:13.56 total

Note that I've had rg results better than 10 seconds (6 seconds best so far) for the same 11 GB file. grep consistently takes more than 2 minutes.
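
Applied to the question's file, the equivalent ripgrep invocation would be something along these lines (assuming ripgrep is installed; -F treats the pattern as a literal string):

rg -F -C 5 'db_pd.Clients' eightygigsfile.sql    # add -i only if case-insensitivity is really needed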

ocodo
Shailesh
1
< eightygigsfile.sql parallel -k -j120% -n10 -m grep -F -i -C 5 'db_pd.Clients'  

If you need to search for multiple strings, grep -f strings.txt saves a ton of time. The above is a translation of something that I am currently testing. The -j and -n option values seemed to work best for my use case. The -F grep also made a big difference.
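
A hedged example of that multi-pattern variant, where strings.txt is a hypothetical file containing one fixed search string per line:

LC_ALL=C grep -F -f strings.txt -C 5 eightygigsfile.sql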

user584583
0

All the above answers were great. What really helped me on my 111 GB file was using LC_ALL=C fgrep -m <maxnum> fixed_string filename.

However, sometimes a pattern may repeat zero or more times, in which case calculating the maxnum isn't possible. The workaround is to use start and end patterns for the event(s) you are trying to process, and then work on the line numbers between them, like so:

startline=$(grep -n -m 1 "$start_pattern" file | awk -F":" '{print $1}')
endline=$(grep -n -m 1 "$end_pattern" file | awk -F":" '{print $1}')
logs=$(tail -n +"$startline" file | head -n $((endline - startline + 1)))

Then work on this subset of logs!
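
For instance, the extracted slice can then be searched on its own (a hypothetical follow-up, reusing the $logs variable from above):

printf '%s\n' "$logs" | LC_ALL=C grep -F 'db_pd.Clients'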

Smita
0

Hmm, what speeds do you need? I created a synthetic 77.6 GB file with nearly 525 million rows and plenty of Unicode:

rows = 524759550. | UTF8 chars = 54008311367. | bytes = 83332269969.

and randomly selected rows at an average rate of 1 in 3^5 (using rand(), not just NR % 243) in which to place the string db_pd.Clients at a random position in the middle of the existing text, for a total of 2.16 million rows where the regex pattern hits:

rows       = 2160088. | UTF8 chars = 42286394. | bytes = 42286394.


% dtp;  pvE0 < testfile_gigantic_001.txt| 
        mawk2 '
        _^(_<_)<NF { print (__=NR-(_+=(_^=_<_)+(++_)))<!_\
                           ?_~_:__,++__+_+_ }' FS='db_pd[.]Clients' OFS=','     

  in0: 77.6GiB 0:00:59 [1.31GiB/s] [1.31GiB/s] [===>] 100%            
 out9: 40.3MiB 0:00:59 [ 699KiB/s] [ 699KiB/s] [ <=> ]
  
524755459,524755470
524756132,524756143
524756326,524756337
524756548,524756559
524756782,524756793
524756998,524757009
524757361,524757372

And mawk2 took just 59 seconds to extract a list of the row ranges it needs. From there it should be relatively trivial; some ranges may overlap.

At a throughput of 1.3 GiB/s, as calculated by pv above, it might even be detrimental to use utilities like parallel to split the task.
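
For reference, a plainer (if likely slower) sketch of the same idea in ordinary awk, printing a row-number window of 5 lines either side of every line containing the literal string:

awk 'index($0, "db_pd.Clients") { s = NR - 5; if (s < 1) s = 1; print s "," NR + 5 }' eightygigsfile.sql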

RARE Kpop Manifesto