Grepping a huge file (80GB) any way to speed it up?

Question

 grep -i -A 5 -B 5 'db_pd.Clients'  eightygigsfile.sql

This has been running for an hour on a fairly powerful linux server which is otherwise not overloaded. Any alternative to grep? Anything about my syntax that can be improved, (egrep,fgrep better?)

The file is actually in a directory which is shared with a mount to another server but the actual diskspace is local so that shouldn't make any difference?

the grep is grabbing up to 93% CPU

Depending on your locale, the `-i` switch may slow the process down, try without `-i` or with `LC_ALL=C grep ...`. Also, if you're only grepping for a fixed string, use `grep -F`. — Thor, Dec 17 '12 at 11:19
As @dogbane mentioned, using the **LC_ALL=C** variable along with **fgrep** can speed up your search. I did some testing and was able to achieve a **1400%** performance increase, and wrote up a detailed article on why this is in my [speed up grep](http://www.inmotionhosting.com/support/website/how-to/speed-up-grep-searches-with-lc-all) post – JacobN, Aug 23 '13 at 17:57
I'm curious - what file is 80GB in size? I'd like to think that when a file gets that big, there may be a better storage strategy (e.g. rotating log files, or categorizing hierarchically into different files and folders). Also, if the changes only occur in certain places of the file (e.g. at the end), then just store some grep results from the earlier section that doesn't change and instead of grepping the original file, grep the stored result file. — Sridhar Sarnobat, Nov 14 '16 at 20:03
I settled on https://github.com/google/codesearch — both indexing and searching are lightning fast (written in Go). `cindex .` to index your current folder, then `csearch db_pd.Clients`. — ccpizza, Oct 28 '17 at 02:06
If your file were indexed or sorted, this could be made **vastly** faster. Searching every line is O(n) by definition, whereas a sorted file can be seeked by bisecting it -- at which point you'd be talking under a second to search your 80gb (hence why a 80gb indexed database takes no time at all for a simple SELECT, whereas your grep takes... well, as long as it takes). — Charles Duffy, Jan 18 '18 at 03:23

score 198 · Accepted Answer · answered Dec 17 '12 at 11:25

Here are a few options:

1) Prefix your grep command with LC_ALL=C to use the C locale instead of UTF-8.

2) Use fgrep because you're searching for a fixed string, not a regular expression.

3) Remove the -i option, if you don't need it.

So your command becomes:

LC_ALL=C fgrep -A 5 -B 5 'db_pd.Clients' eightygigsfile.sql
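Dropping -i and switching to a fixed string changes what counts as a match, so it may be worth checking both forms on a leading sample of the file before committing to a full run. This is only a sketch: the function name `same_matches` and the 100 MB default sample size are illustrative assumptions, and it compares match counts, not the matched lines themselves.

```shell
# Compare the original and the faster invocation on the first part of a file.
# same_matches and the default sample size are illustrative, not from the answer.
same_matches() {
    file=$1 pattern=$2 bytes=${3:-100000000}
    # Original style: regex match, case-insensitive, current locale.
    old=$(head -c "$bytes" "$file" | grep -i -c -- "$pattern")
    # Suggested style: fixed string, case-sensitive, C locale.
    new=$(head -c "$bytes" "$file" | LC_ALL=C grep -F -c -- "$pattern")
    echo "regex/-i: $old   fixed/C: $new"
    [ "$old" -eq "$new" ]
}

# Tiny demo file standing in for the real dump.
sample=$(mktemp)
printf '%s\n' 'INSERT INTO db_pd.Clients VALUES (1);' \
              'insert into DB_PD.CLIENTS values (2);' > "$sample"
same_matches "$sample" 'db_pd.Clients' ||
    echo "counts differ: -i or regex semantics matter for this data"
rm -f -- "$sample"
```

If the counts agree on the sample, the faster form is probably safe for the whole file; if they differ, the data contains case variants or characters that only the regex dot matched.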

It will also be faster if you copy your file to RAM disk.
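A rough sketch of the RAM disk idea, assuming a Linux system where /dev/shm is a tmpfs mount (common, but not guaranteed) and enough free memory to hold the whole file. The helper name `stage_in_ram` is made up for illustration. Copying reads the entire file once, so this should only pay off when the same file is searched several times.

```shell
# Copy a file onto a RAM-backed filesystem and print the path of the copy.
# Assumption: /dev/shm is tmpfs; pass another directory as $2 if it is not.
stage_in_ram() {
    file=$1 ramdir=${2:-/dev/shm}
    copy=$(mktemp "$ramdir/grep-copy.XXXXXX") || return 1
    if cp -- "$file" "$copy"; then
        printf '%s\n' "$copy"
    else
        rm -f -- "$copy"
        return 1
    fi
}

# Intended usage (not run here because it needs the real file):
#   copy=$(stage_in_ram eightygigsfile.sql) &&
#       LC_ALL=C grep -F -A 5 -B 5 'db_pd.Clients' "$copy"
#   rm -f -- "$copy"    # free the memory when done
```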

answered Dec 17 '12 at 11:25

dogbane

My grep has finally returned a result. Will try your grep suggestions and report the result. Must first, however, cut everything up to db_pd.Clients (illegal mysql table name): sigh! – zzapper Dec 17 '12 at 11:33
9

that was MUCH quicker by an order of magnitude thanks. BTW I added -n to get the line numbers. Also maybe a -m to exit after match – zzapper Dec 17 '12 at 12:55
7

Wow thanks so much @dogbane great tip! This led me down a research tunnel to find out [why LC_ALL=C speeds up grep](http://www.inmotionhosting.com/support/website/how-to/speed-up-grep-searches-with-lc-all) and it was a very enlightening experience! – JacobN Aug 23 '13 at 18:06
Weird that nobody has mentioned the --mmap flag. – sw. Mar 27 '14 at 19:20
11

Some people (not me) like `grep -F` more than `fgrep` – Walter Tross Jun 18 '14 at 09:21
2

My understanding is that `LANG=C` (instead of `LC_ALL=C`) is enough, and is easier to type. – Walter Tross Jun 18 '14 at 11:46
1

@WalterTross what's the diff between `grep` and `fgrep` ? – Bob Jun 07 '16 at 15:38
3

@Adrian `fgrep` is another way to write `grep -F`, as `man fgrep` will tell you. Some versions of the `man` also say that the former is deprecated for the latter, but the shorter form is too convenient to die. – Walter Tross Jun 07 '16 at 16:20
1

Why doesn't `LC_ALL=C` help for `bzgrep` then? – Emma He Jun 07 '17 at 02:05
1

Doesn't seem to help `zgrep` either. Also what's a RAM disk, and how do I copy a file to it and run `zgrep` over it? Any pointers? – amit_saxena Sep 04 '18 at 13:24
I tested fgrep on a 1.5 GB file, and to my surprise it took twice as long as a normal grep. LANG=C did not seem to make a significant difference. Is this answer still valid, or has grep been optimised since this answer? – Onnonymous Dec 05 '18 at 10:34
Setting LC_ALL to C went from taking ~5 minutes to ~2 seconds for me piping from `cut` | `grep` (with Python regex) | `sed` (regex) | `sort` | `uniq -c`. That's crazy. – scorgn Oct 06 '21 at 18:23
@zzapper [`-n` will significantly slow down the search](https://stackoverflow.com/a/12630617/995714). However in this case if you only need to find the first line then it may be ok – phuclv Jun 11 '23 at 06:03

score 45 · Answer 2 · answered Dec 17 '12 at 12:49

If you have a multicore CPU, I would really recommend GNU parallel. To grep a big file in parallel use:

< eightygigsfile.sql parallel --pipe grep -i -C 5 'db_pd.Clients'

Depending on your disks and CPUs it may be faster to read larger blocks:

< eightygigsfile.sql parallel --pipe --block 10M grep -i -C 5 'db_pd.Clients'

It's not entirely clear from your question, but other options for grep include:

Dropping the -i flag.
Using the -F flag for a fixed string
Disabling NLS with LANG=C
Setting a max number of matches with the -m flag.
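As a small illustration of the -m option from the list above: when only the first occurrence matters, -m 1 lets grep stop reading as soon as it finds a hit, which can skip most of a large file. The helper name `first_match` and the demo data are invented for this sketch; -m is supported by GNU and BSD grep but is not part of POSIX.

```shell
# Print only the first matching line, with its line number, then stop reading.
first_match() {
    # $1 = file, $2 = fixed string
    LC_ALL=C grep -F -n -m 1 -- "$2" "$1"
}

demo=$(mktemp)
printf '%s\n' 'header' \
              'CREATE TABLE db_pd.Clients (' \
              'INSERT INTO db_pd.Clients VALUES (1);' > "$demo"
first_match "$demo" 'db_pd.Clients'    # prints: 2:CREATE TABLE db_pd.Clients (
rm -f -- "$demo"
```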

answered Dec 17 '12 at 12:49

Steve

3

If it is an actual file, use `--pipepart` instead of `--pipe`. It is much faster. – Ole Tange Jul 05 '16 at 06:51
This usage does not support patterns that include spaces; we need to quote the whole command like this: parallel --pipe --block 10M "/usr/bin/grep -F -C5 -e 'Animal Care & Pets'" – zw963 Jun 14 '17 at 07:40
What does the `<` character preceding the parallel command mean? – elcortegano Oct 21 '19 at 15:22
1

@elcortegano: That's what's called [I/O redirection](https://www.tldp.org/LDP/abs/html/io-redirection.html). Basically, it reads input from the following filename. Similar to `cat file.sql | parallel ...` but avoids a [UUOC](https://stackoverflow.com/questions/11710552/useless-use-of-cat). GNU parallel also has a way to read input from a file using `parallel ... :::: file.sql`. HTH. – Steve Oct 21 '19 at 21:26
What if I wanna grep the whole file directory? – Yan Yang Apr 09 '21 at 11:19
@YanYang, `find /path/to/dir -type f | parallel grep "foo"` – Steve Apr 09 '21 at 13:12

score 10 · Answer 3 · answered Dec 17 '12 at 11:19

Some trivial improvements:

Remove the -i option if you can; case-insensitive matching is quite slow.
Replace the . by \.

A single dot is the regex symbol that matches any character, which is also slow.
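To see the difference the dot makes, here is a tiny demonstration on two made-up lines. The unescaped pattern also matches a line where some other character sits in the dot's position, while the escaped and the fixed-string forms match only the literal name.

```shell
# Two sample lines: one with a literal dot, one with a different character there.
sample=$(printf '%s\n' 'db_pd.Clients' 'db_pdXClients')

printf '%s\n' "$sample" | grep -c 'db_pd.Clients'       # 2: the dot matches any character
printf '%s\n' "$sample" | grep -c 'db_pd\.Clients'      # 1: escaped dot is literal
printf '%s\n' "$sample" | grep -F -c 'db_pd.Clients'    # 1: fixed string, no regex at all
```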

answered Dec 17 '12 at 11:19

BeniBela

instead of replacing `.` by `\.` changing from a regex match to a fixed string match with `-F` will be significantly better – phuclv Jun 11 '23 at 06:04

score 3 · Answer 4 · answered Dec 17 '12 at 11:18

Two lines of attack:

Are you sure you need the -i, or is there a way to get rid of it?
Do you have more cores to play with? grep is single-threaded, so you might want to start more of them at different offsets.
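One way to sketch the "several greps at different offsets" idea without extra tools is to let GNU coreutils `split -n l/K/N` hand each grep its own slice of the file, cut at line boundaries. This assumes GNU split (the `-n` option is not available in every implementation); the function name `parallel_count` and the default of 4 jobs are illustrative. It only counts matches; context options such as -A/-B would not see across slice boundaries, and output from concurrent greps would need to be reassembled in order.

```shell
# Count matching lines using several grep processes, one per slice of the file.
# Assumption: GNU split, whose "-n l/K/N" writes slice K of N to stdout
# without splitting lines.
parallel_count() {
    file=$1 pattern=$2 jobs=${3:-4}
    tmp=$(mktemp -d) || return 1
    k=1
    while [ "$k" -le "$jobs" ]; do
        # Each background pipeline writes its own count to a separate file.
        split -n "l/$k/$jobs" "$file" |
            LC_ALL=C grep -F -c -- "$pattern" > "$tmp/$k" &
        k=$((k + 1))
    done
    wait    # let every slice finish
    cat "$tmp"/* | awk '{ total += $1 } END { print total + 0 }'
    rm -rf -- "$tmp"
}

# Intended usage (not run here because it needs the real file):
#   parallel_count eightygigsfile.sql 'db_pd.Clients' 8
```

Whether this beats a single grep depends on whether the run is CPU-bound or disk-bound; on a single slow disk, several readers may not help.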

answered Dec 17 '12 at 11:18

Eugen Rieck

score 3 · Answer 5 · edited Jun 11 '23 at 05:48

Try ripgrep

It is much faster than grep.

For example, on a live test (11 GB mailbox archive):

rg (ripgrep)

time rg -c "^From " ~/Documents/archive.mbox
99176
rg -c "^From " ~/Documents/archive.mbox  
1.38s user 5.24s system 62% cpu 10.681 total

vs grep

time grep -c "^From " ~/Documents/archive.mbox
99176
grep --color=auto --exclude-dir={.bzr,CVS,.git,.hg,.svn,.idea,.tox} -c    
125.56s user 6.61s system 98% cpu 2:13.56 total

Note that I've had better rg results than 10 seconds (6 seconds is the best time so far) for the same 11 GB file. Grep consistently takes more than 2 minutes.

edited Jun 11 '23 at 05:48

ocodo

answered Aug 25 '21 at 08:10

Shailesh

What is the reason for that? Is there a cleverer implementation behind it? – Filip Seman Jul 20 '23 at 16:27

score 1 · Answer 6 · answered Jan 18 '18 at 03:10

< eightygigsfile.sql parallel -k -j120% -n10 -m grep -F -i -C 5 'db_pd.Clients'

If you need to search for multiple strings, grep -f strings.txt saves a ton of time. The above is a translation of something that I am currently testing. The -j and -n option values seemed to work best for my use case. The -F option to grep also made a big difference.
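A minimal sketch of the grep -f idea: put one fixed string per line in a file and let a single pass over the data look for all of them at once. The helper name `grep_many` and the second table name `db_pd.Orders` are made up for the demo.

```shell
# Search one file for any of several fixed strings in a single pass.
grep_many() {
    # $1 = file to search, $2 = file with one fixed string per line.
    # Note: an empty line in the pattern file matches every input line.
    LC_ALL=C grep -F -f "$2" -- "$1"
}

strings=$(mktemp)
data=$(mktemp)
printf '%s\n' 'db_pd.Clients' 'db_pd.Orders' > "$strings"   # db_pd.Orders is hypothetical
printf '%s\n' 'INSERT INTO db_pd.Clients VALUES (1);' \
              'INSERT INTO db_pd.Orders VALUES (2);' \
              'INSERT INTO other.Things VALUES (3);' > "$data"
grep_many "$data" "$strings"    # prints the first two lines only
rm -f -- "$strings" "$data"
```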

score 0 · Answer 7 · answered Jun 06 '22 at 15:29

All the above answers were great. What really helped me on my 111 GB file was using LC_ALL=C fgrep -m <maxnum> fixed_string filename.

However, sometimes the pattern repeats an unknown number of times (possibly zero), in which case maxnum cannot be calculated in advance. The workaround is to use the start and end patterns for the event(s) you are trying to process, and then work on the line numbers between them. Like so:

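# Line number of the first match of each pattern (-m 1 stops at the first hit).
startline=$(grep -n -m 1 -- "$start_pattern" file | cut -d: -f1)
endline=$(grep -n -m 1 -- "$end_pattern" file | cut -d: -f1)
# Keep only the lines from startline through endline, inclusive.
logs=$(tail -n "+$startline" file | head -n "$((endline - startline + 1))")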

Then work on this subset of logs!
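For comparison, the same kind of extraction can be sketched as a single pass with awk that stops reading once the end marker is seen. This is a variation, not the method above: it treats both markers as fixed strings, looks for the end marker only at or after the start line, and the function name `extract_block` is invented here.

```shell
# Print the lines from the first line containing $2 through the next line
# containing $3 (inclusive), then stop reading the file.
extract_block() {
    # Caveat: awk -v interprets backslash escapes in the marker strings.
    awk -v s="$2" -v e="$3" '
        !on && index($0, s) { on = 1 }    # start marker found
        on                  { print }
        on && index($0, e)  { exit }      # end marker found: stop early
    ' "$1"
}

log=$(mktemp)
printf '%s\n' 'noise' 'BEGIN job 1' 'working' 'END job 1' 'more noise' > "$log"
extract_block "$log" 'BEGIN job' 'END job'    # prints the three middle lines
rm -f -- "$log"
```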

RARE Kpop Manifesto · Answer 8 · 2022-06-06T18:47:03.270

Hmm, what speeds do you need? I created a synthetic 77.6 GB file with nearly 525 million rows and plenty of Unicode:

rows = 524759550. | UTF8 chars = 54008311367. | bytes = 83332269969.

and randomly selected rows at an average rate of 1 in every 3^5, using rand() rather than just NR % 243, to place the string db_pd.Clients at a random position in the middle of the existing text, totaling 2.16 million rows where the regex pattern hits:

rows       = 2160088. | UTF8 chars = 42286394. | bytes = 42286394.


% dtp;  pvE0 < testfile_gigantic_001.txt| 
        mawk2 '
        _^(_<_)<NF { print (__=NR-(_+=(_^=_<_)+(++_)))<!_\
                           ?_~_:__,++__+_+_ }' FS='db_pd[.]Clients' OFS=','     

  in0: 77.6GiB 0:00:59 [1.31GiB/s] [1.31GiB/s] [===>] 100%            
 out9: 40.3MiB 0:00:59 [ 699KiB/s] [ 699KiB/s] [ <=> ]
  
524755459,524755470
524756132,524756143
524756326,524756337
524756548,524756559
524756782,524756793
524756998,524757009
524757361,524757372

mawk2 took just 59 seconds to extract the list of row ranges needed. From there the rest should be relatively trivial. Some of the ranges may overlap.

At throughput rates of 1.3GiB/s, as seen above calculated by pv, it might even be detrimental to use utils like parallel to split the tasks.
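For readers who find the one-liner above hard to follow, here is a plainer awk sketch of the same general idea: for every line containing the string, print a "first,last" range of row numbers around it. It is not a transliteration of the original, so the exact bounds may differ from the output shown; the function name `match_ranges` and the default of 5 context rows are assumptions, and any awk (not just mawk2) should run it.

```shell
# For each line containing a fixed string, print "first,last" row numbers
# covering ctx rows of context on either side (first is clamped to 1).
match_ranges() {
    # $1 = file, $2 = fixed string, $3 = rows of context (default 5)
    LC_ALL=C awk -v pat="$2" -v ctx="${3:-5}" '
        index($0, pat) {
            lo = NR - ctx
            if (lo < 1) lo = 1
            print lo "," (NR + ctx)
        }
    ' "$1"
}

rows=$(mktemp)
printf '%s\n' x 'db_pd.Clients a' x x x x 'b db_pd.Clients' x x x > "$rows"
match_ranges "$rows" 'db_pd.Clients'    # prints 1,7 and 2,12
rm -f -- "$rows"
```

The printed ranges can then be fed to something like sed -n to pull out the rows; overlapping ranges would need merging first.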
