
I’m trying to do something pretty simple: grep each string from a list, as an exact match, against the files in a directory:

#try grep each line from the files
for i in $(cat /data/datafile); do 
LOOK=$(echo $i);
fgrep -r $LOOK /data/filestosearch >>/data/output.txt
done

The file with the strings to grep for has 20 million lines, and the directory has ~600 files with a total of ~40 million lines. I can see that this is going to be slow, but we estimated it would take 7 years. Even if I use 300 cores on our HPC and split the job by files to search, it looks like it could take over a week.

There are similar questions:

Loop Running VERY Slow

Very slow foreach loop

Although they are on different platforms, I think possibly if/else might help me, or fgrep, which is potentially faster (but seems to be a bit slow as I'm testing it now). Can anyone see a faster way to do this? Thank you in advance.

jksl
  • Use `fgrep --word-regexp` unless you need substring-matching. Also try `fgrep --files-with-matches` if you only want to know the matching filenames. – Perleone Jan 03 '13 at 18:07
  • Since `i` is already a variable, there is no need to spawn a subshell just to assign its value to another variable called `LOOK`. – chepner Jan 03 '13 at 18:40
  • As an aside, the needless and inelegant backticks also contribute to the slowdown, though probably no more than a few per cent of the overall processing time. http://partmaps.org/era/unix/award.html#backticks – tripleee Jan 03 '13 at 19:58
  • @tripleee , i can't see the backticks, which do you mean? – jksl Jan 07 '13 at 08:57
  • Sorry, I was inexactly referring to the `$(command)` subsitutions (the legacy Bourne syntax for this construct used backticks). – tripleee Jan 07 '13 at 10:07
  • See this post: [Fastest way to find lines of a file from another larger file in Bash](https://stackoverflow.com/questions/42239179/fastest-way-to-find-lines-of-a-file-from-another-larger-file-in-bash) – codeforester Mar 03 '18 at 18:36

5 Answers


Sounds like the -f flag for grep would be suitable here:

-f FILE, --file=FILE
    Obtain  patterns  from  FILE,  one  per  line.   The  empty file
    contains zero patterns, and therefore matches nothing.   (-f  is
    specified by POSIX.)

so grep can already do what your loop is doing, and you can replace the loop with:

grep -F -r -f /data/datafile /data/filestosearch >>/data/output.txt

Now I'm not sure about the performance of 20 million patterns, but at least you aren't starting 20 million processes this way so it's probably significantly faster.

Martin
  • Sorry, I meant to say I was using grep -f first; then I saw something about fgrep being faster, and I might edit the original to mention that. I see now from your answer that actually they need to be used in combination... no, I'm confused. Looking at my script, I'm re-editing it back to how it was. Sorry again. – jksl Jan 03 '13 at 16:52
  • Yes, `fgrep` means `grep -F` (upper case F), which is slightly faster since it treats patterns as fixed strings and not regular expressions. What I'm proposing is that you also add `-f` (lower case f) to load the patterns from a file. This way you will eliminate the overhead of `grep`'s startup. – Martin Jan 03 '13 at 16:56

As Martin has already said in his answer, you should use the -f option instead of looping. I think it should be much faster than the loop.

Also, this looks like an excellent use case for GNU parallel. Check out this answer for usage examples. It looks difficult, but is actually quite easy to set up and run.

Other than that, 40 million lines should not be a very big deal for grep if there were only one string to match. It should be able to do it in a minute or two on any decent machine. I tested that 2 million lines take about 6 s on my laptop, so 40 million lines should take roughly 2 minutes.

The problem is that there are 20 million strings to be matched. I think it must be running out of memory or something, especially when you run multiple instances of it on different directories. Can you try splitting the input match-list file, for example into chunks of 100,000 patterns each? (See the sketch below.)

EDIT: Just tried parallel on my machine. It is amazing. It automatically takes care of splitting the grep onto several cores and several machines.
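
For concreteness, here is a minimal sketch of the split-then-parallelize idea, assuming GNU split and GNU parallel are installed. The chunk prefix /tmp/patterns. and the job count of 8 are arbitrary choices of mine; the paths are the ones from the question.

# Split the 20-million-line pattern file into chunks of 100,000 patterns each
# (this produces /tmp/patterns.aa, /tmp/patterns.ab, ...):
split -l 100000 /data/datafile /tmp/patterns.

# Run one fixed-string grep per chunk, 8 chunks at a time,
# appending all matches to the output file:
parallel -j 8 'grep -F -r -f {} /data/filestosearch' ::: /tmp/patterns.* >> /data/output.txt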

Hari Menon

Here's one way to speed things up:

while read i
do
    LOOK=$(echo $i)
    fgrep -r $LOOK /data/filestosearch >> /data/output.txt
done < /data/datafile

When you do `for i in $(cat /data/datafile)`, you first spawn another process, and that process must cat out all of those lines before running the rest of the script. Plus, there's a good possibility that you'll overload the command line and lose some of the entries at the end.

By using a while read loop and redirecting the input from /data/datafile, you eliminate the need to spawn a shell. Plus, your script will immediately start reading through the while loop without first having to cat out the entire /data/datafile.

If each $i is a directory, and you are interested in the files underneath, I wonder if find might be a bit faster than fgrep -r.

while read i
do
    LOOK=$(echo $i)
    find $i -type f | xargs fgrep $LOOK >> /data/output.txt
done < /data/datafile

The xargs will take the output of find and pass as many files as possible to a single fgrep. However, xargs can be dangerous if file names in those directories contain whitespace or other strange characters. You can try (depending upon the system) something like this:

find $i -type f -print0 | xargs --null fgrep $LOOK >> /data/output.txt

On the Mac it's

find $i -type f -print0 | xargs -0 fgrep $LOOK >> /data/output.txt

As others have stated, if you have the GNU version of grep, you can give it the -f flag and include your /data/datafile. Then, you can completely eliminate the loop.

Another possibility is to switch to Perl or Python, which will actually run faster than the shell and give you a bit more flexibility.

David W.
  • Ah, thanks for the detailed explanation, this makes it much clearer for me. xargs should work; I have no whitespace. Okay, I'll try it, but not with Python this time. I've never used Python, but I think I will try and learn some. My Perl is even shakier than my Bash knowledge, but it should be simple enough. – jksl Jan 07 '13 at 09:01
  • Also, I need the line from the file, not the location, so I think find is not so useful. Thanks. – jksl Jan 07 '13 at 09:04
  • The `$(echo)` still seems completely superfluous and wasteful. – tripleee Jan 07 '13 at 10:08
  • Using `find` to run `fgrep` in each file will indeed extract the matching lines, not file names (which is what I imagine you mean with "locations"). – tripleee Jan 07 '13 at 10:09
  • `xargs -0` should work on other platforms too, at least those with GNU `find`. – tripleee Jan 07 '13 at 10:11

Since you are searching for simple strings (and not regexps), you may want to use comm:

comm -12 <(sort find_this) <(sort in_this.*) > /data/output.txt

It takes up very little memory, whereas grep -f find_this can gobble up 100 times the size of 'find_this'.

On an 8-core machine this takes 100 seconds on these files:

$ wc find_this; cat in_this.* | wc
3637371   4877980 307366868 find_this
16000000 20000000 1025893685

Be sure to have a reasonably new version of sort. It should support --parallel.
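
For example, the same command with the parallelism made explicit (a sketch, assuming a GNU coreutils sort that accepts --parallel; 8 is an arbitrary thread count and the file names are the ones above):

comm -12 <(sort --parallel=8 find_this) <(sort --parallel=8 in_this.*) > /data/output.txt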

Ole Tange

You can write a Perl/Python script that will do the job for you. It saves all the forks you need to do when you do this with external tools.

Another hint: you can combine the strings that you are looking for into one regular expression. In that case grep will make only one pass over the file for all of the combined patterns.

Example:

Instead of

for i in ABC DEF GHI JKL
do
grep $i file >> results
done

you can do

egrep "ABC|DEF|GHI|JKL" file >> results
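
If the patterns live in a file, one way to build such an alternation without typing it out is the following sketch. It is hypothetical: small_patterns.txt is a placeholder name, it only works for a modest number of patterns (a 20M-line file would overflow the command line), and the strings are treated as regular expressions, so metacharacters would need escaping.

# Join the lines of a (small) pattern file with '|' to form one alternation:
egrep "$(paste -sd'|' small_patterns.txt)" file >> results
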
Igor Chubin
  • Hmm, I think it would be a bit too long to put 20M matches in, but I could wrap it in a script, I suppose; I might try this. Thank you for replying. – jksl Jan 07 '13 at 08:59