13
grep -F -f file1  file2

file1 is 90 MB (2.5 million lines, one word per line)

file2 is 45 GB

That command doesn't actually produce anything whatsoever, no matter how long I leave it running. Clearly, this is beyond grep's scope.

It seems grep can't handle that many queries from the -f option. However, the following command does produce the desired result:

head file1  >  file3
grep -F -f file3   file2

I doubt that sed or awk would be appropriate alternatives either, given the file sizes.

I am at a loss for alternatives... please help. Is it worth learning some SQL commands? Is it easy? Can anyone point me in the right direction?

cmo
  • 2
    Can you use the `split` command to break file1 into pieces? – Dan Pichelman May 02 '13 at 17:07
  • 1
    SQL commands will not generally help you with raw files. – likeitlikeit May 02 '13 at 17:08
  • 2
@DanPichelman if he splits the pattern file into 100 pieces, he has to run against the 45 GB monster 100 times... this is OK... **AND** he has to remove duplicated matched lines, since grep -f does an "OR"... I don't know if it is faster. – Kent May 02 '13 at 17:10
what OS are you running on? My experience with `grep -F -f listFile` is that you'd get an error message saying `listFile too big` (or similar). Hm... Other readers... isn't there something about `-f listFile` needing to be a sorted file? Also, while SQL could solve this problem, there will be a huge setup time to get SQL installed, configured, etc. If you're making a production process that will run regularly, it may be worth the time investment, but it's probably not in your project timeline. Good luck! – shellter May 02 '13 at 17:12
  • @shellter I'm on Debian – cmo May 02 '13 at 17:14
(See revised comment above). At least you're using a modern grep. I looked at my `man grep` and don't see anything about `-f fileList` needing to be sorted, but let's see if others can confirm that. ALSO, are you running on a large multi-processor system? Maybe the Hadoop system could help, again with the same caveats as for installing SQL. Good luck. – shellter May 02 '13 at 17:14
  • @DanPichelman To be honest, I *hate* working with `split` and recombining the file pieces... – cmo May 02 '13 at 17:14
  • are you looking for whole words or are you doing full text search? You might want to try a hashtable-based search strategy. – Lie Ryan May 02 '13 at 17:17
  • @LieRyan whole words. What is a hash table strategy? – cmo May 02 '13 at 17:22
  • @shellter I am working on a big network, so `sqlite3` is already installed, I have checked. I just have *no idea* how to use it. – cmo May 02 '13 at 17:24
@Matthew Parks: you'll need to write a simple script. Either Python or Perl would be suitable. Simply put, you need to build an "index" to speed up the search. I can't say much more without seeing what your data looks like. – Lie Ryan May 02 '13 at 17:39
@MatthewParks: the top answer on this question might help: http://stackoverflow.com/questions/11490036/fast-alternative-to-grep-f You might also want to check out the `join` command suggested on the same page, if it is suitable for your purpose. – Lie Ryan May 02 '13 at 17:44
  • @LieRyan Thank you, I had looked at that other question – cmo May 02 '13 at 18:12
  • 1
    You can just do cat file2, because if you grep out 2.5 million words from a file, almost all the lines are gonna show up eventually :) – Sidharth C. Nadhan May 03 '13 at 06:40

4 Answers

18

Try using LC_ALL=C. It makes the search treat the patterns as plain ASCII bytes instead of UTF-8, which speeds things up to about 140 times the original speed. For me, a 26 GB file that would take around 12 hours went down to a couple of minutes. Source: Grepping a huge file (80GB) any way to speed it up?

So what I do is:

LC_ALL=C fgrep "pattern" <input >output
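
Applied to the command in the question, that would presumably look like this (output is just a placeholder file name):

LC_ALL=C grep -F -f file1 file2 > output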
Mojing Liu
5

I don't think there is an easy solution.

Imagine you write your own program to do what you want: you will end up with a nested loop, where the outer loop iterates over the lines of file2 and the inner loop iterates over the lines of file1 (or vice versa). The number of comparisons grows with size(file1) * size(file2), which is a very large number when both files are large. Making one file smaller using head apparently resolves this issue, at the cost of no longer giving the correct result.
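
For illustration, the naive nested loop amounts to something like this in shell (a sketch only, not a recommendation; it rescans file2 once per pattern, which is exactly the quadratic cost described above):

# one full pass over the 45 GB file2 for every single pattern in file1
# (lines matching several patterns would also be printed several times)
while read -r pattern; do
    grep -F -- "$pattern" file2
done < file1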

A possible way out is indexing (or sorting) one of the files. If you iterate over file2 and for each word you can determine whether or not it is in the pattern file without having to fully traverse the pattern file, then you are much better off. This assumes that you do a word-by-word comparison. If the pattern file contains not only full words, but also substrings, then this will not work, because for a given word in file2 you wouldn't know what to look for in file1.
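
For the word-by-word case, one way to get this effect with standard tools is to sort both files once and merge them (a rough sketch, assuming one word per line in both files and that file2's original line order does not matter):

# sort both files with the same byte-order collation, then keep only the
# lines of file2 whose word also appears somewhere in file1
LC_ALL=C sort file1 > file1.sorted
LC_ALL=C sort file2 > file2.sorted
LC_ALL=C join file1.sorted file2.sorted > matches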

Learning SQL is certainly a good idea, because learning something is always good. It will, however, not solve your problem, because SQL will suffer from the same quadratic effect described above. It may simplify indexing, should indexing be applicable to your problem.
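
As an illustration of the "it may simplify indexing" point: sqlite3 (which the asker says is already installed) can do an indexed lookup without any server setup. A minimal sketch, assuming one word per line in both files and no commas or quotes in the words (words.db is just a placeholder database name):

# load both files into tables, index the pattern words, then look them up
sqlite3 words.db <<'EOF'
CREATE TABLE patterns(word TEXT);
CREATE TABLE data(word TEXT);
.mode csv
.import file1 patterns
.import file2 data
CREATE INDEX patterns_word ON patterns(word);
SELECT data.word FROM data WHERE data.word IN (SELECT word FROM patterns);
EOF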

Your best bet is probably taking a step back and rethinking your problem.

Martin Drautzburg
4

You can try ack. It is said to be faster than grep.

You can try parallel:

parallel --progress -a file1 'grep -F {} file2'

Parallel has many other useful switches to make computations faster.
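
Note that the command above starts one grep per pattern, each scanning the whole of file2. Another way to use parallel (a sketch, assuming GNU parallel; how well it works still depends on grep coping with the full pattern list) is to split file2 into blocks and run the pattern list against each block:

cat file2 | parallel --pipe --block 100M grep -F -f file1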

Sidharth C. Nadhan
2

Grep can't handle that many queries, and at that volume, it won't be helped by fixing the grep -f bug that makes it so unbearably slow.

Are both file1 and file2 composed of one word per line? That means you're looking for exact matches, which we can do really quickly with awk:

awk 'NR == FNR { query[$0] = 1; next } query[$0]' file1 file2

NR (number of records, the line number) is only equal to FNR (the file-specific number of records) for the first file, where we populate the hash and then move on to the next line. The second clause checks whether a line in the other file(s) matches one saved in our hash and prints the matching lines.

Otherwise, you'll need to iterate:

awk 'NR == FNR { query[$0]=1; next }
     { for (q in query) if (index($0, q)) { print; next } }' file1 file2

Instead of merely checking the hash, we have to loop through each query and see if it matches the current line ($0). This is much slower, but unfortunately necessary (though we're at least matching plain strings rather than regexes, which would be slower still). The loop stops as soon as we have a match.

If you actually wanted to evaluate the lines of the query file as regular expressions, you could use $0 ~ q instead of the faster index($0, q). Note that this uses POSIX extended regular expressions, roughly the same as grep -E or egrep but without bounded quantifiers ({1,7}) or the GNU extensions for word boundaries (\b) and shorthand character classes (\s,\w, etc).
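
That regex variant would look like this (only the comparison changes from the loop above):

awk 'NR == FNR { query[$0]=1; next }
     { for (q in query) if ($0 ~ q) { print; next } }' file1 file2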

These should work as long as the hash doesn't exceed what awk can store. This might be as low as 2.1B entries (a guess based on the highest 32-bit signed int) or as high as your free memory.

Adam Katz