
I am currently trying to grep a large list of IDs (~5,000) against an even larger CSV file (3,000,000 lines).

I want all the CSV lines that contain an ID from the ID file.

My naive approach was:

cat the_ids.txt | while read line
do
  cat huge.csv | grep $line >> output_file
done

But this takes forever!

Are there more efficient approaches to this problem?

– leifg
  • To find all of the strings inside a file, you can run grep in a for loop: https://unix.stackexchange.com/a/462445/43233 – Noam Manos Aug 14 '18 at 06:54

4 Answers


Try

grep -f the_ids.txt huge.csv

Additionally, since your patterns seem to be fixed strings, supplying the -F option might speed up grep.

   -F, --fixed-strings
          Interpret PATTERN as a list of fixed strings, separated by newlines,
          any of which is to be matched. (-F is specified by POSIX.)
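Putting the two together (a sketch only, reusing the output_file redirection from the question):

grep -F -f the_ids.txt huge.csv > output_file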
– devnull

Use grep -f for this:

grep -f the_ids.txt huge.csv > output_file

From man grep:

-f FILE, --file=FILE

Obtain patterns from FILE, one per line. The empty file contains zero patterns, and therefore matches nothing. (-f is specified by POSIX.)

If you provide some sample input, maybe we can improve the grep condition a little more.

Test

$ cat ids
11
23
55
$ cat huge.csv 
hello this is 11 but
nothing else here
and here 23
bye

$ grep -f ids huge.csv 
hello this is 11 but
and here 23
– fedorqui

grep -f filter.txt data.txt gets unruly when filter.txt is larger than a couple of thousand lines and hence isn't the best choice for such a situation. Even when using grep -f, keep a few things in mind:

  • use the -x option if you need to match the entire line in the second file
  • use -F if the first file contains fixed strings, not patterns
  • use -w to prevent partial matches when not using the -x option (a combined sketch follows this list)
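For illustration, a sketch of how these flags combine for the fixed-string, whole-line case (using the same file names as the rest of this answer):

grep -F -x -f filter.txt data.txt > matching.txt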

In summary, the best way to handle grep -f on large files is:

Matching entire line:

awk 'FNR==NR {hash[$0]; next} $0 in hash' filter.txt data.txt > matching.txt

(FNR==NR is true only while awk reads the first file, filter.txt, so its lines are stored as keys of the hash array; each line of data.txt is then printed only if it is present as a key.)

Matching a particular field in the second file (using ',' delimiter and field 2 in this example):

awk -F, 'FNR==NR {hash[$1]; next} $2 in hash' filter.txt data.txt > matching.txt

and for grep -vf:

Matching entire line:

awk 'FNR==NR {hash[$0]; next} !($0 in hash)' filter.txt data.txt > not_matching.txt

Matching a particular field in the second file (using ',' delimiter and field 2 in this example):

awk -F, 'FNR==NR {hash[$1]; next} !($2 in hash)' filter.txt data.txt > not_matching.txt
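As a quick sanity check of the field-matching variants, here is a run on made-up toy files (purely hypothetical data, in the spirit of the test in the earlier answer):

$ cat filter.txt
11
23
$ cat data.txt
a,11,x
b,99,y
c,23,z
$ awk -F, 'FNR==NR {hash[$1]; next} $2 in hash' filter.txt data.txt
a,11,x
c,23,z
$ awk -F, 'FNR==NR {hash[$1]; next} !($2 in hash)' filter.txt data.txt
b,99,y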
– codeforester

You may get a significant search speedup with ugrep when matching the strings in the_ids.txt against your large huge.csv file:

ugrep -F -f the_ids.txt huge.csv

This works with GNU grep too, but I expect ugrep to run several times faster.

– Dr. Alex RE