10

file.contain.query.txt

ENST001
ENST002
ENST003

file.to.search.in.txt

ENST001  90
ENST002  80
ENST004  50

Because ENST003 has no entry in the second file and ENST004 has no entry in the first file, the expected output is:

ENST001 90
ENST002 80

To search for multiple queries in a particular file, we usually do the following:

grep -f file.contain.query.txt file.to.search.in.txt > output.file

Since I have about 10,000 queries and almost 100,000 rows in file.to.search.in.txt, it takes a very long time to finish (around 5 hours). Is there a fast alternative to grep -f?

user1421408

8 Answers

11

If you want a pure Perl option, read your query file keys into a hash table, then check standard input against those keys:

#!/usr/bin/env perl
use strict;
use warnings;

# build a hash table of keys
my %keyring;
open my $keys, '<', 'file.contain.query.txt' or die "can't open query file: $!";
while (<$keys>) {
    chomp;
    $keyring{$_} = 1;
}
close $keys;

# look up the key from each line of standard input
while (<STDIN>) {
    chomp;
    my ($key, $value) = split /\t/; # assuming the search file is tab-delimited; replace the delimiter as needed
    print "$_\n" if defined $keyring{$key};
}

You'd use it like so:

perl lookup.pl < file.to.search.in.txt > output.file

A hash table can take a fair amount of memory, but searches are much faster (hash table lookups run in constant time), which is handy here since you have ten times as many lines to look up as keys to store.

Alex Reynolds
8

If you have fixed strings, use grep -F -f. This is significantly faster than a regex search.
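For the sample files from the question, the fixed-string variant might look like this. Note that the -w flag is an extra precaution not in the one-liner above: it keeps a short key such as ENST001 from also matching as a prefix of a longer ID.

```shell
# recreate the sample files from the question in a scratch directory
cd "$(mktemp -d)"
printf 'ENST001\nENST002\nENST003\n' > file.contain.query.txt
printf 'ENST001  90\nENST002  80\nENST004  50\n' > file.to.search.in.txt

# -F: treat each pattern as a fixed string; -w: match whole words only
grep -F -w -f file.contain.query.txt file.to.search.in.txt > output.file
cat output.file
# prints:
# ENST001  90
# ENST002  80
```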

tripleee
5

This Perl code may help you:

use strict;
use warnings;
open my $file1, "<", "file.contain.query.txt" or die $!;
open my $file2, "<", "file.to.search.in.txt" or die $!;

my %KEYS = ();
# Hash %KEYS marks the filtered keys by "file.contain.query.txt" file

while(my $line=<$file1>) {
    chomp $line;
    $KEYS{$line} = 1;
}

while(my $line=<$file2>) {
    if( $line =~ /(\w+)\s+(\d+)/ ) {
        print "$1 $2\n" if $KEYS{$1};
    }
}

close $file1;
close $file2;
Miguel Prz
4

If the files are already sorted (join matches on the first whitespace-separated field by default):

join file1 file2

if not:

join <(sort file1) <(sort file2)
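Applied to the sample files from the question (both already sorted on the join field), this looks like the following; note that join reassembles matching lines with a single space, so the double spaces from the search file are collapsed:

```shell
# recreate the sample files from the question in a scratch directory
cd "$(mktemp -d)"
printf 'ENST001\nENST002\nENST003\n' > file.contain.query.txt
printf 'ENST001  90\nENST002  80\nENST004  50\n' > file.to.search.in.txt

# print lines whose first field appears in both files
join file.contain.query.txt file.to.search.in.txt
# prints:
# ENST001 90
# ENST002 80
```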
Dennis Williamson
4

If you are using Perl version 5.10 or newer, you can join the query terms into a single regular expression with the terms separated by a pipe (like ENST001|ENST002|ENST003). Perl builds a 'trie' from the alternation, which, like a hash, does lookups in constant time. It should run about as fast as the solution using a lookup hash. This is just to show another way to do it.

#!/usr/bin/perl
use strict;
use warnings;
use Inline::Files; # CPAN module: provides the __QUERY__ and __RAW__ filehandles below

my $query = join "|", map {chomp; $_} <QUERY>;

while (<RAW>) {
    print if /^(?:$query)\s/;
}

__QUERY__
ENST001
ENST002
ENST003
__RAW__
ENST001  90
ENST002  80
ENST004  50
Chris Charley
1

MySQL:

Importing the data into MySQL or a similar database will provide an immense improvement. Would this be feasible? You could see results in a few seconds.

mysql -e 'select search.* from search join contains using (keyword)' > outfile.txt 

# but first you need to create the tables like this (only once off)

create table contains (
   keyword   varchar(255)
   , primary key (keyword)
);

create table search (
   keyword varchar(255)
   ,num bigint
   ,key (keyword)
);

# and load the data in:

load data infile 'file.contain.query.txt' 
    into table contains fields terminated by "add column separator here";
load data infile 'file.to.search.in.txt' 
    into table search fields terminated by "add column separator here";
  • I haven't tested this but it will work with a bit of tweaking depending on your situation. It will take very little memory unless you want it to be ram based. – Abé Wickham Jul 15 '12 at 07:19
0
use strict;
use warnings;

system("sort file.contain.query.txt > qsorted.txt");
system("sort file.to.search.in.txt  > dsorted.txt");

open (QFILE, "<qsorted.txt") or die();
open (DFILE, "<dsorted.txt") or die();

my $dline = <DFILE>;
while (my $qline = <QFILE>) {
  my ($queryid) = ($qline =~ /ENST(\d+)/);
  while (defined $dline) {
    my ($dataid) = ($dline =~ /ENST(\d+)/);
    last if $dataid > $queryid;       # both files are sorted, so no later match is possible
    print $dline if $dataid == $queryid;
    $dline = <DFILE>;                 # keep the data line for the next query when we overshoot
  }
}
perreal
0

This may be a little dated, but is tailor-made for simple UNIX utilities. Given:

  • keys are fixed-length (here 7 chars)
  • files are sorted (true in the example) allowing the use of fast merge sort

Then:

$ sort -m file.contain.query.txt file.to.search.in.txt | tac | uniq -d -w7
ENST002  80
ENST001  90

Variants:

To strip the number printed after the key, remove the tac command:

$ sort -m file.contain.query.txt file.to.search.in.txt | uniq -d -w7

To keep sorted order, add an extra tac command at the end:

$ sort -m file.contain.query.txt file.to.search.in.txt | tac | uniq -d -w7 | tac
AdamD