4

Where grep finds a short pattern from a pattern file within the long lines of a lookup file, I need a tool that does the reverse: extract the short lines of a lookup file that can be found within a longer pattern.

In other words, given the works of Shakespeare with one sentence per line, and, say, a French dictionary, I want to find which French words are found in which lines of Shakespeare. A line of Shakespeare may contain more than one French word, and a French word may appear in more than one line of Shakespeare; both cases need to be detected.

For example:

pattern_file={
"The sun is shining!"
"It is a beautiful day!"}

lookup_file={
"Rain"
"Sun"
"Cloud"
"Beautiful"
"Shining"}

What I would like is

function file pattern

To give both the line that is found in the longer pattern and the longer pattern itself, separated by a comma, with multiple matches being detected.

ideal_result_file={
"Sun","The sun is shining!"
"Beautiful","It is a beautiful day!"
"Shining","The sun is shining!"}

Currently, I loop over the whole lookup file line by line with grep:

    while read line
    do
      # one grep (and sed) invocation per lookup line
      grep -is "$line" pattern_file | sed "s/^/${line},/" >> result_file.csv
    done < lookup_file

This is incredibly slow! My lookup_file contains over 50,000 lines while my pattern_file contains 500. Whereas using grep to find an even shorter pattern in my lookup_file takes seconds, a single pass using my loop approach takes days or weeks.

Solutions in any language would be appreciated.

Somewhat related to:
  • Very slow loop using grep or fgrep on large datasets
  • Is Perl faster than bash?

The solution needs to be compatible with GB-size lookup and pattern files.

Etienne Low-Décarie
  • Does lookup_file consist of plain text as shown or does it have regular expressions? – Joni Mar 29 '13 at 22:24
  • lookup_file is plain text – Etienne Low-Décarie Mar 31 '13 at 22:18
  • Can this while loop be vectorized? Or translated into another (compiled) language that would be more efficient? – Etienne Low-Décarie Mar 31 '13 at 22:19
  • I don't think any compiled language is going to be more efficient than `grep`. Anyway, using `grep -F -f /usr/share/dict/words` (99 thousand words) on a 2000-word text file runs in less than a second, though it does produce only the longest match (e.g. `anything` will produce a match for `anything` and not for `any`). You want the output to show all matches? – Joni Apr 01 '13 at 11:03
  • Yes, all matches would be necessary. Nothing could be faster than grep if I could use grep on its own here, but maybe the looping approach with grep would be faster in another language? – Etienne Low-Décarie Apr 02 '13 at 15:37
  • The basic problem is that shorter patterns are substrings in longer patterns, for example `any` in `anything`. What you can do is to group the lookup words by length into different files, and loop over those files with `grep` - if two different words are the same length one can't be a substring of another. I'll edit the answer to explain this idea further. – Joni Apr 02 '13 at 17:18
  • Ah, but what about overlapping patterns? If your lookup words include `ab` and `bc` you want two matches for `abc`? – Joni Apr 02 '13 at 17:23
  • Yes, overlapping matches would need to be counted twice, as in two counts for ab and bc in abc. – Etienne Low-Décarie Apr 06 '13 at 19:43
  • You could have said you were working with DNA data in the first place. Bioinformatics is a field of its own with many different algorithms for finding matches in DNA; using English words as an example is a red herring. – Joni Apr 09 '13 at 13:38
  • Apologies for that, I thought the comparison was adequate. – Etienne Low-Décarie Apr 09 '13 at 14:40

9 Answers

6

You can use the -f switch to use a "pattern file" in grep:

egrep -i -f lookup_file pattern_file >> result_file

This will be faster because grep compiles lookup_file into a single state machine that checks all matches at the same time, rather than checking each pattern against each line separately.

If your lookup_file consists of text and not regular expressions, you can use fgrep and it will be even faster.

To get your ideal output you can use the -n and -o switches and you get a list of patterns that match each line.
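For example, here is a minimal Python 3 sketch (my addition, not part of the original answer; it assumes GNU grep and the file names from the question) that pairs the -o -n output back with the original long lines to produce the requested CSV:

#!/usr/bin/env python3
# Sketch: turn `grep -o -n -i -f lookup_file pattern_file` output
# (lines like "1:sun") into the "word","sentence" CSV the question asks for.
# File names are the question's; the CSV quoting here is deliberately naive.
import subprocess

# Read the long lines once so grep's line numbers can be paired with them.
with open("pattern_file") as f:
    sentences = [line.rstrip("\n") for line in f]

# -o prints only the matched part, -n prefixes the pattern_file line number.
result = subprocess.run(
    ["grep", "-o", "-n", "-i", "-f", "lookup_file", "pattern_file"],
    capture_output=True, text=True)

with open("result_file.csv", "w") as csv:
    for hit in result.stdout.splitlines():
        lineno, word = hit.split(":", 1)
        csv.write('"%s","%s"\n' % (word, sentences[int(lineno) - 1]))

As the comments below point out, grep -o still reports only the longest match at each position, so overlapping matches and patterns contained in other patterns are missed.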

Joni
  • Thank you, however I was well aware of the -f flag. The problem is not using a file rather than looping, it is that in this situation I have no choice but to loop if I am to use grep. It is that my lookup_string is shorter and is what I want to find within the longer pattern. Admittedly I am not explaining this well, but I have not found a better way to explain it. – Etienne Low-Décarie Mar 29 '13 at 14:06
  • Why would you have to use a loop? The `-f` does not just solve the need for a loop; it makes `grep` compile the entire file into a single regular expression which is much better than a loop. – Joni Mar 29 '13 at 14:54
  • Etienne, this solution suggested by Joni seems to be the most sensible one. Another way to do this would be to do like databases do: create a patterns table, and create an index table that further indexes each pattern into smaller component patterns. By the way, the language of your problem is super confusing; you have to stop thinking of your patterns as patterns (even though that's what I called them) and think of them as data from now on (just like you would think of the content of a book that gets indexed to create the dictionary of a spellchecker). – Stephan Branczyk Mar 29 '13 at 17:02
  • I agree that the phrasing of my problem is confusing, because this is the inverse of the problem usually solved by grep (short pattern in long line of file); please feel free to edit. Switching what I call a pattern and the lookup file (as proposed by glenn jackman) and using the -f flag (as proposed by Joni) does get me closer but, as commented below, there are a number of difficulties with this simple switch, among them that loading the whole 50,000-line file as a pattern uses too much RAM, and that it does not allow the detection of multiple short segments within the long "pattern". – Etienne Low-Décarie Mar 29 '13 at 21:41
  • @EtienneLow-Décarie how much ram does loading your files require? I've named the files "words" and "sentences" so their meaning is more directly clear, and here's the results of 'egrep -o -n -i -f words.txt sentences.txt', which seems to be very close to what you want: 1:sun shining 2:beautiful – Sparr Apr 02 '13 at 23:00
3

Since you indicated any language is acceptable, I will post a completely different approach: with shell scripting you will never beat the performance of in-memory tools or databases. If you have a lot of data you should use a database; databases are meant for these kinds of operations and scale much better.

So here is a simple example using sqlite (www.sqlite.org).

You need to import your patterns and data into tables, like this for example (you can script this if you want):

CREATE TABLE patterns (pattern TEXT);
CREATE TABLE data (sentence TEXT);

BEGIN;

INSERT INTO patterns VALUES ('Sun');
INSERT INTO patterns VALUES ('Rain');
INSERT INTO patterns VALUES ('Cloud');
INSERT INTO patterns VALUES ('Beautiful');

INSERT INTO data VALUES ('The sun is shining');
INSERT INTO data VALUES ('It is a beautiful day');
INSERT INTO data VALUES ('It is cloudy and the sun shines');

COMMIT;

Then run a select query to get your desired output:

select pattern, group_concat(sentence) as doesmatch from (
    select pattern, sentence, lower(pattern) as lpattern, lower(sentence) as lsentence
    from patterns left outer join data
    where like('%' || lpattern || '%', lsentence)
) group by pattern;

If you save the first snippet as data.sql and the second one as query.sql you use this on the command line:

sqlite3 sentences.db < data.sql    # this imports your data, run once
sqlite3 sentences.db < query.sql

This gives you:

Beautiful|It is a beautiful day
Cloud|It is cloudy and the sun shines
Sun|The sun is shining,It is cloudy and the sun shines

which is what you want, I believe. To make it more fancy, use your favourite more advanced tool with a database library. I would choose Python for this.
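As a minimal sketch of that route (my addition; it assumes Python's built-in sqlite3 module and the file names from the question), using one transaction for the bulk load as in the example above:

#!/usr/bin/env python3
# Sketch: same schema and LIKE-based substring join as the SQL above,
# driven from Python. File names and the naive line-per-row load are
# assumptions for illustration.
import sqlite3

conn = sqlite3.connect("sentences.db")
conn.execute("CREATE TABLE IF NOT EXISTS patterns (pattern TEXT)")
conn.execute("CREATE TABLE IF NOT EXISTS data (sentence TEXT)")

# Bulk-load both files inside a single transaction; committing once
# instead of once per row is what makes the import fast.
with conn:
    with open("lookup_file") as f:
        conn.executemany("INSERT INTO patterns VALUES (?)",
                         ((line.strip(),) for line in f))
    with open("pattern_file") as f:
        conn.executemany("INSERT INTO data VALUES (?)",
                         ((line.strip(),) for line in f))

# Case-insensitive substring match, as in the query above.
query = """SELECT pattern, sentence FROM patterns, data
           WHERE lower(sentence) LIKE '%' || lower(pattern) || '%'"""
for pattern, sentence in conn.execute(query):
    print('"%s","%s"' % (pattern, sentence))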

Suggestions for further improvement:

  • use regex instead of like to filter whole words (i.e. pattern "sun" matches "sun" but not "sunny"),

  • import utility,

  • output formatting,

  • query optimization.

Martijn de Milliano
  • Though I should be, I am not terribly familiar with databases. Reading around though, it seems like reading the files into the database may become the new bottleneck here (http://stackoverflow.com/questions/5942402/python-csv-to-sqlite) as I have to do this for all pattern and lookup files I wish to process. Any suggestions on how to do this efficiently? – Etienne Low-Décarie Apr 02 '13 at 19:51
  • @EtienneLow-Décarie That Python code you pointed to is slow because it does not wrap the statements in a transaction. Just edited my example to take that into account. I'm not sure how much it will help but you could use a compiled language like C or Java to import/export your data into the database, you might get 50% but your mileage may vary. Depending on the amount of data the database might fit in memory, which will be faster too. – Martijn de Milliano Apr 02 '13 at 20:03
  • Starting to get a handle of this, thank you so much! However in your code, your tables and their columns are named the same, would you mind editing for clarity? – Etienne Low-Décarie Apr 08 '13 at 13:24
  • That is great! I have tried loading my data into a table in an sqlite database (Using SQLite database browser to get me started), even with a test data set (100 mb) far smaller than a single one of my files (>1 gb), this appears to be the bottle neck, though once this step is done it would speed up queries in the future. – Etienne Low-Décarie Apr 08 '13 at 19:43
3

Your solution may actually be slow because it creates 50,000 processes, each reading the 500-line pattern_file.

Another "pure bash & unix utils" solution could be to let grep do what it can do best and just match the output against your pattern_file.

So use grep to find matching lines and the parts that actually do match.

I use word matching here; it can be turned off by removing the -w switch from the grep line, to get the initial behavior described in your example.

The output is not yet redirected to result_file.csv, which is easy to add later 8)

#!/bin/bash
# open pattern_file
exec 3<> pattern_file

# declare and initialize integer variables
declare -i linenr
declare -i pnr=0

# loop for reading from the grep process
#
# grep process creates following output:
#   <linenumber>:<match>
# where linenumber is the number of the matching line in pattern_file
# and   match is the actual matching word (grep -w) as found in lookup_file
# grep output is piped through sed to actually get
#   <linenumber> <match>
while read linenr match ; do

   # skip line from pattern_file till we read the line
   # that contained the match
   while (( linenr > pnr )) ; do   # numeric, not lexicographic, comparison
       read -u 3 pline
       pnr+=1
   done

   # echo match and line from pattern_file
   echo "$match, $pline"
done < <( grep -i -w -o -n -f lookup_file pattern_file | sed -e 's,:, ,' )

# close pattern_file
exec 3>&-

result is

sun, The sun is shining
shining, The sun is shining
beautiful, It is a beautiful day!

for the example given. Attention: the match is now the exact match, with case preserved. So this does not result in Sun, ... but in sun, ....

The result is a script which reads pattern_file once, using a grep which in the best case reads pattern_file and lookup_file once each, depending on the actual implementation. It only starts two additional processes: grep and sed. (If needed, sed can be replaced by some bash substitution within the outer loop.)

I did not try it with a 50,000-line lookup_file and a 500-line pattern_file, though. But I think it may be as fast as grep can be.

As long as grep can keep the lookup_file in memory it may be reasonably fast. (Who knows.)

No matter whether it solves your problem, I would be interested in how it performs compared to your initial script, since I lack nice test files.

If grep -f lookup_file uses too much memory (as you mentioned in a comment before), it may be a solution to split it into portions that actually do fit into memory and run the script more than once, or use more than one machine, run all parts on those machines and just collect and concatenate the results. As long as the lookup_files do not contain dupes, you can concatenate the results without checking for dupes. If sorting matters, you can sort all the single results and then merge them quite fast using sort -m.

Splitting up the lookup_file should not affect runtimes dramatically as long as you split it only once and rerun the script, since your pattern_file may be small enough, with its 500 lines, to stay in the memory cache anyway. The same may be true for the lookup_file if you use more than one machine: its parts may just stay in memory on every machine.

EDIT:

As pointed out in my comment, this will not work for overlapping matches out of the box, since grep -f seems to return only the longest match and will not rematch. So if lookup_file contains

Sun
Shining
is
S

the result will be

sun, The sun is shining
is, The sun is shining
shining, The sun is shining

and not

sun, The sun is shining
is, The sun is shining
shining, The sun is shining
s, The sun is shining
s, The sun is shining
s, The sun is shining

So all three matches of s are missing.

In fact this is another issue with this solution: if a string is found twice, it will be matched twice and identical lines will be returned; these duplicates can be removed with uniq.

Possible workaround: split the lookup_file by the string length of the search strings. This will decrease the maximum memory needed for a run of grep but also slow the whole thing down a little bit. But: you can then search in parallel (and may want to check grep's --mmap option if doing that on the same server). See the sketch below.
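A minimal Python sketch of that splitting step (my addition; file names are assumptions): it writes each lookup word to lookup_file.LEN according to its length, and each of those files can then feed one grep run, possibly in parallel.

#!/usr/bin/env python3
# Sketch: group lookup words by length into lookup_file.<LEN> files.
# Within one such file no pattern can be a substring of another, so
# grep's longest-match behaviour can no longer hide matches.
handles = {}

with open("lookup_file") as f:
    for line in f:
        word = line.rstrip("\n")
        if not word:
            continue  # skip empty lines
        n = len(word)
        if n not in handles:
            handles[n] = open("lookup_file.%d" % n, "w")
        handles[n].write(word + "\n")

for handle in handles.values():
    handle.close()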

mariux
  • Thanks for this answer. I like the two step strategy (get matches then get the data from the matches). Currently, your approach with the info provided produces `sun, Rain , Rain beautiful, Sun` (there are empty matches). I'll try to wrap my mind around it to get it to work. Cheers – Etienne Low-Décarie Apr 06 '13 at 20:54
  • Is this with the data you provided for testing? Is it (`sun, Rain , Rain beautiful, Sun`) all printed in one line? That's weird. I tried it with empty lines in `pattern_file` and `lookup_file`. It was always the result I quoted above. What version of bash are you using? (I currently use 4.1.7) – mariux Apr 07 '13 at 15:29
  • ok.. after reading your latest comments to the other posts: yes, this solution fails on overlapping matches. I think grep only returns the longest match and then ignores any submatches so if your `lookup_file` contains `GATTACA` and `GATTA` A match to `CGATGATTACAGGG` will only return `GATTACA, CGATGATTACAGGG` and `GATTA, ...` will be missing. I don't think `grep` can do that - at least I did not find a commandline option. (But the result you got is still not explained by this issue) It just shows, `grep -f` will not help at all. – mariux Apr 07 '13 at 15:44
  • Edited the answer to address the overlapping matches. – mariux Apr 07 '13 at 16:13
2

You need to swap the meanings of the "pattern" and "lookup" files, and use grep's -o switch.

$ cat patterns 
The sun is shining!
It is a beautiful day!

$ cat lookup 
Rain
Sun
Cloud
Beautiful

$ grep -iof lookup patterns 
sun
beautiful
glenn jackman
  • This does not work for `lookup=[sun,sunny]` and `patterns=[sunny]` – Jo So Mar 29 '13 at 15:35
  • Thanks for this idea, however such a switch is not all that simple. There are a number of difficulties with it, amongst them that loading the whole 50,000-line file as a pattern uses too much RAM, and that it does not allow the detection of multiple short segments within the long "pattern". – Etienne Low-Décarie Mar 29 '13 at 21:24
2

EDIT: Sorry, previous example did not work.

This seems like a perfect match for perl. Start with

#!/usr/bin/perl

open PATTERNS, "patterns";
open LOOKUP, "lookup";

my @l = <LOOKUP>;

while (my $re = <PATTERNS>) {
     # chomp outside the loop condition: chomp returns the number of
     # characters removed, so a final line without a newline would
     # otherwise be skipped and end the loop early
     chomp $re;
     print "$re\n" if grep(/$re/, @l);
}

Note that I've switched the meaning of pattern and lookup here. A pattern is a pattern. If you want to print patterns instead of lines, that's fine, but I wouldn't change their names.

Jo So
2

Use a hash table or set (depending on your language) to store the dictionary in all lower-case. For each line, split the line into an array of words based on non-alpha characters. Build a miniature hash table based on those words, converted to lower case, to eliminate duplicates. Iterate through each word in that miniature hash table, verifying whether it exists in your dictionary hash table. If it exists, print the word and the entire line.

Here's an implementation of this in Perl.

#! /usr/bin/perl

my $dictFile=$ARGV[0];
my $srchFile=$ARGV[1];
(-f $dictFile and -f $srchFile) or die "Usage: $0 dictFile srchFile";

# Load dictionary into hash table
my %dict=();
open($df, "<$dictFile") or die "Cannot open $dictFile";
while (<$df>) {
  chomp;
  $dict{lc($_)}=1;
}

# Search file for your dictionary words
open($sf, "<$srchFile") or die "Cannot open $srchFile";
my $lineNo=0;
while ($line=<$sf>) {
  $lineNo++;
  chomp($line);
  my %words=();
  my @sentence=split(/[^a-zA-ZÀ-ÿ0-9]+/, $line);
  foreach $word (@sentence) {
    $words{lc($word)}=1;
  }
  while ( my ($key) = each(%words) ) {
    if ($dict{$key}) {
      print "$lineNo, $key, $line\n";
    }
  }
}

pattern.txt

The sun is shining!
It is a beautiful day!

lookup.txt

Rain
Sun
Cloud
Beautiful
Shining

$ ./deepfind lookup.txt pattern.txt

1, shining, The sun is shining!
1, sun, The sun is shining!
2, beautiful, It is a beautiful day!

EDIT: Based on your comments, here's an alternate approach to defining the set of "words" in the "sentence". This prepares all viable sequences matching the length of any sequence found in the dictionary.

#! /usr/bin/perl
my $dictFile=$ARGV[0];
my $srchFile=$ARGV[1];
(-f $dictFile and -f $srchFile) or die "Usage: $0 dictFile srchFile";
# Load sequence dictionary into hash table
my %dict=();
my %sizes=();
open($df, "<$dictFile") or die "Cannot open $dictFile";
while (<$df>) {
  chomp;
  $dict{lc($_)}=1;
  $sizes{length($_)}=1;
}

# Search file for known sequences
open($sf, "<$srchFile") or die "Cannot open $srchFile";
my $lineNo=0;
while ($line=<$sf>) {
  $lineNo++;
  chomp($line);
  # Populate a hash table with every unique sequence that could be matched
  my %sequences=();
  while ( my ($size) = each(%sizes) ) {
    for (my $i=0; $i <= length($line)-$size; $i++) {
      $sequences{substr($line,$i,$size)}=1;
    }
  }
  # Compare each sequence with the dictionary of sequences.
  while ( my ($sequence) = each(%sequences) ) {
    if ($dict{$sequence}) {
      print "$lineNo, $sequence, $line\n";
    }
  }
}
phatfingers
  • A key benefit to this approach is that it supports a very large dictionary without impacting performance, provided that you have sufficient RAM allocated to hold your dictionary in memory. The complexity of your regex is unchanged-- it's only used to tokenize each line into words. – phatfingers Apr 06 '13 at 15:23
  • That is an amazing solution! Thank you very much! The Shakespeare example was hypothetical; I am now trying to make this work where the sentences cannot be broken down into words (the sentences and words are actually DNA code in my real-life problem). – Etienne Low-Décarie Apr 06 '13 at 20:47
  • The search words must fit into a sentence in which words are not separated at all. – Etienne Low-Décarie Apr 06 '13 at 21:06
  • That's fascinating! Is there much variety in the lengths of the sequences that would be represented in your dictionary? It sounds as though the "words" would be substrings that could start and end at any point. If so, your %words, could just be populated with every possible substring within a line for every valid length (where a valid length is defined by the set of lengths of words found in your dictionary). – phatfingers Apr 06 '13 at 22:21
  • Yep, the words can start anywhere and end anywhere, hence the use of grep. – Etienne Low-Décarie Apr 06 '13 at 22:33
  • I added a revised version which uses no regular expressions. – phatfingers Apr 06 '13 at 22:57
  • Dear phatfingers, you two scripts are great/amazingly fast. I hesitate to ask for more! I was wondering how to integrate if the lookup file also had a definition associated the words which you wanted to obtain in your results (e.g. `lookup_file={ Rain: The condensed moisture of the atmosphere falling visibly in separate drops. Sun: The star around which the earth orbits. ...}` with results `ideal_result_file={ "Sun:The star around which the earth orbits.","The sun is shining!"} – Etienne Low-Décarie Apr 12 '13 at 02:17
  • Please feel free to contact me directly through my profile. It's better for people to not wade through too much cruft in the comments. – phatfingers Apr 12 '13 at 19:53
0

How about using something like a suffix tree or a suffix array? You can find an implementation that has the advantage of sticking to grep-like options here, though I've never used it and can't attest to its efficiency and ease of use.

Suffix trees/arrays need to preprocess the file that will be searched in O(n) to O(n log n) time (n being the length of the lookup file), and the suffix tree/array itself will be several times larger than the original file (by a constant factor), but there are disk-bound algorithms, and they are often used for searching entire human genomes (which are a few GB). Searching for a string in the file then takes only O(m) time, where m is the length of the string, which is much faster than, say, grep (O(n log m)?). Since it seems you'll be searching the same file many times, the investment in the preprocessing step that suffix trees/arrays require might be worth it.
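To illustrate the idea (my addition, a toy sketch and not the linked implementation): build a suffix array of one line by sorting suffix start positions, then binary-search each lookup word. Every occurrence, including overlapping ones and patterns contained in other patterns, falls in one contiguous block of the sorted suffixes.

#!/usr/bin/env python3
# Toy in-memory suffix-array sketch. The naive O(n^2 log n) build below is
# nowhere near the disk-bound constructions needed for GB-size data.

def suffix_array(text):
    # Start positions of all suffixes, sorted by the suffix they begin.
    return sorted(range(len(text)), key=lambda i: text[i:])

def occurrences(text, sa, word):
    # Binary search twice for the block of suffixes that start with `word`.
    m = len(word)
    lo, hi = 0, len(sa)
    while lo < hi:                       # lower bound of the block
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < word:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                       # upper bound of the block
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= word:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

line = "the sun is shining"
sa = suffix_array(line)
for word in ("sun", "is", "s"):
    print(word, occurrences(line, sa, word))
# prints: sun [4], is [8], s [4, 9, 11] -- all overlapping matches found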

user2141650
0

Combining some of the ideas mentioned above, I've come up with a two-pass system using grep and merging the results using join as follows:

patterns

The sun is shining!
It is a beautiful day!

lookup

Rain
Sun
Cloud
Beautiful
Is

script

grep -i -o -n -f lookup patterns > tmp1
grep -i -n -f lookup patterns > tmp2
join -t ':' -o 1.2,2.2 tmp1 tmp2 | sed -e 's/:/,/'

generates the following results

sun,The sun is shining!
is,The sun is shining!
is,It is a beautiful day!
beautiful,It is a beautiful day!

If you want the lookup match and the pattern as comma-delimited output, here's a small Python 2.x script that will do it. It reads the lookups into a buffer and makes one pass through the patterns.

script.py

import sys, re

lookups = [re.compile(l.strip(),re.I) for l in open(sys.argv[1])]
for line in open(sys.argv[2]):
    for lookup in lookups:
        if lookup.search(line):
            print "{0},{1}".format(lookup.pattern, line),

running python script.py lookup patterns yields:

Sun,The sun is shining!
Is,The sun is shining!
Beautiful,It is a beautiful day!
Is,It is a beautiful day!
mtadd
-1

This may not be faster, but you could try:

for i in `cat lookup_file`; 
  do  
    tmpv=`grep -i ${i} pattern_file | xargs echo ${i},`; 
    echo ${tmpv} | sed '/^$/d'; 
done
iamauser