4

I want to efficiently match thousands of regexps against GBs of text, knowing that most of these regexps will be fairly simple, like:

\bBarack\s(Hussein\s)?Obama\b
\b(John|J\.)\sBoehner\b

etc.

My current idea is to extract from each regexp some kind of longest literal substring, use Aho-Corasick to match these substrings and so eliminate most of the regexps, and then run only the remaining regexps, combined. Can anyone think of something better?
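For concreteness, here is a minimal sketch of that two-stage idea in Python. It assumes the third-party pyahocorasick package for the literal-matching stage; the literal-extraction step shown is deliberately naive and only safe for simple patterns like the ones above, and patterns that yield no safe literal are simply run on every block of text.

import re
import ahocorasick   # third-party "pyahocorasick" package, assumed to be installed

patterns = [r'\bBarack\s(Hussein\s)?Obama\b',
            r'\b(John|J\.)\sBoehner\b']
compiled = [re.compile(p) for p in patterns]

def required_literal(pattern):
    """Very naive extraction of a literal that must occur in every match."""
    s = re.sub(r'\\.', ' ', pattern)        # drop escape sequences (\b, \s, \., ...)
    s = re.sub(r'\([^)]*\)[?*]?', ' ', s)   # drop groups (may be optional or alternated)
    s = re.sub(r'.[?*+{]', ' ', s)          # drop characters carrying a quantifier
    runs = re.findall(r'[A-Za-z]{3,}', s)
    return max(runs, key=len) if runs else None

# Stage 1: one Aho-Corasick automaton over all the extracted literals.
automaton = ahocorasick.Automaton()
always_run = []                             # patterns with no safe literal
for idx, pat in enumerate(patterns):
    lit = required_literal(pat)
    if lit is None:
        always_run.append(idx)
    else:
        automaton.add_word(lit, idx)
automaton.make_automaton()

def scan(block):
    """Stage 2: run the full regex only for patterns whose literal occurs in the block."""
    candidates = {idx for _end, idx in automaton.iter(block)}
    candidates.update(always_run)
    for idx in sorted(candidates):
        for m in compiled[idx].finditer(block):
            yield patterns[idx], m.span(), m.group(0)

for hit in scan("Senators met Barack Obama and J. Boehner today."):
    print(hit)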

jp.
  • 106
  • 5
  • Is 'thousands of regexes' in the realm where you could keep the internal state of thousands of regex engines in memory at once? – rsaxvc Jan 02 '12 at 05:55
  • 1
    Have you tried `egrep -f filewithpatterns filetosearchthrough`? – Marcin Jan 02 '12 at 08:36
  • @rsaxvc I am not too concerned about memory consumption and could spend up to 1GB of RAM on it. – jp. Jan 02 '12 at 13:35
  • @Marcin I believe that egrep -f is just Aho-Corasick; it only works for plain strings. – jp. Jan 02 '12 at 13:36
  • @fge GBs = gigabytes. The point is that I could afford to spend a lot of time compiling the expressions, since that is likely to be negligible compared to matching all that text. – jp. Jan 02 '12 at 13:38
  • @jp. compiling a regex is not a "free" process, so I'm not sure what you mean. Also, are you free to use the tools you want? Say, perl, for instance? – fge Jan 02 '12 at 13:41
  • @jp, do you want to find only the longest regex possible, or all regexes that match multiple (all?) strings in your data? I'm not sure how to go about generating all regexes that match your data, but if you could do that, I would then convert them to DFAs, as there are simple rules for 'folding' a DFA down to its simplest possible representation, which would minimize both the complexity of each regex and the final number of regexes you get. – Marcin Jan 02 '12 at 17:57

4 Answers

2

You can use (f)lex to generate a DFA, which recognises all the literals in parallel. This might get tricky if there are too many wildcards present, but it works for up to about 100 literals (for a 4-letter alphabet; probably more for natural text). You may want to suppress the default action (ECHO) and only print the line and column numbers of the matches.

[ I assume grep -F does about the same ]

%{
/* C code to be copied verbatim */
#include <stdio.h>

/* flex maintains yylineno for us (see %option below); the column we track
   ourselves: YY_USER_ACTION runs before every rule action. */
static int yycolumn = 1;
#define YY_USER_ACTION yycolumn += yyleng;
%}

%option yylineno noyywrap

%%

"TTGATTCACCAGCGCGTATTGTC"  { printf("@%d: %d: %s\n", yylineno, yycolumn - yyleng, "OMG! the TTGA pattern again"); }

"AGGTATCTGCTTCAATCAGCG"    { printf("@%d: %d: %s\n", yylineno, yycolumn - yyleng, "WTF?!"); }

    /* ... more literal rules ... */

[bd-fh-su-z]+   { ; }              /* swallow runs of letters that cannot occur in any of the literals */

\n              { yycolumn = 1; }  /* reset the column counter on every newline */

[ \t\r]+        { ; }

.               { ; }

%%

int main(void)
{
    /* Call the lexer on stdin, then quit. */
    yylex();
    return 0;
}

A script like the one above can be generated from text input with awk or any other scripting language.
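For instance, a small Python script (standing in here for awk; the file and script names are just examples) that emits such a lexer from a file containing one literal per line:

#!/usr/bin/env python3
"""Emit a flex source file like the one above from a plain-text file of
literals, one per line."""
import sys

HEADER = r"""%{
#include <stdio.h>
static int yycolumn = 1;
#define YY_USER_ACTION yycolumn += yyleng;
%}

%option yylineno noyywrap

%%
"""

FOOTER = r"""
[ \t\r]+        { ; }
\n              { yycolumn = 1; }
.               { ; }

%%

int main(void) { yylex(); return 0; }
"""

def escape(lit):
    # Escape the characters that are special inside a double-quoted flex pattern.
    return lit.replace('\\', '\\\\').replace('"', '\\"')

def main(path):
    rules = []
    for line in open(path):
        lit = line.strip()
        if lit:
            rules.append('"%s"  { printf("@%%d: %%d: %%s\\n", yylineno, yycolumn - yyleng, "%s"); }\n'
                         % (escape(lit), escape(lit)))
    sys.stdout.write(HEADER + ''.join(rules) + FOOTER)

if __name__ == '__main__':
    main(sys.argv[1])       # e.g.: python3 genlex.py literals.txt > matcher.l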

wildplasser
  • 43,142
  • 8
  • 66
  • 109
0

A slightly smarter implementation than running every regex on every file:

For each regex:
    load the regex into a regex engine
    add the engine to a list of engines
For each byte in the file:
    feed the byte to every regex engine
    print results if any engine reports a match

But I don't know of any programs that do this already; you'd have to code it yourself. This also implies you have the RAM to keep the regex state around, and that you don't have any evil regexes.
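A rough sketch of the idea in Python, restricted to plain literal patterns so that each "engine" is just a hand-rolled state machine (the set of prefix lengths matched so far), since the standard re module does not expose an incremental byte-at-a-time interface:

class LiteralEngine:
    """Tiny stand-in for an incremental regex engine: matches one literal,
    fed one byte at a time, keeping as state the set of prefix lengths
    matched so far."""
    def __init__(self, pattern):
        self.pattern = pattern
        self.active = set()

    def feed(self, byte, offset):
        """Advance on one input byte; return a (start, end) match or None."""
        nxt = set()
        for pos in self.active | {0}:
            if self.pattern[pos] == byte:
                nxt.add(pos + 1)
        result = None
        if len(self.pattern) in nxt:
            nxt.discard(len(self.pattern))
            result = (offset - len(self.pattern) + 1, offset + 1)
        self.active = nxt
        return result

engines = [LiteralEngine(b"Obama"), LiteralEngine(b"Boehner")]

data = b"Barack Obama met John Boehner."
for offset, byte in enumerate(data):      # iterating over bytes yields ints
    for engine in engines:
        hit = engine.feed(byte, offset)
        if hit:
            print(engine.pattern, "at", hit)

For real regexes each engine would hold NFA/DFA state instead of prefix lengths, but the outer feed-every-byte-to-every-engine loop stays the same.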

rsaxvc
  • 1,675
  • 13
  • 20
0

I'm not sure if you'd blow some regex size limit, but you could just OR them all together into one giant regex:

((\bBarack\s(Hussein\s)?Obama\b)|(\b(John|J\.)\sBoehner\b)|(etc)|(etc))

If you hit some limit, you could do this with chunks of 100 at a time, or however many you can manage.
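For example, in Python (the chunk size of 100 is arbitrary, and each pattern is wrapped in a non-capturing group so the alternation cannot change its meaning):

import re

patterns = [r'\bBarack\s(Hussein\s)?Obama\b',
            r'\b(John|J\.)\sBoehner\b']      # ... thousands more

CHUNK = 100   # shrink this if compilation hits an engine limit

# Compile the patterns in batches, each batch as one big alternation.
combined = [re.compile('|'.join('(?:%s)' % p for p in patterns[i:i + CHUNK]))
            for i in range(0, len(patterns), CHUNK)]

text = "Barack Obama spoke with J. Boehner."
for big in combined:
    for m in big.finditer(text):
        print(m.span(), m.group(0))

Note that with a plain alternation you no longer know which original pattern produced a given match; if you need that, keep the individual compiled regexes around for a second, confirming pass.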

Bohemian
  • 412,405
  • 93
  • 575
  • 722
  • For a few thousand regexes this can be more than 100x slower than doing Aho-Corasick. – jp. Jan 03 '12 at 02:29
0

If you need a really fast implementation for some specific case, you can implement the Aho-Corasick algorithm (or a suffix tree) yourself. But in most cases, combining all your regexes into a single regex, as recommended earlier, will not be bad either.

Sasha
  • 8,537
  • 4
  • 49
  • 76