
I have a list of regular expressions and a string. I want to know which of the regular expressions (possibly more than one) match the string, if any. The trivial solution would be to try the regular expressions one by one, but this part is performance-critical... Is there a faster solution? Maybe by combining the regular expressions into a single state machine somehow?

Reason: the regular expressions are user-supplied filters that match incoming strings. When a message with a string arrives, the system needs to perform additional user-specified actions for every filter that matches.

There are up to 10,000 regular expressions. They are user-supplied and can be somewhat simplified if necessary (`.*` should be allowed, though :) ). The list of regexes is stored in MongoDB, but I can also prefetch them and perform the search inside Python if necessary.
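For reference, the trivial solution I mentioned looks roughly like this (the patterns are illustrative; the real ones are user-supplied and come from MongoDB):

```python
import re

# Illustrative patterns standing in for the user-supplied filters.
patterns = [".*", "foo.*bar", "[0-9]+"]
compiled = [re.compile(p) for p in patterns]

def matching_indices(s):
    """Return the indices of all regexes that match the string."""
    return [i for i, rx in enumerate(compiled) if rx.search(s)]

print(matching_indices("foo123bar"))  # -> [0, 1, 2]
```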

Similar questions:

  • similar to this question, but the constraints are different: fuzzy matching is not enough, and the number of regular expressions is much lower (up to 10k)
  • similar to this question, but I can pre-process the regexes if necessary
  • similar to this question, but I need to find all matching regular expressions

I would appreciate some help.

johndodo
  • Using `'({})'.format(')|('.join(the_regexes))` isn't acceptable? It would be an easy way to join the possible regexes. Probably the built-in engine would be able to optimize the search; however, if you have to join 10,000 regexes, I'm pretty sure it will take a *long* time to compile. If the regexes don't change often, you can store the compiled regex in the database. Another approach is to join fixed-size groups of regexes and then apply the "naive" algorithm to the derived regexes (see the sketch after these comments). – Bakuriu Jan 21 '14 at 15:50
  • @Bakuriu, correct me if I'm wrong, but wouldn't using that only tell you that the input string matched one of the regexes, but not which specific regular expressions were matched? – gms7777 Jan 21 '14 at 15:55
  • 1
    @gms7777 Yes, you are right. But the problem is *not* understanding which regex matched (this can be done using the groups, eventually using named groups), but that kind of regex cannot help you if you want to find *all* regexes that would match the given text. AFAIK using built-in regexes there's simply no way to do better than the naive algorithm. You'd have to implement your own regex engine that creates a single minimum DFA and keeps track of the names of the regexes associated with the accepting state matched. – Bakuriu Jan 21 '14 at 16:01
  • @Bakuriu: the way I see it, you summed it up pretty nicely. However, implementing my own regex engine is... less than ideal. Is there a library I could use? – johndodo Jan 21 '14 at 16:26
  • I find it incredibly hard to believe 10,000 flat regexes could give any information that could be correlated about a single string. No associations can be made since they're all disconnected and flat. Might as well parse each character. –  Jan 21 '14 at 16:56
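Here is a rough sketch of the grouping idea from the comments, assuming the built-in `re` module and illustrative patterns: a batch's combined alternation matches if and only if at least one of its members matches, so it can serve as a cheap prefilter before the naive per-regex check. (Caveat: user patterns containing backreferences would break when combined, because group numbering shifts.)

```python
import re

patterns = ["foo.*bar", "[0-9]+", "hello", "wor.d"]  # illustrative
BATCH = 2  # fixed group size; tune for your data

batches = []
for start in range(0, len(patterns), BATCH):
    group = patterns[start:start + BATCH]
    # The combined alternation matches iff at least one member matches.
    combined = re.compile("|".join("(?:{})".format(p) for p in group))
    batches.append((start, combined, [re.compile(p) for p in group]))

def matching_indices(s):
    result = []
    for start, combined, members in batches:
        if combined.search(s):  # cheap prefilter for the whole batch
            result.extend(start + i
                          for i, rx in enumerate(members) if rx.search(s))
    return result

print(matching_indices("hello 42"))  # -> [1, 2]
```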

1 Answer


First of all, if you have ~10K regular expressions, you should definitely prefetch them and keep them compiled (using `re.compile`) in memory. As a second step, I recommend thinking about parallelism. Threads in Python won't help much here because of the GIL, so use multiple processes instead. And third, I would think about scaling across a number of servers, using ZeroMQ (or another MQ) for communication.
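A minimal sketch of the first two steps, assuming the standard `re` and `multiprocessing` modules (patterns, pool size and inputs are illustrative):

```python
import re
from multiprocessing import Pool

# Illustrative patterns; in practice, prefetch them from MongoDB once.
PATTERNS = ["foo.*bar", "[0-9]+", "hello"]

compiled = None  # populated once per worker process

def init_worker():
    # Compile once per process instead of once per message.
    global compiled
    compiled = [re.compile(p) for p in PATTERNS]

def matching_indices(s):
    return [i for i, rx in enumerate(compiled) if rx.search(s)]

if __name__ == "__main__":
    with Pool(processes=4, initializer=init_worker) as pool:
        # Each incoming string is matched in a worker process,
        # sidestepping the GIL for CPU-bound matching.
        for hits in pool.imap(matching_indices, ["hello 42", "xyz"]):
            print(hits)  # -> [1, 2] then []
```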

As an interesting scientific task, you could try to build a regexp parser that arranges similar regexps into trees:

A
|-B
|-C
  |-D
  |-E

So if regexp A matches the string, then B, C, D and E match it too, and you can reduce the number of checks that way. IMHO, building this would take a lot of time; using a bunch of servers will be cheaper and faster.
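A minimal sketch of that traversal, assuming you have already built a tree with the invariant described above (a node's match implies a match for every descendant); building such a tree from arbitrary user-supplied regexes is the hard part:

```python
import re

class Node:
    def __init__(self, pattern, children=()):
        self.regex = re.compile(pattern)
        self.children = list(children)

def collect_matches(node, s, acc):
    if node.regex.search(s):
        # Invariant: every descendant matches too, so mark the
        # whole subtree without running any more regexes.
        mark_subtree(node, acc)
    else:
        for child in node.children:
            collect_matches(child, s, acc)

def mark_subtree(node, acc):
    acc.append(node.regex.pattern)
    for child in node.children:
        mark_subtree(child, acc)

# Illustrative tree: each child is strictly more general than its parent.
tree = Node("foo[0-9]+bar", [
    Node("foo.*bar"),
    Node("[0-9]+", [Node("[0-9]"), Node(".*")]),
])

matches = []
collect_matches(tree, "abc123", matches)
print(matches)  # -> ['[0-9]+', '[0-9]', '.*'] after only 3 regex runs
```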

Dmitry Vakhrushev