3

I want to match URLs with a common prefix (/files) but only two specific directories - images and videos - under it. I can think of two regular expressions for this:

/files/(images/.*|videos/.*)

and

/files/(image|video)s/.*

I have two questions here:

  1. Which one is better from a performance perspective? My guess is the second one since its DFA will have a smaller number of states.
  2. Is there a general purpose programming language whose built-in regex compiler will reduce the given regex into the minimal DFA?

The performance matters to me since I'll be using it to match billions of string. So, any tiny bit of improvement also matters to me.

Chandranshu
  • 3,669
  • 3
  • 20
  • 37
  • @nhahtdh - Thanks for mentioning this point. I am aware of the difference and the fact. I only mentioned DFA since a NFA is at least theoretically equivalent to some DFA. – Chandranshu Feb 06 '14 at 10:34

3 Answers3

2

Which one is better from a performance perspective? My guess is the second one since its DFA will have a smaller number of states.

Both of the expressions have the same number of states in the minimal DFA, and their DFAs matches the same "language" (in theoretical sense).

Regardless of the number of states in the DFA, the performance is theoretically the same, since you will traverse through the states deterministically as you feed the input to the automaton.

In practice, there might be difference due to cache miss, which may happen more often when there are more states. However, unless you are working on the regex engine, I can't think of any good reason to spend time doing black-box optimization.

Is there a general purpose programming language whose built-in regex compiler will reduce the given regex into the minimal DFA?

Go (re2) and Haskell has engines that converts regex into DFA. I don't know whether the DFA is minimal or not, though.

Since POSIX ERE doesn't support back-reference (back-reference is different from reference to capturing group in replacement), it is possible to write an engine for POSIX ERE that runs on DFA or efficient NFA-simulation. However, since the standard doesn't mandate such implementation as long as the result is correct (match the left-most longest string), the implementation can exhaustively search for all strings that matches the regex on an NFA backtracking engine.

However, at least GNU egrep seems to implements POSIX ERE with DFA (based on the file name dfa.c).


For your information, there are 3 approaches to implement a regular expression matching:

  • DFA
  • NFA-simulation algorithm
  • NFA backtracking

For more details, this article (cited in this question) explains:

  • For (theoretical) regular expression, there exists efficient NFA simulation algorithm with sub-match tracking (capturing group).
  • Why backtracking engine is so prominent (e.g. Java, Python, Perl2, ...)
    2: Perl implements a backtracking engine with memoization.
  • Backtracking engine may take exponential time on the length of input, while Thompson's NFA simulation algorithm takes O(mn) time where m is the length of the regular expression and n is the length of input.
  • The most efficient algorithm to match (ir)regular expression with backreference currently known is backtracking approach. Therefore, some engines decide not to support backreference to increase efficiency in matching.
  • Backtracking engines (even with memoization) are slower than Thompson's NFA simulation algorithm.

By the way, re2 engine (mentioned above) includes implementation of DFA-based and NFA-based (efficient simulation) matching algorithm.

Community
  • 1
  • 1
nhahtdh
  • 55,989
  • 15
  • 126
  • 162
0

Python 2.7 says:

import timeit
once = 'import re; m="/files/images/test"'
num = 1000000
print timeit.timeit(stmt='re.findall(r"/files/(images/.*|videos/.*)", m)', setup=once, number=num)
-> 1.5884420871734619
print timeit.timeit(stmt='re.findall(r"/files/(image|video)s/.*", m)', setup=once, number=num)
-> 1.5990869998931885

this uses the regex 1 million times and after running both lines multiple times both have the same speed.

Python might cache the compiled regex...

I tested it with

  • /files/images/test
  • /files/videos/test
  • /files/viddeos/test

your first version (/files/(images/.*|videos/.*)) runs a bit (0.1 second) faster in my tests

  • Thanks for your test cases! Would you happen to know why the first one ran faster? Does python internall reduces it to the minimal DFA? – Chandranshu Feb 06 '14 at 08:04
0

My gut (not too helpful, I know) would say the second option is the better one. But consider this:

  • Why do you want to add .*?
  • Can you anchor the regular expression to the start of the line?
  • Do you need to capture the group?

If those apply, I'd go for this:

^/files/(?:image|video)s/
mrjink
  • 1,131
  • 1
  • 17
  • 28
  • I don't have control over where this expression is used. They require the .* to be present. The regex didn't need to be anchored to the start of the line). And I don't need to capture the group. And I don't know how this answers my question :) You could've asked these clarifications as comments. – Chandranshu Feb 06 '14 at 10:27
  • The answer was "my gut tells me the second one is better". The rest could've been a comment, sure. :) – mrjink Feb 06 '14 at 10:46