
I'm trying to implement a regular expression matcher based on the backtracking approach sketched in Exploring Ruby’s Regular Expression Algorithm. The compiled regex is translated into an array of virtual machine commands; for backtracking, the current command and input string indices, as well as capturing-group information, are maintained on a stack.

In Regular Expression Matching: the Virtual Machine Approach, Cox gives more detailed information about how to compile certain regex components into VM commands, though the implementations discussed there differ somewhat. Based on those articles, my implementation works quite well for the standard grouping, character class, and repetition components.

Now I would like to see what extension and optimization options exist for this type of implementation. Cox's article contains a lot of useful information on the DFA/NFA approach, but his information about extensions and optimization techniques for the backtracking approach is rather sparse.

For example, about backreferences he states:

> Backreferences are trivial in backtracking implementations.

and gives an idea for the DFA approach. But it's not clear to me how this can be done "trivially" with the VM approach. When the backreference command is reached, you'd have to compile the previously matched string from the corresponding group into another list of VM commands and somehow incorporate those commands into the current VM, or maintain a second VM and temporarily switch execution to it.
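
For what it's worth, one common reading of "trivial" is that no recompilation is needed at all: a dedicated backreference opcode simply compares the input at the current position against the text the group captured, using the capture spans the engine already maintains. A minimal sketch of such an opcode handler (the function name and the capture representation are my own assumptions, not from Cox's article):

```python
def run_backref(input_str, pos, captures, group):
    """Execute a hypothetical BACKREF opcode: compare the input at `pos`
    against the text previously captured by `group`. `captures` maps a
    group number to its (start, end) span in the input string."""
    start, end = captures[group]
    captured = input_str[start:end]
    if input_str.startswith(captured, pos):
        return pos + len(captured)  # advance past the matched copy
    return None  # failure: the engine pops the backtracking stack as usual

# e.g. matching the backreference in (\w+) \1 against "hey hey":
# group 1 captured "hey" at span (0, 3), and pos is 4 after the space
assert run_backref("hey hey", 4, {1: (0, 3)}, 1) == 7
```

Since the captured span can change as the engine backtracks, the comparison happens at match time rather than compile time, which is why no second VM or generated command list is required.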

He also mentions a possible optimization for repetitions using look-aheads, but doesn't elaborate on how it would work. It seems to me this could be used to reduce the number of items on the backtracking stack.
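
One plausible reading of that hint (my interpretation, not Cox's exact scheme): before pushing a backtrack entry for a greedy repetition, peek at the next character and only save positions where the rest of the pattern could actually continue. A sketch for the simple shape "character class repeated, followed by a known literal character":

```python
def match_star_then_literal(input_str, pos, star_set, follow_char):
    """Match `[star_set]*` followed by the literal `follow_char`, pushing a
    backtrack entry only at positions where `follow_char` could actually
    match next. A naive engine would push one entry per repetition step."""
    stack = []
    while pos < len(input_str) and input_str[pos] in star_set:
        # one-character look-ahead: only this position is worth revisiting
        if input_str[pos] == follow_char:
            stack.append(pos)
        pos += 1
    # try the longest match first, then fall back to the saved positions
    for p in [pos] + list(reversed(stack)):
        if p < len(input_str) and input_str[p] == follow_char:
            return p + 1  # position after the literal
    return None
```

For a pattern like `[a-x]*y` on input that contains no `y` inside the repetition, the stack stays empty, so a failure is detected without unwinding one repetition step at a time.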

tl;dr What general optimization techniques exist for VM-based backtracking regex implementations, and how do they work? Note that I'm not looking for optimizations specific to a particular programming language, but for general techniques for this type of regex implementation.


Edit: As mentioned in the first link, the Oniguruma library implements a regex matcher with exactly this stack-based backtracking approach. Perhaps someone can explain the optimizations done by that library which can be generalized to other implementations. Unfortunately, the library doesn't seem to provide any documentation on the source code, and the code itself is sparsely commented.


Edit 2: When reading about parsing expression grammars (PEGs), I stumbled upon a paper on a Lua PEG implementation which makes use of a similar VM-based approach. The paper mentions several optimization options to reduce the number of executed VM commands and an unnecessary growth of the backtracking stack.

siracusa
  • I'm not sure if perl's regexp interpreter is VM-based or not, but it's certainly a hybrid NFA/DFA approach with myriad optimizations. See [perlreguts](https://perldoc.perl.org/perlreguts.html) for details. I'm particularly fond of the [peep-hole optimisation](https://perldoc.perl.org/perlreguts.html#Peep-hole-Optimisation-and-Analysis), which uses [Boyer-Moore](https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm) for faster search on the longest static string and then scans around it (I conceived that independently, only to have a coworker send me that link!) – Adam Katz Aug 07 '19 at 17:27
  • @AdamKatz Thanks for that link. It already has some useful hints on possible optimizations in it (like the TRIE construction for many fixed-string alternatives, which is very relevant for my use case). I didn't find the time to study `regcomp.c`, but at a glance it includes a lot of explanatory comments. – siracusa Aug 08 '19 at 06:46
  • Interesting. Are you working on a new RE implementation? I think the golang re2 module was modeled after Cox's recommendations, so it might be worth looking at the re2 implementation. – wp78de Aug 09 '19 at 23:21
  • @wp78de Yes, but in a language that probably isn't of much interest for the general public (TeX/LaTeX). The only available implementation is quite heavy, very slow for larger regexes or when used in a loop, and has some other flaws. So I'm trying to write a more lightweight version which however has some limitations on the type of input. As (La)TeX is an interpreted language, doing as much optimization as possible is crucial for speed. – siracusa Aug 09 '19 at 23:27

1 Answer


I suggest you watch the full lecture, as it is very interesting, but here is an outline:

  • Complexity explosion in backtracking. This happens when the pattern contains ambiguity ([a-x]*[a-x0-9]*z in the video, as an example), so the engine has to backtrack and test all alternatives until it becomes certain that the pattern did (or didn't) match.

    Matching can take up to O(Nᵖ) time, where p is a "measure of ambiguity" of the pattern. To get O(pN), we need to avoid evaluating equivalent threads again and again.

    ...

    Solution: advance all threads by one character at each step; this "breadth-first" execution results in linear complexity.

  • Tricks to save every bit of performance

  • Inside std::regex
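
The "breadth-first" execution described above is essentially Thompson/Pike-style lockstep simulation: all live threads advance over each input character together, and threads sitting on the same VM instruction are merged, which bounds the work per character by the program size and removes the backtracking stack entirely. A rough sketch using the ambiguous pattern from the video (the instruction encoding is my own simplification, not the lecturer's):

```python
import string

def lockstep_match(prog, text):
    """Breadth-first VM simulation: every live thread advances over each
    input character in lockstep, and threads are deduplicated by program
    counter, so the live set never exceeds the program size."""
    def add(threads, visited, pc):
        # follow epsilon transitions (split/jmp), deduplicating by pc
        if pc in visited:
            return
        visited.add(pc)
        op = prog[pc]
        if op[0] == "split":
            add(threads, visited, op[1])
            add(threads, visited, op[2])
        elif op[0] == "jmp":
            add(threads, visited, op[1])
        else:  # "char" or "match"
            threads.add(pc)

    current = set()
    add(current, set(), 0)
    for ch in text:
        nxt, visited = set(), set()
        for pc in current:
            op = prog[pc]
            if op[0] == "char" and ch in op[1]:
                add(nxt, visited, pc + 1)
        current = nxt
    return any(prog[pc][0] == "match" for pc in current)

AX = set(string.ascii_lowercase[:24])   # [a-x]
AX09 = AX | set(string.digits)          # [a-x0-9]
# [a-x]*[a-x0-9]*z compiled by hand into split/char/jmp/match opcodes:
prog = [
    ("split", 1, 3), ("char", AX),   ("jmp", 0),   # [a-x]*
    ("split", 4, 6), ("char", AX09), ("jmp", 3),   # [a-x0-9]*
    ("char", {"z"}), ("match",),                   # z
]
```

With this program, `lockstep_match(prog, "abc123z")` succeeds after exactly one pass over the input, while a depth-first backtracker could revisit the same positions many times on the two overlapping character classes.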

Hope this helps!

P.S. Lecturer's repository

  • Thanks for the video. I haven't had the time to look into all the methods he mentions in the video yet; e.g. Shift-Or sounds quite interesting too. Could you elaborate on how exactly the optimization you mentioned works? I understand from the video that they change the execution order of the threads, but shouldn't that give the same number of backtracks, just in a different order? – siracusa Aug 12 '19 at 22:43
  • @siracusa In this approach we eliminate repetitions of the same operations. Think of it as an execution tree: what is closer to the top is "approved", and you don't need to re-evaluate those operations. So you can just pick up the data you had in previous states and use it. You can see the difference at this [timecode](https://youtu.be/RVri2-VJvxI?t=596). In conclusion: with the approach he suggests, we can spawn 3 threads: the first considering `[a-x]*` succeeded, the second considering `[a-x0-9]*` succeeded but the first failed, and the third considering that only `z` will succeed. – nonForgivingJesus Aug 13 '19 at 07:06