How do Regex engines generate NFAs for regular expressions?

Question

In this wiki page: https://www.owasp.org/index.php/Regular_expression_Denial_of_Service_-_ReDoS#The_problematic_Regex_na.C3.AFve_algorithm

You can see that for the regex ^(a+)+$

the following NFA is created:

My question is: what is the process for creating this NFA from the regex?

Kobi · Accepted Answer · 2018-07-31T07:10:59.870

They don't.

Most regular expression engines do not create an NFA, instead they work by backtracking.
Backtracking is a process of re-visiting every decision you've made, undoing it, and trying another option until something works, which quickly becomes inefficient. Engines are usually not smart enough to eliminate similar paths, and performance of certain patterns can become exponential.

A good explanation can be found in Runaway Regular Expressions: Catastrophic Backtracking.
In the example of ^(a+)+$ with the input aaaaaaaaaaaaX, the engine will try all combinations:

aaaaaaaaaaaa
aaaaaaaaaaa a
aaaaaaaaaa aa
aaaaaaaaaa a a
...
aa aaaaaaaaaa
a aaaaaaaaaaa
...
a a aaaa aaaa aa

... which is a lot.

The OWASP article you've linked to is demonstrating a real denial-of-service attack, but the explanation is purely theoretical - this is not how the attack actually works.

Matching regular expressions could be implemented better: A canonical article on the subject is Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...) which also contains the answer to your question an explains how to build a proper NFA and match patterns efficiently.

One engine that took this approach is RE2, the regular expression library of golang. RE2 does build a NFA, and its performance is much better in general, but there are trade-offs: RE2 does not have common features like lookarounds and back-references.
As an example, RE2 can quickly discover there is no match on ^(((a+)+)+)+$, while other flavors time out.

How do Regex engines generate NFAs for regular expressions?

1 Answers1