1

In this wiki page: https://www.owasp.org/index.php/Regular_expression_Denial_of_Service_-_ReDoS#The_problematic_Regex_na.C3.AFve_algorithm

You can see that for the regex ^(a+)+$

the following NFA is created:

enter image description here

My question is: what is the process for creating this NFA from the regex?

Ogen
  • 6,499
  • 7
  • 58
  • 124

1 Answers1

3

They don't.

Most regular expression engines do not create an NFA, instead they work by backtracking.
Backtracking is a process of re-visiting every decision you've made, undoing it, and trying another option until something works, which quickly becomes inefficient. Engines are usually not smart enough to eliminate similar paths, and performance of certain patterns can become exponential.

A good explanation can be found in Runaway Regular Expressions: Catastrophic Backtracking.
In the example of ^(a+)+$ with the input aaaaaaaaaaaaX, the engine will try all combinations:

  • aaaaaaaaaaaa
  • aaaaaaaaaaa a
  • aaaaaaaaaa aa
  • aaaaaaaaaa a a
  • ...
  • aa aaaaaaaaaa
  • a aaaaaaaaaaa
  • ...
  • a a aaaa aaaa aa

... which is a lot.

The OWASP article you've linked to is demonstrating a real denial-of-service attack, but the explanation is purely theoretical - this is not how the attack actually works.

Matching regular expressions could be implemented better: A canonical article on the subject is Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...) which also contains the answer to your question an explains how to build a proper NFA and match patterns efficiently.

One engine that took this approach is RE2, the regular expression library of . RE2 does build a NFA, and its performance is much better in general, but there are trade-offs: RE2 does not have common features like lookarounds and back-references.
As an example, RE2 can quickly discover there is no match on ^(((a+)+)+)+$, while other flavors time out.

See also:

Kobi
  • 135,331
  • 41
  • 252
  • 292
  • I'll be leaving for Pizza with the family, I'll come back in a few hours if there are any questions. Also, I have 99,903 reputation points. Just saying. – Kobi Mar 27 '18 at 15:49
  • I think my title was misleading sorry. It wasn't as specific as the regex engine creating the NFA. I just wanted to know how NFAs are generated for regular expressions. But ill accept this answer because it answers my question anyway. – Ogen Mar 27 '18 at 20:55