Why does Regexp have a timeout method, while in theory they shouldn't?

Question

This is a theoretical Computer Science question (Computation Theory).

I know that RegExps can take a very long time to calculate. However, from Theory of Computation we know that matching with a Regular Expression can be done extremely fast in a few clock cycles.

If RegExps are equivalent to Finite Automata, why RegExps have (or require) a timeout method? Using a DFA, the computation time for matching can be exteremely fast.

By RegExps I mean the Regular Expressions pattern matching classes in major languages; JavaScript, C#, etc.

Are common RegExps ("regex"s) not equivalent to the Regular Expressions in Theory of Automata (i.e. Regular Languages)?

For examples see: How do I timeout Regex operations to prevent hanging in .NET 4.5? and Regex Pattern Catastrophic backtracking .

If Regexp's matching require Backtracking, it means they are not equivalent to Regular Expressions.

If the languages captured by "Regexp"s are not Regular Languages, historically why (out of which necessity) were they extended?

If it that the resulting DFA will require a huge set of states?

Normally matching is fast, but there are some cases with certain regular expressions and long input that can cause it to be very slow. — Barmar, Aug 30 '19 at 18:55
This is not a duplicate of that question. This is a theoretical question. — Sohail Si, Aug 30 '19 at 19:01
Then it doesn't belong on [so], which is about practical programming problems. Theoretical questions belong on [cs.se]. Also, the mathematics of regular languages belongs in [math.se]. — Barmar, Aug 30 '19 at 19:03
RegExp objects have more features than a DFA. All a DFA can do is accept/reject. But RegExp objects also report captures. Recovering captures requires more work. — Raymond Chen, Aug 30 '19 at 19:04
Recovering captures should not take that long one a match (accept/reject) is done. — Sohail Si, Aug 30 '19 at 19:07
I suspect two possible reasons: 1. because the string can be any substring, i.e. when not surrounded `/^` and `$/` . 2. Because the state space (number of the states of the resulting DFA) can easily grow extremely large. — Sohail Si, Aug 30 '19 at 19:08
@Barmar OK but theoreticians may not know about details of common (Perl,etc) RegExp classes. — Sohail Si, Aug 30 '19 at 19:10
This question is related: https://stackoverflow.com/questions/8132412/which-regular-expression-requires-backtracking?rq=1 However, it is not asking this question (about distinction of classes). — Sohail Si, Aug 30 '19 at 19:10
Your question isn't about Perl, you tagged it `nsregularexpression`, which is from the Apple library used for Objective C and Swift. — Barmar, Aug 30 '19 at 19:13
@Barmar Please re-read the question and unmark it. It has nothing to do with the redirected question (the example can be useful but it is asking a different question). — Sohail Si, Aug 30 '19 at 19:14
In fact, I don't think timeouts are common to most regular expression implementations. AFAIK there's no timeout in PHP or Python. — Barmar, Aug 30 '19 at 19:14
But the point remains, the reason you need a timeout is because some regular expressions cause catastrophic backtracking. — Barmar, Aug 30 '19 at 19:14
Also, modern regular expressions like PCRE are not equivalent to DFAs. — Barmar, Aug 30 '19 at 19:15

score 0 · Answer 1 · answered Aug 30 '19 at 22:11

0

Because regex are not equivalent to the Regular Expressions in Theory of Automata.

They are more like cousins with extra functionalities that make them more complex and sometimes (depending on the regex) impossible to execute on long strings.

answered Aug 30 '19 at 22:11

YOGO

531
4
5

This is the sort of answer I need. However, is the above statement correct? – Sohail Si Sep 06 '19 at 11:02
But what particular extra functionalities make them non-RE (as in Theory of Automata)? – Sohail Si Sep 06 '19 at 11:03
Without that, the answer is incomplete. – Sohail Si Sep 06 '19 at 11:15
1

The correct answer take a CS paper or book. There are certain regex that can be converted, but for example (A*)B\1 till I know can not be converted to a Regular Expression in theory of automata. Check this short text : https://www.rexegg.com/regex-vs-regular-expression.html – YOGO Sep 06 '19 at 11:33
https://en.wikipedia.org/wiki/Regular_expression#Patterns_for_non-regular_languages – YOGO Sep 06 '19 at 11:38

Bohemian · Accepted Answer · 2019-08-31T03:50:31.590

0

A good reason is catastrophic backtracking, which explains why matching of some regexes will not return before the heat death of the universe.

edited Aug 31 '19 at 03:50

answered Aug 30 '19 at 23:08

Bohemian

412,405
93
575
722

This link seems to explain the reason that is the answer to my question. However, your answer does not mention it. – Sohail Si Sep 06 '19 at 11:08
I know (and it was mentioned in the question that) it can be computationally expensive. I wanted to know why, i.e. which feature of the `RegExp`s cause this, despite claiming be REs, i.e. despite `RegExp` sharing its name with Regular Expressions in classic Theory of Computation. – Sohail Si Sep 06 '19 at 11:14

score -1 · Answer 3 · answered Aug 30 '19 at 22:54

(out of which necessity) were they extended?

Regexp implementations were extended in systems in which the lack of a regexp feature requires difficult workarounds, such as writing a substantial amount of code in an inexpressive programming language. There is also the grave risk that the code might turn out to be correct, performant and robust against false positive matches.

Why does Regexp have a timeout method, while in theory they shouldn't?

3 Answers3

Linked