27

Here on SO people sometimes say something like "you cannot parse X with regular expressions, because X is not a regular language". From my understanding however, modern regular expressions engines can match more than just regular languages in Chomsky's sense. My questions:

given a regular expression engine that supports

  • backreferences
  • lookaround assertions of unlimited width
  • recursion, like (?R)

what kind of languages can it parse? Can it parse any context-free language, and if not, what would be the counterexample?

(To be precise, by "parse" I mean "build a single regular expression that would accept all strings generated by the grammar X and reject all other strings").

Add.: I'm particularly interested to see an example of a context-free language that modern regex engines (Perl, Net, python regex module) would be unable to parse.

georg
  • 211,518
  • 52
  • 313
  • 390
  • The thing with regex is that, it can be very precise or very loose, but hard to make it behave "just right". This is the case with street HTML, where there are invalid open or close tag. – nhahtdh Jul 03 '12 at 07:57
  • 6
    This may be better of on [cs.SE]. By the way, regexps are no grammars; different formalism. – Raphael Jul 03 '12 at 14:29
  • 3
    A recent article on the subject is: [The true power of regular expressions](http://nikic.github.com/2012/06/15/The-true-power-of-regular-expressions.html) - It's an interesting read, and I think it answers your questions with good examples. – Kobi Jul 07 '12 at 12:37
  • @Kobi: Bingo! That post is exactly what I was looking for. Can you make your comment an answer so I can accept it? – georg Jul 08 '12 at 05:00

3 Answers3

20

I recently wrote a rather long article on this topic: The true power of regular expressions.

To summarize:

  • Regular expressions with support for recursive subpattern references can match all context-free languages (e.g a^n b^n).
  • Regular expressions with lookaround assertions and subpattern references can match at least some context-sensitive languages (e.g. ww and a^n b^n c^n).
  • If the assertions have unlimited width (as you say), then all context-sensitive grammars can be matched. I don't know any regex flavor though that does not have fixed-width restrictions on lookbehind (and at the same time supports subpattern references).
  • Regular expressions with backreferences are NP-complete, so any other NP problem can be solved using regular expressions (after applying a polynomial-time transformation).

Some examples:

  • Matching the context-free language {a^n b^n, n>0}:

    /^(a(?1)?b)$/
    # or
    /^ (?: a (?= a* (\1?+ b) ) )+ \1 $/x
    
  • Matching the context-sensitive language {a^n b^n c^n, n>0}:

    /^
        (?=(a(?-1)?b)c)
        a+(b(?-1)?c)
    $/x
    # or
    /^ (?: a (?= a* (\1?+ b) b* (\2?+ c) ) )+ \1 \2 $/x
    
NikiC
  • 100,734
  • 37
  • 191
  • 225
  • 1
    Thanks! This is what I was looking for. [regex](http://pypi.python.org/pypi/regex/) module for python supports lookbehinds with groups and of unlimited length. – georg Jul 08 '12 at 11:20
  • I think that a distinction should be made between accepting (recognizing) and parsing. IMHO, parsing (from Latin pars, part) should mean to resolve into all component parts, i.e., to make them all available (e.g., in a parse tree). This is something that no regex engine (that I know, at least) is able to do - or am I wrong? – Walter Tross Jul 08 '12 at 11:44
  • @WalterTross Yes, you are right. I replaced "parse" with "match" in my answer :) – NikiC Jul 08 '12 at 12:19
  • @thg435 That looks interesting. Has a feature set similar to PCRE, but the variable-width lookbehind assertions are really something I haven't see before. Very nice! – NikiC Jul 08 '12 at 12:23
  • .NET has no limit on backreferences and supports subpattern references. It is the only flavor with this capability that I am aware of. – JDB Feb 06 '13 at 13:38
  • Great, great answer! But you should probably mention that this example need PCRE (http://www.pcre.org/), so one should use pcregrep and it does not work for example with grep or egrep on the command line. – leo Mar 22 '13 at 09:30
  • Accord to [these answers](http://stackoverflow.com/questions/2974210) lookarounds do not add any power to the language. – BlueRaja - Danny Pflughoeft Jun 13 '13 at 21:48
  • 1
    @BlueRaja These answers are written under the assumption that you are adding *only* lookahead to a regular language. They do not cover whether lookahead makes the language more powerful if it already supports subpattern references. I'm pretty sure it does, because the ability to inspect context is what distinguishes context-free and context-sensitive languages. – NikiC Jun 14 '13 at 08:13
  • Ah, ok thanks, +1 to you. Also, to anyone wondering what the difference between a "backreference" `\1` and a "subpattern reference" `(?1)` is *(Most sites seem to mix the two terms)*, see [here](https://developer.gnome.org/glib/2.28/glib-regex-syntax.html#id394261). – BlueRaja - Danny Pflughoeft Jun 14 '13 at 16:32
9

Modern regex engines can certainly parse a bigger set of languages than the regular languages set. So said, none of the four classic Chomsky sets are exactly recognized by regexes. All regular languages are clearly recognized by regexes. There are some classic context-free languages that cannot be recognized by regexes, such as the balanced parenthesis language a^n b^n, unless backreferences with counting are available. However, a regex can parse the language ww which is context-sensitive.

Actually, regular expressions in formal language theory are only lightly related to regexes. Matching regexes with unlimited backreference is NP-Complete in the most general case, so all pattern matching algorithms for powerful enough regexes are exponential, at least in the general case. However most times for most input they are quite fast. It is known that matching context-free languages is at most something faster than n^3, so there are some languages in regexes that are not context-free (like ww) but not all context-free languages can be parsed by regexes. Type 0 languages are non-decidable in general, son regexes don't get there.

So as a not very conclusive conclusion, regexes can parse a broad set of languages that include all regular languages, and some context-free and context-sensitive, but it is not exactly equal to any of those sets. There are other categories of languages, and other taxonomies, where you could find a more precise answer, but no taxonomy that includes context-free languages as a proper subset in a hierarchy of languages can provide a single language exactly recognized by regexes, because regexes only intersect in some part with context-free languages, and neither is a proper subset of the other.

Alejandro Piad
  • 1,827
  • 1
  • 16
  • 23
  • 1
    Thanks for the answer! An engine with recursion can parse `a^n b^n`: `^(|a(?1)b)$`. Can you give an example of CFG that regex cannot handle? Also, what do you mean by `ww`? – georg Jul 05 '12 at 21:07
  • @thg435, by `ww` he probably meant two identical chars, which a modern regex implementation can match like this: `(.)\1` (as you probably are aware of, looking at your regex above :)) – Bart Kiers Jul 06 '12 at 08:32
  • @BartKiers or rather two identical words: `(.+)\1` – Walter Tross Jul 07 '12 at 14:04
  • `ww` means two identical strings, exactly as @WalterTross said. Sorry for the mistake with `a^n b^n`, I will edit to correct it. – Alejandro Piad Jul 10 '12 at 13:55
2

You can read about regexes in An Introduction to Language And Linguistics By Ralph W. Fasold, Jeff Connor-Linton P.477

Chomsky Hierarchy:

Type0 >= Type1 >= Type2 >= Type3

Computational Linguistics mainly features Type 2 & 3 Grammars

Type 3 grammars:

–Include regular expressions and finite state automata (aka, finite state machines)

–The focal point of the rest of this talk

Type 2 grammars:

–Commonly used for natural language parsers

–Used to model syntactic structure in many linguistic theories (often supplemented by other mechanisms)

–We will play a key roll in the next talk on parsing.


most XMLs like Microsoft DGML (Directed Graph Markup Language) that has inter-relational links are samples that Regex are useless.


and this three answers may be useful:

1 - does-lookaround-affect-which-languages-can-be-matched-by-regular-expressions

2 - regular-expressions-arent

3 - where-do-most-regex-implementations-fall-on-the-complexity-scale

Community
  • 1
  • 1
Ria
  • 10,237
  • 3
  • 33
  • 60
  • XML or [Microsoft DGML](http://en.wikipedia.org/wiki/DGML) (Directed Graph Markup Language) are samples that Regex are useless. – Ria Jul 06 '12 at 10:14