19

I'm wondering why there have to be so many regular expression dialects. Why do so many languages seem bent on writing their own rather than reusing a tried-and-true dialect?

Like these.

I mean, I understand that some of these do have very different backends. But shouldn't that be abstracted from the programmer?

I'm more referring to the odd but small differences, like where parentheses have to be escaped in one language, but are literals in another. Or where meta-characters mean somewhat different things.

Is there any particular reason we can't have some sort of universal dialect for regular expressions? I would think it would make things much easier for programmers who have to work in multiple languages.

BigBeagle
  • I dunno, maybe the developers of each dialect thought theirs was better than all the others, or maybe it suited a specific need at the time that the others didn't yet support, and then when others decided to implement those features they thought they could do it better. It's not as though there's a Central Regex Governing Committee. – FrustratedWithFormsDesigner Feb 19 '10 at 16:49
  • 2
    Wouldn't that be what Posix is supposed to be :-)? – BigBeagle Feb 19 '10 at 16:54
  • 1
    http://stackoverflow.com/a/11857890/874188 has a bit of historical background if that's what you are after. – tripleee Aug 26 '16 at 02:57

4 Answers

14

Because regular expressions only have three operations:

  • Concatenation
  • Union |
  • Kleene closure *

Everything else is an extension or syntactic sugar, and so has no common source to standardize against. Things like capturing groups, backreferences, character classes, cardinality operations, etc., are all additions to the original definition of regular expressions.

Some of these extensions make "regular expressions" no longer regular at all. They are able to decide non-regular languages because of these extras, but we still call them regular expressions regardless.
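To make that concrete, here is a minimal sketch using Python's `re` module (my choice of engine for illustration, not something the answer depends on). The language "n a's, a b, then the same n a's" is a standard example of a non-regular language, yet a backreference recognizes it. The names `mirror` and `accepted` are made up for this example.

```python
import re

# (a+) captures a run of a's; \1 demands exactly the same run again.
# No finite automaton can count n, so this goes beyond regular languages.
mirror = re.compile(r"(a+)b\1")

candidates = ["aba", "aabaa", "aaabaaa", "aabaaa", "aaba", "ab"]

# Keep only the strings that the pattern matches in full.
accepted = [s for s in candidates if mirror.fullmatch(s)]
```

Only the strings with equally long runs of `a` on both sides of the `b` end up in `accepted`.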

As people add more extensions, they often borrow from other common variants of regular expressions. That's why nearly every dialect uses X+ to mean "one or more Xs", which itself is just a shortcut for writing XX*.
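A rough way to see that such shortcuts add no expressive power is to compare the sugared and desugared forms on a handful of strings. This is a sketch in Python's `re` dialect; the helper `same_matches` and the sample list are hypothetical, and checking a few samples is of course not a proof of equivalence.

```python
import re

def same_matches(pattern_a, pattern_b, samples):
    """Return True if both patterns accept exactly the same sample strings."""
    return all(
        bool(re.fullmatch(pattern_a, s)) == bool(re.fullmatch(pattern_b, s))
        for s in samples
    )

samples = ["", "a", "aa", "aaa", "b", "ab", "ba"]

# X+ written with concatenation and Kleene closure only.
plus_is_sugar = same_matches(r"a+", r"aa*", samples)

# X? written as a union with the empty string.
optional_is_sugar = same_matches(r"a?", r"(?:a|)", samples)

# A character class written as a plain union.
class_is_sugar = same_matches(r"[ab]", r"(?:a|b)", samples)
```

All three comparisons come out true for these samples, which is what you would expect if the sugared forms are just abbreviations.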

But when new features get added, there's no basis for standardization, so someone has to make something up. If more than one group of designers comes up with similar ideas at around the same time, they'll end up with different dialects.

Welbog
  • "No basis for standardization" is not true; POSIX specifies many of these facilities, though, unfortunately, it specifies two different variants, and on top of that, many implementations deviate from or extend what POSIX codifies. The current _de facto_ standard seems to be roughly Perl/PCRE, which extends POSIX ERE, though again, there is no formal standard, and the available implementations differ in some details. – tripleee May 08 '23 at 05:22
  • @tripleee: The answer does not say that specifying dialects is impossible (or that it has never been done); it explains why all these different dialects are specified differently (all of them are formal at least, so they are specified, even if only implementation-specified). IMHO this is what was asked for, and I found the answer quite insightful as it makes sense. And yes, POSIX has specifications for BRE and ERE. But POSIX sed does not support them both, only one of them (by the specs), while POSIX grep does (but probably _only_ by the specs). Or the inverse? – hakre Jul 04 '23 at 18:01
  • @hakre I'm not sure I understand your comment. There are waves of innovation where you have to deviate from existing specs to come up with something new, and periods of consolidation where it makes sense to try to articulate common ground such as a new standard before the next iteration of evolution. I don't think we are there yet for the PCRE-derived dialects. `sed` landed in a weird kind of limbo evolutionarily when POSIX only specified BRE semantics for it (though _de facto_ most `sed` implementations support ERE too with `-E` or `-r`, and some support Perl shorthands like `\s` and `\d`). – tripleee Jul 05 '23 at 13:29
  • 1
    okay, let me try, because I also must admit I had problems to formulate it as this is very abstract. given - as outlined - there are only those three constructs necessary to fully describe regular expressions, nothing more is needed. however as powerful as that is, it is easy to add more that brings more to the table. call it luxury or comfort. we want that as the tooling is improved as well. but note this creates dialects of various sorts as many are doing it. but no-one is scientifically required to follow one way here. therefore dialects with little standardization. no blame btw.. – hakre Jul 06 '23 at 01:03
3

For the same reason we have so many languages. Some people will be trying to improve their tools and at the same time others will be resistant to change. C/C++/Java/C# anyone?

Kelly S. French
  • 1
    When someone says "C" I know they are not saying "Java". When someone says "This editor understands regex", it's like saying "This computer understands programming language." It would be helpful to know which programming language or languages. Regex has become, for better or worse, a generic term. – Ubuntourist Feb 10 '20 at 18:33
  • 1
    @Ubuntourist I'm in agreement with you, the difficulty would be in coming up with commonly accepted labels for specific variants and rule sets which itself would be cumbersome, hence my answer about it devolving into a 'tower of babble' situation. – Kelly S. French Feb 11 '20 at 04:18
2

The "I made it better" syndrome of programming produces all these things. It's the same with standards. People try to make the next "best" standard to replace all the others and it just becomes something else we all have to learn/design for.

wheaties
2

I think a good part of this is the question of who would be responsible for setting and maintaining the standard syntax and ensuring compatibility across differing environments.

Also, if a regex must itself be parsed inside an interpreter/compiler with its own unique rules regarding string manipulation, then this can cause a need for doing things differently with regard to escapes and literals.
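As an illustration of that double parsing, here is a small sketch in Python (chosen only as an example host language). Python's ordinary string literals process backslash escapes before the `re` module ever sees the pattern, while raw strings pass them through. The variable names are made up for this example.

```python
import re

text = r"C:\temp"  # seven characters: C : \ t e m p

# Raw string: both backslashes reach the regex engine, which reads them
# as "one literal backslash".
raw_pattern = r"\\temp"

# Ordinary string: the host language halves the backslashes first, so
# four are needed to deliver two to the regex engine.
plain_pattern = "\\\\temp"

same_pattern = raw_pattern == plain_pattern
found = re.search(raw_pattern, text)

# "\b" in an ordinary string is a backspace character, not a word boundary,
# so the two patterns below are different regexes.
boundary_hit = re.search(r"\bcat\b", "a cat sat")
backspace_hit = re.search("\bcat\b", "a cat sat")
```

Here `boundary_hit` is a match and `backspace_hit` is `None`. The same regex therefore ends up written differently depending on the quoting rules of the language that embeds it.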

A good strategy is to take time to understand how regex algorithms themselves function at a more abstract level; then implementing any particular syntax becomes much easier. It's similar to how each programming language has its own syntax for constructs like conditional statements and loops, but they all still accomplish the same abstract task.

tripleee
hqrsie