15

Looking at some of the regex questions commonly asked on SO, it seems to me there are a number of areas where the traditional regex syntax falls short of the kind of tasks people want it to do nowadays. For instance:

  • I want to match a number between 1 and 31, how do I do that ?

The usual answer is don't use regex for this, use normal conditional comparisons. That's fine if you've got just the number by itself, but not so great when you want to match the number as part of a longer string. Why can't we write something like \d{1~31}, and either modify the regex to do some form of counting or have the regex engine internally translate it into [1-9]|[12]\d|3[01] ?

  • How do I match an even/odd number of occurrences of a specific string ?

This results in a very messy regex, it would be great to be able to just do (mytext){Odd}.

  • How do I parse XML with regex ?

We all know that's a bad idea, but this and similar tasks would be easier if the [^ ] operator wasn't limited to just a single character. It'd be nice to be able to do <name>(.*)[^(</name>)]

  • How do I validate an email with regex ?

Very commonly done and yet very complex to do correctly with regex. It'd save everyone having to re-invent the wheel if a syntax like {IsEmail} could be used instead.


I'm sure there are others that would be useful too. I don't know enough about regex internals to know how easy these would be to implement, or if it would even be possible. Implementing some form of counting (to solve the first two problems) may mean it's not technically a 'regular expression' anymore, but it sure would be useful.

Is a 'regex 2.0' syntax desirable, technically possible, and is there anyone working on anything like this ?

Michael Low
  • 7
    Regexes are the **lowest common denominator** of text matching, and they're good at that. Anything more specific would be better handled by DSLs than the general purpose tool that is Regexes. Yes, it'd be nice to add a kitchen sink to Regexes, but they're already complex enough. Tempted to vote for *subjective and argumentative*. – deceze Jan 27 '11 at 07:38
  • 5
    It's called a lexer and parser. There are plenty generators on the net. They have been around probably just as long as Regex. – leppie Jan 27 '11 at 07:38
  • 1
    great question, miket2e. one of the "yeah, exactly, what's up with that?" questions. unfortunately i have no answer ready but i'm curious for your results – kostja Jan 27 '11 at 07:44
  • 2
    @leppie, @deceze looking at this from the perspective of a user of regexes instead of looking at the technical difficulties involved in adding the functionality, it would be handy (and probably much (ab)used). Using a lexer and parser often still requires you to use regexes. Now you have to know two things. – Lieven Keersmaekers Jan 27 '11 at 07:52
  • 2
    @Lieven: To build a house, you need to learn to use multiple specialized tools... There is nothing wrong with that. A specialized tool will always do the job better than a generic tool. – Andrew Moore Jan 27 '11 at 08:12
  • When all you have is a hammer, everything looks like a nail. This isn't really a question. – Fenton Jan 27 '11 at 10:45
  • 3
    The e-mail one is something which ought to be built into the standard networking API of every new language which has one. It's the kind of wheel which people shouldn't need to think about inventing. – Peter Taylor Jan 27 '11 at 10:47
  • Maybe you should try it on http://programmers.stackexchange.com -- more suited for this kind of question. – Felix Dombek Jan 27 '11 at 11:38
  • @Andrew - following that logic we would still have a separate scanner, copier and fax. Instead we have multifunctionals. The question is where to draw the line in making things easier. – Lieven Keersmaekers Jan 27 '11 at 15:39
  • @Sohnee - The problem is that the nail in this example is a string. When people stop using strings, developers will stop trying to hammer them. – CurtainDog Jan 27 '11 at 23:47
  • @CurtainDog - RegEx isn't *the* tool for strings. It is *a* tool for strings. – Fenton Jan 28 '11 at 09:53
  • @Lieven: The multifunction of string parsing is a lexer-generator and parser-generator that were built to go together. – dmckee --- ex-moderator kitten Jan 29 '11 at 00:04

5 Answers

16

I believe Larry Wall covered this with Perl 6 regexes. The basic idea is to replace simple regular expressions with more useful grammar rules. They're easier to read, and it's easier to put code in for things like making sure that you have a certain number of matches. Plus, you can name rules like IsEmail. I can't possibly list all the details here, but suffice it to say, it sounds like what you're suggesting.

Here are some examples from http://dev.perl.org/perl6/doc/design/exe/E05.html:

Matching IP address:

token quad {  (\d**1..3) <?{ $0 < 256 }>  }   # Perl 6 numbers captures from $0
$str ~~ m/ <quad> <dot> <quad> <dot> <quad> <dot> <quad> /;
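
For comparison, here is a rough Python sketch of the same split of work: the regex only checks the shape (four runs of one to three digits separated by dots), and the "below 256" test that the Perl 6 token does inline is done in ordinary code afterwards. The names `IPV4_SHAPE` and `is_ipv4` are made up for this illustration and do not come from the linked document.

```python
import re

# Shape only: four runs of 1-3 digits separated by literal dots.
# The numeric range check is done in Python, not in the pattern.
IPV4_SHAPE = re.compile(r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})")


def is_ipv4(text: str) -> bool:
    """Return True if text is a dotted quad whose parts are all below 256."""
    match = IPV4_SHAPE.fullmatch(text)
    if match is None:
        return False
    # Equivalent of the <?{ ... < 256 }> assertion, applied to each quad.
    return all(int(part) < 256 for part in match.groups())
```

Python's `re` module has no inline code assertions, which is why the check sits outside the pattern here.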

Matching nested parentheses:

$str ~~ m/ \(  [ <-[()]> + : | <self> ]*  \) /;

Annotated:

    $str ~~ m/ <'('>                # Match a literal '('
               [                    # Start a non-capturing group
                    <-[()]> +       #    Match a non-paren (repeatedly)
                    :               #    ...and never backtrack that match
               |                    # Or
                    <self>          #    Recursively match entire pattern
               ]*                   # Close group and match repeatedly
               <')'>                # Match a literal ')'
             /;
Kobi
Gabe
  • Was just writing about the same :-) – kriss Jan 27 '11 at 07:43
  • If you want grammars, well, Prolog had DCG's (definite clause grammars) built-in all the time, and they are strictly more powerful than context-sensitive languages (they should be something like indexed grammars, afaik)! Just that a language-specific solution is not a general solution. – Felix Dombek Jan 27 '11 at 08:01
  • Felix: No, DCGs are overkill for what the OP is looking for. – Gabe Jan 27 '11 at 08:09
  • @Gabe: What the OP is looking for is a Swiss Knife™ for all his string needs. What he's forgetting, however, is that a Swiss Knife™ is composed of multiple specialized tools. For XML, HTML and emails, parsers exist. For integers, well, language constructs. – Andrew Moore Jan 27 '11 at 08:28
  • 2
    @Andrew: Did you read the question? He's not trying to find a tool to write every possible parser. He just wants to extend a very popular tool to do tasks that it can *nearly* do. For example, a standard drill is useless as a screwdriver because it only goes forward and at high speed. Once you make it reversible and variable speed, it's a great screwdriver (and nutdriver, etc.). The OP just wants to add a few features to a great tool to make it even better -- or just create a new tool that can do the same. – Gabe Jan 27 '11 at 10:18
  • @Gabe: "No"? I was just saying that (1) this is language specific and therefore not general, (2) there are more powerful and still easy-to-use parsing methods, even built-in, in some languages. I might also have mentioned SNOWBALL and SNOBOL. Also, i disagree with the notion of "Overkill" for DCG's: they are about as easily written and more readable (IMHO) as your annotated Perl 6 regex. – Felix Dombek Jan 27 '11 at 11:43
  • Looks a bit like BNF meets Regex. Interesting approach. – Michael Stum Jan 28 '11 at 03:44
  • Although this is language-specific, many languages' current regex implementations are based on Perl's. Who knows, the same might happen with these new Perl 6 rules. – Michael Low Jan 28 '11 at 06:07
16

Don't blame the tool, blame the user.

Regular Expressions were built for matching patterns in strings. That's it.

It was not made for:

  • Integer validation
  • Markup language parsing
  • Very complex validation (e.g. RFC 2822)
  • Exact string comparison
  • Spelling correction
  • Vector computation
  • Genetic decoding
  • Miracle making
  • Baby saving
  • Finance administering
  • Sub-atomic partitioning
  • Flux capacitor activating
  • Warp core engaging
  • Time traveling
  • Headache inducing
    Never mind that last one. It seems that regular expressions are very well suited to that last task when they are used where they shouldn't be.

Should we redesign the screwdriver because it can't nail? NO, use a hammer.

Simply use the proper tool for the task. Stop using regular expressions for tasks which they don't qualify for.

  • I want to match a number between 1 and 31, how do I do that?
    Use your language constructs to try to convert the string to an integer and do the appropriate comparisons.

  • How do I match an even/odd number of occurrences of a specific string?
    Regular expressions are not a string parser. You can however extract the relevant part with a regular expression if you only need to parse a sub-section of the original string.

  • How do I parse XML with regex?
    You don't. Use an XML or an HTML parser depending on your need. Also, an XML parser can't do the job of an HTML parser (unless you have a perfectly formed XHTML document) and the reverse is also true.

  • How do I validate an email with regex?
    You either use this large abomination or you do it properly with a parser.
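
As a rough sketch of the first point above, the regex can be limited to pulling the number out of the longer string, with the 1 to 31 check left to normal integer comparison. This is Python, and the `YYYY-MM-DD` input format as well as the names `DATE_SHAPE` and `day_of_month` are assumptions made up for the example.

```python
import re
from typing import Optional

# Hypothetical input format: a YYYY-MM-D or YYYY-MM-DD date somewhere in the text.
# The pattern only locates the digits; it does not try to validate the range.
DATE_SHAPE = re.compile(r"\b(\d{4})-(\d{2})-(\d{1,2})\b")


def day_of_month(text: str) -> Optional[int]:
    """Return the day found in text if it lies in 1..31, otherwise None."""
    match = DATE_SHAPE.search(text)
    if match is None:
        return None
    day = int(match.group(3))
    # The range check is a plain comparison, not part of the regex.
    return day if 1 <= day <= 31 else None
```

This keeps the pattern readable even when the number is only one part of a longer string, which was the objection raised in the question.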

Andrew Moore
  • 1
    I see your point, but I'm not proposing they be completely redesigned. It just seems like there's scope to improve them in a few ways, to match the things people are trying to do with them nowadays. – Michael Low Jan 27 '11 at 07:52
  • @miket2e: The problem is that it shouldn't be used to do those things in the first place. It never was the right tool, and there are plenty of tools which are built exactly to do those tasks. Regular expressions are right now extremely good at what they are supposed to do. **A fork is a fork. A spoon is a spoon. You may think you want a spork but it ain't as useful as it seems.** I definitely don't want regexes to become a spork. – Andrew Moore Jan 27 '11 at 07:57
  • He's not saying "turn regexes into a spork"; he's saying "let's build a new tool to replace regexes, which can do things regexes aren't good for". – Gabe Jan 27 '11 at 08:05
  • @Gabe: I understand that... But there are already tools out there to do that kind of work. Regex is not to blame for not solving problems it wasn't meant to solve in the first place. It doesn't need replacing. People need to stop using them as a be-all and end-all solution for string data extraction. That's not why they exist. **Use specialized parsers and lexers; they were made for that kind of work.** – Andrew Moore Jan 27 '11 at 08:09
  • 2
    The point is that we often find ourselves in situations where we need to do things like validate input ("Is this a valid email?", "Is this age at least 13?") and you can't stick a YACC grammar into a line in an Apache config or a `VARCHAR(100)` field in a database. It doesn't seem unreasonable to make a better tool for this kind of task. – Gabe Jan 27 '11 at 08:34
  • 3
    @Gabe: Ultimately, the logic of your validation really shouldn't be stored in an Apache config or in your database. If you do have a system with variable input, store parameters that you can parse and that your code can then pass to the proper validator. For example `"int|required|min:13"` for your age validation and `"email|required"` for your email. If you do have something that has a specific format and can be easily validated using a regular expression, then by all means: `"regex|pattern:/^REF[0-9]{5,8}$/i"` but beyond that, don't. – Andrew Moore Jan 27 '11 at 08:44
  • 1
    I think this is the right answer. Extensions such as those suggested by the OP will only encourage people to keep using the wrong tool. – Richard H Jan 27 '11 at 08:59
  • 1
    So you're suggesting that when I need to create a URL rewriting rule (which must reside in my httpd.conf) that is too complex for a regex, my only options should be to change my requirements or write my own Apache module? I don't see why using a more powerful matching language shouldn't be a valid option. – Gabe Jan 27 '11 at 09:42
  • @Gabe: You were talking about validation, now you are talking about URL rewriting... Two different things. URL rewriting uses patterns to redirect/rewrite the request. What do you know, regular expressions were made for pattern matching, so please store away in your Apache Config. *"Is this a valid email?"*, keep elsewhere. If the rewrite depends on the validation, grab the common denominator, validate in your controller. – Andrew Moore Jan 27 '11 at 09:47
  • 4
    Don't take this the wrong way, but this answer seems very emotional, and I think most of it is a rant. Part of our work is building better tools: the OP is inquiring about such tools, and wants to discuss their possible usefulness. In fact he is correct: many regex flavors do offer extensions that enable stronger abilities, making them a fitting tool for more tasks (not to mention code callouts, which are strictly cheating `:)`). – Kobi Jan 27 '11 at 10:19
  • 2
    I think Kobi's right. The problem is that many of us are often given a single tool (of which validation frameworks and Apache modules are merely 2 examples of such situations) and are forced to perform all of our tasks with that one tool. If we could choose a more versatile tool to be our one, we could get a lot more common things done without having to hire a professional (e.g. LEX & YACC, ANTLR, etc.). – Gabe Jan 27 '11 at 22:24
  • I think this post is kind of offensive. Are you calling senior team leaders and architects users? They believe they are almost gods and are never wrong. I'm shocked you did not get down-voted to hell, but maybe they don't use SO. Most of the time people are using regex because they have to, not because they want to. – IAdapter Feb 09 '11 at 13:10
6

All of those are reasonably possible in Perl.

To match a 1..31 with a regex pattern:

/( [0-9]+ ) (?(?{ $^N < 1 || $^N > 31 })(*FAIL)) /x

To generate something like [1-9]|[12]\d|3[01]:

use Regexp::Assemble qw( );
my $ra = Regexp::Assemble->new();
$ra->add($_) for (1..31);
my $re = $ra->re;                 # qr/(?:[456789]|3[01]?|1\d?|2\d?)/

Perl 5.10+ uses tries to optimise alternations, so the following should be sufficient:

my $re = join '|', 1..31;
$re = qr/$re/;
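
The same "join the range into an alternation" idea can be sketched outside Perl. The Python version below is an illustration only; `range_pattern` and `DAY` are names invented here. It adds digit guards on both sides so that the pattern can sit inside a longer string without matching part of a bigger number.

```python
import re


def range_pattern(low: int, high: int) -> str:
    """Build an alternation matching any integer in low..high (inclusive)."""
    # Longest alternatives first, so "31" is tried before "3" even if the
    # digit guards below are dropped.
    alternatives = sorted(
        (str(n) for n in range(low, high + 1)), key=len, reverse=True
    )
    # The lookarounds reject matches that are part of a longer run of digits.
    return r"(?<!\d)(?:" + "|".join(alternatives) + r")(?!\d)"


DAY = re.compile(range_pattern(1, 31))
```

Whether the engine optimises such an alternation the way Perl 5.10+ does is a separate matter; this sketch only shows the generation step.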

To match an even number of occurrences:

/ (?: (?:pat){2} )* /x

To match an odd number of occurrences:

/ (?:pat) (?: (?:pat){2} )* /x
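
The grouping matters: the quantifier has to apply to the whole sub-pattern, not just its last character. A small Python sketch of the same two patterns, built for a literal string and checked against the whole input; `occurrences_pattern`, `EVEN_AB` and `ODD_AB` are hypothetical names for this example.

```python
import re


def occurrences_pattern(literal: str, parity: str) -> "re.Pattern[str]":
    """Compile a pattern for an even or odd number of repeats of literal."""
    unit = "(?:" + re.escape(literal) + ")"   # one occurrence, grouped
    pairs = "(?:" + unit + "{2})*"            # zero or more pairs
    if parity == "even":
        return re.compile(pairs)
    return re.compile(unit + pairs)           # one occurrence, then pairs


EVEN_AB = occurrences_pattern("ab", "even")
ODD_AB = occurrences_pattern("ab", "odd")
```

Note that the even pattern also accepts zero occurrences, so it should be used with `fullmatch` or explicit anchors; unanchored, it matches the empty string anywhere.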

Pattern negative match:

m{<name> (.*?) </name>}x  # Non-greedy matching; m{} avoids clashing with the "/" in </name>

m{<name> ( (?: (?!</name>). )* ) </name>}x
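
The second form (a negative lookahead checked before every character) works in most backtracking engines, not only Perl. Here is a minimal Python sketch, with `NAME_ELEMENT` and `first_name` as made-up names; it is a pattern-matching illustration, not an XML parser.

```python
import re
from typing import Optional

# Capture everything up to the first closing tag: each character is consumed
# only if "</name>" does not start at that position.
NAME_ELEMENT = re.compile(r"<name>((?:(?!</name>).)*)</name>", re.DOTALL)


def first_name(text: str) -> Optional[str]:
    """Return the content of the first <name>...</name> pair, or None."""
    match = NAME_ELEMENT.search(text)
    return match.group(1) if match else None
```

This is what the multi-character `[^ ]` asked for in the question amounts to in practice.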

To get a pattern matching email addresses:

use Regexp::Common qw( Email::Address );
/$RE{Email}{Address}/
ikegami
5

It probably already exists, and has for a long time. It's called "grammars". Ever heard of yacc and lex? What is needed now is something simple. Strange as it may seem, the big strength of regexes is that they are very simple to write on the spot.

I believe that in some specialized (but large) areas, what is needed already exists. I'm thinking of XPath syntax.

Is there a broader alternative (not limited to XML, but still simple) that could cover all cases? Maybe you should take a look at Perl 6 grammars.

kriss
2

No. We should leave regular expressions as is. They are already far too complicated. When was the last time you thought you had nailed it, i.e., got the whole extended regex syntax (choose your flavour) loaded in your squashy memory?

The theory behind regexes is nice and simple. But then we wanted this and that to go with it. The tool is useful, but falls short on non-regular matching. That is ok!

What most people miss is that context-free grammars and little specialized interpreters are really easy to write.

Instead of making regexes more difficult, we should be rooting for parser support in standard libraries for our languages of choice!
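
To give an idea of how small such a parser can be, here is a sketch of a hand-written recursive-descent recognizer for nested parentheses, the classic non-regular case. It is Python, the function names are invented for the example, and the grammar it assumes is: group := "(" (non-paren character | group)* ")".

```python
def parse_group(text: str, pos: int) -> int:
    """Parse one group starting at pos.

    Returns the index just past the closing ')', or -1 if no group parses.
    """
    if pos >= len(text) or text[pos] != "(":
        return -1
    pos += 1
    while pos < len(text) and text[pos] != ")":
        if text[pos] == "(":
            # Nested group: recurse, and give up if the inner group fails.
            pos = parse_group(text, pos)
            if pos == -1:
                return -1
        else:
            pos += 1  # ordinary non-paren character
    if pos >= len(text):
        return -1  # ran out of input before the closing ')'
    return pos + 1


def is_balanced_group(text: str) -> bool:
    """True if the whole text is exactly one well-nested parenthesised group."""
    return parse_group(text, 0) == len(text)
```

The recursion in `parse_group` is the part a classical regular expression cannot express; everything else is a plain loop.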

Daren Thomas