32

Greedy matching seems to be a huge source of confusion for beginners writing regular expressions, it can cause hidden performance problems, and it would seem that a typical use case would be non-greedy.

Is this just for legacy reasons (it was how it was first done, and every implementation copies that), or is there a reason for it?
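
For concreteness, a minimal Python sketch of the two behaviours (the syntax is the same in most Perl-style engines):

    import re

    s = "<em>one</em> and <em>two</em>"
    # Greedy: .* takes as much as it can, then backs off just far enough.
    print(re.search(r"<em>.*</em>", s).group())   # '<em>one</em> and <em>two</em>'
    # Lazy: .*? takes as little as it can, growing only on demand.
    print(re.search(r"<em>.*?</em>", s).group())  # '<em>one</em>'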

AndersTornkvist
Yishai
  • Whoever voted to close as subjective & argumentative, care to elaborate? – falstro Feb 16 '10 at 17:47
  • Regular expressions aren't greedy by default, but their quantifiers are :-) – Andy E Feb 16 '10 at 17:49
  • It seems to me the real question is, why are lazy quantifiers more poorly supported and/or awkward to use than greedy ones? – Ipsquiggle Feb 16 '10 at 17:57
  • 1
    this question has bugged me as well, logically thinking, creating a lazy regex engine is far more efficient and easy then a greedy one, they should have made the default mode as lazy, coz thats one perceive by default. Plus in all my regex using life, I dont remember using a greedy match more than 1% of the time – shabby Jan 15 '14 at 22:47

6 Answers

11

Hysterical Raisins


Part of the answer may involve the origins of REs in practical computing. They were originally a theoretical concept from automata theory and formal language theory until Ken Thompson himself wrote a real implementation and used them in qed and ed(1).

The original version had only the greedy syntax and so there wasn't a decision to even make.

DigitalRoss
  • 3
    I'm not so sure you can say that theoretical regular languages are greedy by default. I think a Kleene regular expression defines a set of all the strings that could match it, so `/x*/` could match "" or "x" or "xxx" (etc.). Such an expression defines a regular language including the strings "", "x" and "xxx". Note that this says nothing about how to go about searching for matches in a body of text; it's only when you apply the theory that you start caring about greediness. – Nate C-K Feb 16 '10 at 18:19
  • Sure, sure, by *"original version"* I just meant *"as Ken Thompson typed it in for those editors"*, and in those versions and for almost a decade afterwards, ed, grep, ex, and vi only did greedy pattern matching. – DigitalRoss Feb 16 '10 at 18:23
  • 1
    Ah, in that case we're in agreement. – Nate C-K Feb 17 '10 at 00:13
9

Regarding performance, lazy quantifiers aren't always faster, because of backtracking: http://blog.stevenlevithan.com/archives/greedy-lazy-performance

As for the actual design, I honestly can't say why quantifiers are greedy by default, but I do wonder what control character would have been used to make a quantifier greedy instead of lazy. I don't think `?` would have cut it :-)
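
One rough way to see the performance point for yourself is to time both forms on input where the lazy quantifier has to inch forward character by character; a sketch only, and absolute numbers will vary by engine and input:

    import re
    import timeit

    # The lazy .*? must try every intermediate position before reaching the
    # closing tag; the greedy .* overshoots once and backtracks a short way.
    haystack = "<b>" + "x" * 10000 + "</b>"
    patterns = {"greedy": re.compile(r"<b>.*</b>"),
                "lazy": re.compile(r"<b>.*?</b>")}

    for name, pattern in patterns.items():
        seconds = timeit.timeit(lambda p=pattern: p.search(haystack), number=1000)
        print(name, round(seconds, 4))

On input like this the greedy form typically wins, which is the point of the linked article: neither form is universally faster.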

Andy E
6

Possible reason: The regex engine needs to backtrack a lot if it's non-greedy.
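
A toy model makes the asymmetry visible. This hypothetical step counter only mimics the order in which a backtracking engine would try the star's candidate lengths for a pattern like `.*b`; it is not a real engine:

    # The star consumes k characters, then 'b' must follow; count how many
    # candidate values of k are tried before the first match.
    def attempts(text, end_char, lazy):
        n = len(text)
        order = range(n + 1) if lazy else range(n, -1, -1)
        tries = 0
        for k in order:
            tries += 1
            if k < n and text[k] == end_char:
                return tries
        return None

    text = "x" * 10 + "b"
    print(attempts(text, "b", lazy=False))  # greedy: 2 attempts
    print(attempts(text, "b", lazy=True))   # lazy: 11 attempts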

kennytm
3

Well, it is important that computers behave predictably whenever possible. So the correct behavior should follow a simple rule, like greedy matching, so that at least experienced programmers can predict the outcome of a piece of code.

As for whether a typical use case should be non-greedy, what about the following: suppose I have a file with entries like foo1909, bar3939, baz3331, and I just want to extract these numbers. It seems natural enough to write (\d*) as the regular expression for this.

You might say that it is just as easy to write (\d*)\D or whatever, but it is basically always the case that the programmer can be more explicit and less ambiguous. Since we wanted a default behavior that was 100% predictable and trivial to calculate in one's head, it seems reasonable to me.
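
A quick check of the foo1909 example in Python, using + rather than the answer's * so that the lazy variant doesn't simply match the empty string:

    import re

    line = "foo1909"
    # Default greedy quantifier: takes all the digits, as the answer expects.
    print(re.search(r"(\d+)", line).group(1))   # '1909'
    # Lazy variant of the same pattern: stops at the minimum of one digit.
    print(re.search(r"(\d+?)", line).group(1))  # '1'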

forefinger
  • 3
    This is a perfectly logical and reasonable guess, however, it's quite unrelated to the real reason, which is simply that non-greedy came much, much, later and so it was not the default. – DigitalRoss Feb 16 '10 at 18:16
3

The real issue here is the Kleene closure operator (star); for everything else in a regular expression, the longest match is the same as the shortest match.

When you think about it in those terms, you see that more modern tools recognize the need for both. I'm up late, so I can think of only two examples:

  • Both ksh and bash provide "longest match" and "shortest match" forms of most of the special variable-altering operators.

  • The Lua regular expressions include `*` for Kleene closure longest match and `-` for Kleene closure shortest match. This one always bites me when I forget to escape a literal `-` sign.

It would be interesting to go back to Kleene's original work and see if that might have influenced early tools toward longest match.
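
In Python terms, a small illustration of the claim that the star is where longest-versus-shortest first becomes a real choice:

    import re

    s = "xxx"
    print(re.match(r"x*", s).group())   # 'xxx' - greedy star: longest match
    print(re.match(r"x*?", s).group())  # ''    - lazy star: shortest match
    print(re.match(r"xx", s).group())   # 'xx'  - no star: only one length possible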

Norman Ramsey
  • 1
    What about alternation? Given the regex `/foo|foobar/` and the target string `blahfoobarblah`, a mathematically-pure regular expression will always match `foobar`, while a Perl-derived, NFA regex will settle for `foo`. – Alan Moore Feb 17 '10 at 05:24
1

"it would seem that a typical use case would be non-greedy."

I want to make it clear that this is wrong, unless "typical use case" means HTML-hacking.

An easy example is lexical analysis for programming languages. You simply don't want

    foo = 42

to be interpreted as 3 variables, followed by an equal sign, followed by 2 numbers. On the contrary, typically you expect your parser to consider the longest possible matches.

Before the advent of HTML, we elders lived for decades with greedy regular expressions, and we did just fine. Even today I do not use non-greedy ones in 99% of all cases, admittedly because I am too lazy to look up the syntax, but also because the occasions are rare where you couldn't just write a well-terminated greedy one. For example, to match a string:

"(\\"|[^"])*"
Ingo
  • I wouldn't think lexical analysers get much from looking "for something" while also making sure everything else is "not something". Sure, it looks decent for strings, which are delimited by a single symbol, but try to do it for something delimited by multiple symbols and it gets ugly quickly, for example a multi-line comment: `/\*((?!\*/).)*\*/` vs. `/\*.*?\*/`. It gets worse the more delimiters you add, because you have to negate all of them. Greedy *works* for most use cases because these issues don't frequently arise, but claiming that it benefits lexical analysis or is otherwise wrong... – TrisT Jun 21 '21 at 02:19