2

I'm working with pattern matching in Postgresql 9.4. I run this query:

select regexp_matches('aaabbb', 'a+b+?')

and I expect it to return 'aaab' but instead it returns 'aaabbb'. Shouldn't the b+? atom match only one 'b' since it is not greedy? Is the greediness of the first quantifier setting the greediness for the whole regular expression?

lufte
  • 1,283
  • 1
  • 11
  • 21

1 Answers1

3

Here is what I've found in postgresql 9.4's documentation:

Once the length of the entire match is determined, the part of it that matches any particular subexpression is determined on the basis of the greediness attribute of that subexpression, with subexpressions starting earlier in the RE taking priority over ones starting later.

and

If the RE could match more than one substring starting at that point, either the longest possible match or the shortest possible match will be taken, depending on whether the RE is greedy or non-greedy.

An example of what this means:

SELECT SUBSTRING('XY1234Z', 'Y*([0-9]{1,3})');
Result: 123
SELECT SUBSTRING('XY1234Z', 'Y*?([0-9]{1,3})');
Result: 1

In the first case, the RE as a whole is greedy because Y* is greedy. It can match beginning at the Y, and it matches the longest possible string starting there, i.e., Y123. The output is the parenthesized part of that, or 123. In the second case, the RE as a whole is non-greedy because Y*? is non-greedy. It can match beginning at the Y, and it matches the shortest possible string starting there, i.e., Y1. The sub-expression [0-9]{1,3} is greedy but it cannot change the decision as to the overall match length; so it is forced to match just 1.

Meaning that the greediness of an operator is determined by the the ones defined prior to it.

I guess you have to use a+?b+? for achieving what you want.

karthik manchala
  • 13,492
  • 1
  • 31
  • 55
  • I don't find the first sentence in the docs. Can you link it? – lufte Apr 11 '15 at 20:47
  • I think i took it from a pdf for which i lost the link to.. although [this](http://stackoverflow.com/questions/25690218/why-wont-this-regex-work-in-postgresql) is what google gave me. And also you can read section `9.7.3.5` from quoted document in the answer on how the matching is done in psql – karthik manchala Apr 11 '15 at 20:57
  • A link to an outdated manual page (why Postgres 9.0?) is one thing. But a completely incorrect quote is another thing. Please fix it or delete it. – Erwin Brandstetter Apr 11 '15 at 21:35
  • Updated the reference and fixed the quote... thanks :) – karthik manchala Apr 11 '15 at 21:43
  • Sorry but I'm still not convinced with this answer. Yes, the first paragraph tells me that if I try to match `'.*.*'` against `'aaabbb'`, the first `.*` will eat all the characters because it's the first expression, despite that they are both greedy. The second paragraph tells me that depending on greediness, the expression matches the longest substring or the shortest and not something in the middle. Both statements are true (and not only for Postgresql) but they don't explain why the non-greedy quantifiers stop working if I already used a greedy one. – lufte Apr 12 '15 at 16:50
  • I re-read the documentation and the quote you provided seems to be the answer @karthikmanchala. There is also an example of this in the section that starts with "An example of what this means:...", if you could also quote that part it would be great. – lufte Apr 26 '15 at 20:21