5

I have a very simple regex similar to this:

HOHO.*?_HO_

With this test string...

fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_fbguyev

  • I expect it to match just _HOHO___HO_ (shortest match, non-greedy)
  • Instead it matches _HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_ (longest match, looks greedy).

Why? How can I make it match the shortest match?

Adding and removing the ? gives the same result.

Edit - better test string that shows why [^HOHO] doesn't work: fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO_H_O_H_O_HO_fbguye


All I can think of is that maybe it is matching multiple times - but there's only one match for _HO_, so I don't understand why it isn't taking the shortest match that ends at the _HO_, discarding the rest.

I've browsed all the questions I can find with titles like "Non-greedy regex acts greedy", but they all seem to have some other problem.

user56reinstatemonica8
  • 32,576
  • 21
  • 101
  • 125
  • 1
    It matches the shortest possible path from the *first* occurrence of `HOHO`; and it can do this without backtracking, so it won't try to shorten that. – Ja͢ck Dec 10 '14 at 10:53

3 Answers3

11

I figured out a solution with some help from Regex lazy vs greedy confusion.

In regex engines like the one used by Javascript (NFA engines I believe), non-greedy only gives you the match that is shortest going left to right - from the first left-hand match that fits to the nearest right-hand match.

Where there are many left-hand matches for one right-hand match, it will always go from the first it reaches (which will actually give the longest match).

Essentially, it goes through the string one character at a time asking "Are there matches from this character? If so, match the shortest and finish. If no, move to next character, repeat". I expected it to be "Are there matches anywhere in this string? If so, match the shortest of all of them".


You can approximate a regex that is non-greedy in both directions by replacing the . with a negation meaning "not the left-side match". To negate a string like this requires negative lookaheads and non-capturing groups, but it's as simple as dropping the string into (?:(?!).). For example, (?:(?!HOHO).)

For example, the equivalent of HOHO.*?_HO_ which is non-greedy on the left and right would be:

HOHO(?:(?!HOHO).)*?_HO_

So the regex engine is essentially going through each character like this:

  • HOHO - Does this match the left side?
  • (?:(?!HOHO).)* - If so, can I reach the right-hand side without any repeats of the left side?
  • _HO_ - If so, grab everything until the right-hand match
  • ? modifier on * or + - If there are multiple right-hand matches, choose the nearest one
Community
  • 1
  • 1
user56reinstatemonica8
  • 32,576
  • 21
  • 101
  • 125
  • 1
    The explanation of the first part is good, your example to solve the current problem is wrong. You need to better understand the character class notion and why writing `[^HOHO]` has no sense. – Casimir et Hippolyte Dec 09 '14 at 18:22
  • You're right, thanks, it's finding characters that aren't H or O rather than testing against the string, I need to do something using negative lookarounds similar to this http://stackoverflow.com/questions/977251/regular-expressions-and-negating-a-whole-character-group - will come back and edit when I have time – user56reinstatemonica8 Dec 10 '14 at 01:23
  • Okay, think I've fixed it now – user56reinstatemonica8 Dec 10 '14 at 10:51
  • It took me white a while to understand what `(?:(?!HOHO).)*` does or rather how it works. For me this is the explanation what I found made it clear: "From the current cursor, are the next 4 characters "HOHO"? If YES, we're done with this part and moving on to the next part of the regex. If NO, then grab the character at the cursor, move the cursor past it, repeat." – Scratte Feb 03 '21 at 05:10
5

Why it matches the whole string?

This is because regular-expression pattern matching is done by finding the first position in the string at which a match is possible. Since a match is possible starting at the first character of the string, shorter matches starting at subsequent characters are never even considered.

Example:
Let's consider a regular expression /a+?b/ and test string "aaaaaaaaab". When applied to the string it matches the whole string. Not just last a & b. This is because the first position in the string where a match is possible is at the first a.

So, if you want to match ab in aaaaaaaaab, use a negated character class based regex rather than a lazy dot:

a[^ab]*b

See the regex demo.

Source: Javascript: The Definitive Guide, Sixth Edition, Page Number: 255

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
Vishal
  • 166
  • 4
  • 2
    It would be nice to mention your [sources](http://docstore.mik.ua/orelly/webprog/jscript/ch10_01.htm#jscript4-CHP-10-SECT-1.3.1) as well. – Ja͢ck Dec 10 '14 at 11:22
  • Kindly read [How to reference material written by others](https://stackoverflow.com/help/referencing). TL;DR: You need to link to the source and provide with the name of the author or organization (if there's no individual author). In this case it seems David Flanagan is the author of the book. – Scratte Feb 03 '21 at 04:37
4

The result is non-greedy, because it's the shortest match from the first occurrence of HOHO until _HO_ is reached; the engine traverses the string from left to right and because it doesn't have to backtrack, it won't attempt to shorten anything.

To make it work in the way that's expected here, you need to have a greedy prefix in your expression:

/.*(HOHO.*?_HO_)/

The first memory capture contains the string that you're after; the greedy prefix will try to skip as many characters as possible, so it will match the last occurrence of HOHO first.

Ja͢ck
  • 170,779
  • 38
  • 263
  • 309
  • 1
    I've been reading http://blog.stevenlevithan.com/archives/greedy-lazy-performance to try to better understand backtracking. Would it be true that in this, it would match the entire string, then backtrack until it finds the string's last `HOHO`, then match forwards until it reaches `_HO_`? So in a string like `HOHO_1_HO_|HOHO_2_HO_|HOHO_3_HO_` it would match `HOHO_3_HO_` only? I'm also wondering if the example in my answer might be more efficient in case of very long strings? – user56reinstatemonica8 Dec 10 '14 at 11:15
  • @user568458 Right, that's what my answers suggests; i.e. it will match the last occurrence of "HOHO" first :) performance may be impacted, but only a benchmark will give *some* assurance. – Ja͢ck Dec 10 '14 at 11:17