1

What general (i.e. not including any literals specific to the pattern as a whole) PCRE sub-expression placed between two sub-expressions each consisting of literal characters will cause the pattern to match the smallest number of characters?

Note that this question is not satisfied by:

  • Any existing answer to this related question.

  • Any existing answer to other related questions I've found on this site, including all suggested by Questions that may already have your answer.

  • .*? (lazy). This (as stated here) does not necessarily match the smallest number of characters.

  • [^x]* where x is a literal character copied from elsewhere in the pattern. The requirement is for a expression that is general i.e. not including any literals specific to the pattern as a whole. To avoid attracting such wrong answers, this question deliberately doesn't provide examples of failing patterns.

Community
  • 1
  • 1
ChrisJJ
  • 2,191
  • 1
  • 25
  • 38
  • Well, a good example would be in one of your links: `abcabk`, where `a.+?k` matches the entire string instead of just the last 3 characters. (Where a solution is indeed to copy the double character into a Not-set – which you want to avoid.) – Jongware Dec 04 '16 at 23:19
  • What are you actually trying to do? – Robert Dec 04 '16 at 23:34
  • This site is for specific questions about actual problems you're facing. Your question is far too vague and broad in scope. Voting to close until it becomes more specific, with an actual problem statement and a specific question. – Ken White Dec 04 '16 at 23:49
  • I've not prevented anything. It takes 5 votes to put a question on hold. I've had the courtesy to explain to you why I cast mine to do so. Here's a better idea: Why not edit your question to make it more specific as asked, so I can remove my close vote? – Ken White Dec 05 '16 at 01:57
  • Clearly our opinions on that differ. This is not a debate. IMO, your question is too vague and should be closed for that reason, and I've voted accordingly. I didn't ask for your agreement. – Ken White Dec 05 '16 at 02:01
  • 1
    There's nothing vague about this question. He wants to know which `whatever` will give him the shortest match anywhere of `leftwhateverright` instead of the leftmost shortest match that `left.*?right` gives with the additional requirement that the solution should not require `whatever` to change when `left` or `right` changes so the regex can be used as a template. – Jan Goyvaerts Dec 06 '16 at 03:26

1 Answers1

2

This regex finds the shortest match that starts with left and ends with right and allows any text in between:

(left)(?:(?!(?1)).)*?(?:right)

This regex works with PCRE and any other flavor that supports subroutine calls. The left and right bits don't have to be literal text. They can be any regex as long as they don't match the same text. The only other issue you need to take into account is that if left and right contain capturing groups then the numbers of those groups will be shifted compared to left and right as stand-alone regexes.

Backtracking regex engines always return the leftmost match. In the case where left and right differ, we can force it to return the shortest match by making sure that the match does not contain left more than once.

When left and right are the same there is no pure regex solution. What you can then do is to split the string along all matches of left and then find the shortest string in the resulting array.

Jan Goyvaerts
  • 21,379
  • 7
  • 60
  • 72
  • This is great but may not work if the left and right sub-patterns can match the same thing. – Walf Dec 06 '16 at 13:03
  • @jan Ah, cool. A negative lookahead using a subroutine to reference the left literal, thereby excluding the left from the body. And then the body wild card made lazy to exclude the right literal from the body. Though it doesn't quite meet the requirement ("placed between two sub-expressions each consisting of /literal/ characters") it sure is useful and the best answer so far! :-) Thanks. – ChrisJJ Dec 06 '16 at 22:23
  • @Walf I've been unable to find such a fail case. Can you cite one? – ChrisJJ Dec 06 '16 at 22:24
  • @Jan, This fails when a capturing group is present before 'left'. A fix (for PCRE >=7.0 and PCRE2) is `(left)(?:(?!(?-1)).)*?(?:right)`. – ChrisJJ Dec 06 '16 at 22:47
  • `(marker)(?:(?!(?1)).)*?marker` against `markerXXXmarkerXmarker` – Walf Dec 06 '16 at 23:24
  • Putting capturing groups before `(left)` will change the number of that group. So then you also need to change the number of the subroutine call. You can use a named group and subroutine call to avoid this. `(?-1)` will break if `left` is substituted with something that contains capturing group numbers which thus fails your requirement that the solution needs to be independent of what `left` and `right` actually are. – Jan Goyvaerts Dec 07 '16 at 00:42
  • @Walf. Agreed. Thanks. – ChrisJJ Dec 07 '16 at 01:22
  • @jay "Putting capturing groups before (left) will change the number of that group" Surely it doesn't change the relative number of that group `-1`. "(?-1) will break if left is substituted with something that contains capturing group numbers" Fortunately that case is outside the given constraint 'between two sub-expressions each consisting of /literal/ characters'. – ChrisJJ Dec 07 '16 at 01:58