PHP - ungreedy regular expression still a little 'greedy'

Question

For my CMS I need to replace multiline content between [?] and [/?] tags if it contains the string %empty%, and leave it untouched if the %empty% marker is not found.

$a='
[?]<h1>%empty%</h1>
<p>text</p>
[/?]  
text        
[?]<h1>%empty%</h1>
<p>text</p>
[/?]  
text';

$r = preg_replace(
  '/(\[\?\]).*?%empty%.*?(\[\/\?\])/s',
  "REPLACED",
  $a);
echo $r;

Right result:

REPLACED  
text        
REPLACED  
text

It works well in almost every combination, except when the first block does not match. In that case, all content between the first [?] and the last [/?] is replaced.

$a='
[?]<h1>%!empty%</h1>
<p>text</p>
[/?]  
text        
[?]<h1>%empty%</h1>
<p>text</p>
[/?]  
text';

Wrong result:

REPLACED  
text        

Expected:

[?]<h1>%!empty%</h1>
<p>text</p>
[/?] 
text  
REPLACED  
text

I have tried both the ungreedy modifier and lazy quantifiers, with the same result. I think I need to explicitly handle the second [/?] in the regexp, but so far without success.

Corrected. The first block must stay untouched. In the wrong result, all content between both marks is 'eaten'. A result without the [?] and [/?] tags is also OK. — Axis, Jul 26 '19 at 12:41
Remove the `s` modifier in `/s`; it makes the dot also match a newline. — The fourth bird, Jul 26 '19 at 12:47
Yes, but the key requirement is that it works with multiple lines of text between the marks. I'm sorry, I'll change it. — Axis, Jul 26 '19 at 12:50
You should edit your question to include a multiline example if that's part of your requirements. — joanis, Jul 26 '19 at 12:59
The problem is that even non-greedy search works left to right. After seeing `[?]`, greedy says "take the longest continuation I can take from here", non-greedy says "shortest", but neither says "is there a later `[?]` I could have started from?" So you need to replace `.*` by something that will not match the next `[?]`, which won't be trivial but should be possible. — joanis, Jul 26 '19 at 13:01
Is there any chance you might be able to count on `[` or `]` not occurring in the text? If so, replacing `.*` by `[^][]*` might work. Demo: https://regex101.com/r/Cgx19h/1 — joanis, Jul 26 '19 at 13:08
If `[` or `]` are allowed in text but have to be escaped, replacing `.*` with `([^][]|\\\[|\\\])` would work. Demo: https://regex101.com/r/Cgx19h/2 But if you allow square braces in general I'm not sure how I'd do it. — joanis, Jul 26 '19 at 13:11
@splash58 That seems like almost a good solution, but an opening square bracket after the opening [?] breaks it :-(. Still, I think it's on the right track. — Axis, Jul 26 '19 at 13:44
@joanis: testing if `[` is preceded by a backslash doesn't prove anything: 1. you don't know if this syntax has an escape character and if this one is the backslash, 2. even if there's one and if it is the backslash, what if `[` is preceded by a literal backslash, examples (raw strings): `a\\[`, `a\\\\[`, `a\\\\\\[`, ... — Casimir et Hippolyte, Jul 26 '19 at 13:48
Perhaps using 2 times a tempered greedy token https://regex101.com/r/2UWj23/1 — The fourth bird, Jul 26 '19 at 14:02
At this point I'm just waiting for @Axis to specify whether square brackets are allowed in the text or not, and with what syntax, then we'll know what we're aiming for. Those were just two ideas based on simple assumptions. — joanis, Jul 26 '19 at 14:08
@Thefourthbird I like this solution, it might be the simplest, and it should work whether or not square brackets are allowed in the text, and in whatever form. Elegant. Seems worth writing up and posting as a solution to me. — joanis, Jul 26 '19 at 14:09
@Axis I don't gather what is wrong - https://regex101.com/r/Cgx19h/6 — splash58, Jul 26 '19 at 14:36
@joanis It will be generic HTML, so I cannot exclude square brackets from the allowed characters. — Axis, Jul 26 '19 at 15:34
Then I would go for the solution @Thefourthbird linked to. I don't want to post it as an answer since they should get the credits for it. — joanis, Jul 26 '19 at 16:06
The solution linked by @splash58 seems just as good, take your pick. — joanis, Jul 26 '19 at 16:09
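
joanis's point about left-to-right matching can be checked directly. The following is a minimal sketch that is not from the thread: it runs the lazy pattern from the question against the failing input and inspects where the match starts and how many closing tags it contains. The variable names are illustrative, and the input's trailing spaces are trimmed for readability, which is an assumption that should not affect the outcome.

```php
<?php
// Failing input from the question, trailing spaces trimmed (assumption):
// the first block has no %empty% marker, the second one does.
$a = "\n[?]<h1>%!empty%</h1>\n<p>text</p>\n[/?]\ntext\n"
   . "[?]<h1>%empty%</h1>\n<p>text</p>\n[/?]\ntext";

// Lazy pattern from the question (capture groups dropped, they are unused).
$lazy = '/\[\?\].*?%empty%.*?\[\/\?\]/s';

preg_match($lazy, $a, $m, PREG_OFFSET_CAPTURE);

// The engine commits to the leftmost [?] it can start from...
$startsAtFirstTag = ($m[0][1] === strpos($a, '[?]'));
// ...and the lazy .*? then grows until %empty% is found, crossing the first [/?].
$closingTagsInMatch = substr_count($m[0][0], '[/?]');

var_dump($startsAtFirstTag);   // bool(true)
var_dump($closingTagsInMatch); // int(2)
```

Lazy quantifiers only shorten the continuation from a fixed starting point; they never move the starting point to a later [?], which is why the match swallows both blocks.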

The fourth bird · Accepted Answer · 2019-07-26T23:24:11.187

For your current example data, if you want to match from [?] to [/?], with no [?] allowed in between and %empty% required in between, you could use a tempered greedy token.

Using the /s modifier to make the dot match a newline:

\[\?\](?:(?!\[/?\?\]).)*%empty%(?:(?!\[\?\]).)*\[/\?]

Explanation

\[\?\] Match [?]
(?: Non capturing group
- (?!\[/?\?\]). Assert what is directly on the right is not [?] or [/?]. Then match any char.
)* Close non capturing group and repeat 0+ times
%empty% Match literally
(?: Non capturing group
- (?!\[\?\]). Assert what is directly on the right is not [?]. Then match any char.
)* Close non capturing group and repeat 0+ times
\[/\?] Match [/?]
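
As a rough illustration of how this pattern might be plugged into preg_replace, here is a sketch using the two inputs from the question. The `~` delimiter, the variable names, and the trimmed trailing spaces in the inputs are assumptions made for this example and are not part of the original answer.

```php
<?php
// Tempered greedy token pattern from the answer; "~" is used as the delimiter
// so the "/" in [/?] needs no escaping, and "s" lets the dot match newlines.
$pattern = '~\[\?\](?:(?!\[/?\?\]).)*%empty%(?:(?!\[\?\]).)*\[/\?]~s';

// First block lacks %empty%, second block has it (trailing spaces trimmed).
$mixed = "[?]<h1>%!empty%</h1>\n<p>text</p>\n[/?]\ntext\n"
       . "[?]<h1>%empty%</h1>\n<p>text</p>\n[/?]\ntext";

// Both blocks contain %empty%.
$both = "[?]<h1>%empty%</h1>\n<p>text</p>\n[/?]\ntext\n"
      . "[?]<h1>%empty%</h1>\n<p>text</p>\n[/?]\ntext";

$mixedResult = preg_replace($pattern, 'REPLACED', $mixed);
$bothResult  = preg_replace($pattern, 'REPLACED', $both);

// In $mixedResult only the second block becomes REPLACED;
// in $bothResult both blocks do.
echo $mixedResult, "\n---\n", $bothResult, "\n";
```

Because the first group refuses to step over either [?] or [/?], a block without %empty% can no longer borrow the marker from a later block.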

Regex demo

Edit

@Casimir et Hippolyte suggests a more performant pattern using an Unrolled Star Alternation Solution approach:

\[\?\][^[%]*+(?:\[(?!\?])[^[%]*|%(?!empty%)[^[%]*)*+%empty%[^[]*+(?:\[(?!/?\?])[^[]*)*+\[/\?]

Explanation

\[\?\] Match [?]
[^[%]*+ Negated character class, match 0+ times any char except [ or %, using a possessive quantifier
(?: Non capturing group
- \[(?!\?]) Match [, assert what is directly on the right is not ?]
- [^[%]* If that is the case, match 0+ times any char except [ or %
- | Or
- %(?!empty%) Match %, assert what is directly on the right is not empty%
- [^[%]* If that is the case, match 0+ times any char except [ or %
)*+ Close non capturing group and repeat 0+ times using a possessive quantifier
%empty%[^[]*+ Match %empty% and 0+ times any char except [
(?: Non capturing group
- \[(?!/?\?]) Match [, assert what is directly on the right is not an optional / and ?]
- [^[]* If that is the case, match 0+ times any char except [
)*+ Close non capturing group and repeat 0+ times
\[/\?] Match [/?]
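
The unrolled pattern can be exercised the same way. The sketch below is an illustration rather than part of the original answer: the second alternative uses the class `[^[%]`, in line with Casimir et Hippolyte's comment that a `{` in his linked pattern was a typo for `[`. The `~` delimiter, the variable names, and the `$brackets` input (hypothetical HTML containing literal square brackets) are assumptions made for this example.

```php
<?php
// Unrolled pattern; no "s" modifier is needed because it contains no dot.
// Assumption: the second alternative uses [^[%] (no "{"), see the note above.
$unrolled = '~\[\?\][^[%]*+(?:\[(?!\?])[^[%]*|%(?!empty%)[^[%]*)*+'
          . '%empty%[^[]*+(?:\[(?!/?\?])[^[]*)*+\[/\?]~';

// First block lacks %empty%, second block has it (trailing spaces trimmed).
$mixed = "[?]<h1>%!empty%</h1>\n<p>text</p>\n[/?]\ntext\n"
       . "[?]<h1>%empty%</h1>\n<p>text</p>\n[/?]\ntext";

// Hypothetical block whose HTML contains literal square brackets.
$brackets = "[?]<a href=\"x[1]\">%empty%</a>\n[/?]";

$mixedResult    = preg_replace($unrolled, 'REPLACED', $mixed);
$bracketsResult = preg_replace($unrolled, 'REPLACED', $brackets);

// Only the second block of $mixed is replaced; $brackets is replaced whole.
echo $mixedResult, "\n---\n", $bracketsResult, "\n";
```

The possessive quantifiers stop the engine from retrying shorter runs once a segment has been consumed, which is where the performance gain over the tempered greedy token is expected to come from.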

Regex demo

Speechless. Thank you for the perfect solution and the points to study. — Axis, Jul 26 '19 at 16:50
The second negative lookahead should be: `(?!\[/\?\])` (according to your explanation). Personally, I prefer writing it like this: https://regex101.com/r/0C8DQe/2 — Casimir et Hippolyte, Jul 26 '19 at 20:19
@CasimiretHippolyte Awesome! But I did mean to match `\[\?\]` in the second negative lookaround to not match `[?]` in between. I chose [this approach](https://www.rexegg.com/regex-quantifiers.html#tempered_greed) and I think you chose [this approach](https://www.rexegg.com/regex-quantifiers.html#explicit_greed). I have made a minor change to your version to not match the `[?]` in between: https://regex101.com/r/6H1DYc/1/ If you agree, and with your permission, can I add it to the answer, or do you want to post it? — The fourth bird, Jul 26 '19 at 22:16
I said that because you wrote in your explanations: *Assert what is directly on the right is not `[/?]`*. Sorry, it was obviously a `[` and not a `{` in my previously linked pattern. About the approach, it is more the next chapter, [Unrolled Star Alternation Solution](https://www.rexegg.com/regex-quantifiers.html#unrolled_staralt). Feel free to add it to the answer or not; I will not post an answer. — Casimir et Hippolyte, Jul 26 '19 at 22:44
@CasimiretHippolyte I see now what you meant, I have updated it. Thank you for your feedback, I really appreciate it! — The fourth bird, Jul 26 '19 at 22:51
