Syntax for recursive regex in PHP

Question

I'm missing something that makes my attempts at using recursion with (?R) fail.

An example to explain my problem 'clearly':

$str1 = "somes text -start bla bla FIND bla bla bla FIND bla FIND bla end-";
$str2 = "somes text -start bla bla FIND bla bla bla FIND bla FIND bla end-";
$my_pattern = "-start .*(FIND).* end-";

preg_replace_callback($my_pattern, 'callback', $str1.$str2);

It will only match the very last FIND.

With the 'ungreedy' option I match the first FIND of both strings.

But how can I get all of them? I tried to use '(?R)' but I don't really understand how it works.

Thanks.

EDIT: The real task is to find all the 'title' attributes between <a> and </a>. I know regex is not the best tool for parsing HTML, but this is just a school exercise to learn regex.

That's why I didn't post the real task: I wanted to understand and be able to do it myself.

<html>
 <head><title>Nice page</title></head>
<body>
    Hello World
 <a href=http://cyan.com title="a link">
                this is a link
 </a>
<br />
<a href=http://www.riven.com> Here too <img src=wrong.image title="and again">
    <span>Even that<div title="same">all the same</div></span>
</a>
</body>
</html>

My job is to put every title in uppercase (title="A LINK" for example) using regex.

My last pattern was:

#<a .* title=\"(.*)\".*</a>#Uis

This made me catch only (title="a link") and (title="and again"). Your method should work (stribizhev), but I haven't succeeded in implementing it; I'm still working on it.

What do you want: to get all the `FIND`s or to replace all of them? — Mubin, Sep 12 '15 at 10:46
Several problems: Since pattern delimiters are missing, hyphens are seen as delimiters and not as literal characters. I think you confuse *repetition* and *recursion*, you don't need recursion here. You need to search about greedy quantifiers too. See the PHP manual and this post: http://stackoverflow.com/questions/5319840/greedy-vs-reluctant-vs-possessive-quantifiers — Casimir et Hippolyte, Sep 12 '15 at 11:09
@MubinKhalid: Yes, I want to replace them all using a callback function. @CasimiretHippolyte: Yes, I forgot the delimiter in my example; I'm using '#' as the delimiter. — Mickael_42, Sep 12 '15 at 11:41
You need a solution based on the \G operator. I wish I could help but I'm on a mobile now. The regex would look like `(?:-start|(?!^)\G).*?(FIND)(?=.*end-)`. Instead of `.*`, you might need a tempered greedy token: `(?:(?!-start|end-|FIND).)*`. And maybe even the singleline flag (`#s`) at the end. — Wiktor Stribiżew, Sep 12 '15 at 12:02
@stribizhev Hmm, it seems to be right, but I don't understand your pattern very well... Could you please explain it step by step? I'm not familiar with the \G operator. — Mickael_42, Sep 14 '15 at 18:26
You can have a look at [When is \G useful application in a regex?](http://stackoverflow.com/questions/21971701/when-is-g-useful-application-in-a-regex) and [What good is \G in a regular expression?](http://perldoc.perl.org/perlfaq6.html#What-good-is-%5cG-in-a-regular-expression%3f). — Wiktor Stribiżew, Sep 14 '15 at 18:52
@stribizhev I'll check those links. Could you please explain it step by step too? — Mickael_42, Sep 14 '15 at 18:57
@stribizhev I edited the question with the real exercise. Thank you for your help. — Mickael_42, Sep 14 '15 at 19:55
@Mickael_42: You must be kidding: your job is to manipulate HTML code, and you selected regex? That way you will end up with even more headache when you have to modify anything. I will just update with a proper way using DOMDocument. — Wiktor Stribiżew, Sep 14 '15 at 19:58
@stribizhev I didn't choose regex... I have to do it using regex; it's not my call. — Mickael_42, Sep 14 '15 at 20:02

score 1 · Answer 1 · edited May 23 '17 at 12:03

UPDATED ANSWER - CHANGING CASE IN HTML

You need to use DOMDocument with DOMXPath to safely get all title attributes and change them with mb_strtoupper:

$html = "<<YOUR_HTML>>";
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($dom);
$titles = $xpath->query('//a[@title]');

foreach($titles as $title) { 
   $title->setAttribute("title", mb_strtoupper($title->getAttribute("title"), 'UTF-8'));
}

echo $dom->saveHTML();

See IDEONE demo.

The //a[@title] xpath gets <a> elements (a) with an attribute title.

I use mb_strtoupper assuming you have UTF8 input. Please adjust accordingly, or if you are not planning to use Unicode, just use strtoupper.

ORIGINAL ANSWER BEFORE UPDATE

Here is a regex that will let you replace all FIND substrings between -start and end-:

(-start|(?!^)\G)(.*?)FIND(?=.*end-)

See demo

Replace with $1$2NEW_WORD.

PHP code:

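// \G continues from the end of the previous match; (?!^) stops it from matching at the very start.
$re = '#(-start|(?!^)\G)(.*?)FIND(?=.*end-)#';
$str = 'somes text -start bla bla FIND bla bla bla FIND bla FIND bla end-';
// $1 and $2 put back the text consumed before each FIND.
$subst = '$1$2NEW_WORD';
$result = preg_replace($re, $subst, $str);
echo $result;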
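
Since the question mentions a callback, the same pattern can presumably be handed to preg_replace_callback as well. This is only a minimal sketch: the counter and the FIND#n replacement text are made up for illustration and are not part of the original exercise.

```php
<?php
$re  = '#(-start|(?!^)\G)(.*?)FIND(?=.*end-)#';
$str = 'somes text -start bla bla FIND bla bla bla FIND bla FIND bla end-';

$count = 0; // hypothetical counter, just to show that every FIND reaches the callback

$result = preg_replace_callback($re, function (array $m) use (&$count) {
    $count++;
    // $m[1] and $m[2] hold the text consumed before FIND, so they are put back as-is.
    return $m[1] . $m[2] . 'FIND#' . $count;
}, $str);

echo $result, PHP_EOL;
// somes text -start bla bla FIND#1 bla bla bla FIND#2 bla FIND#3 bla end-
```

The callback is called once per match, so any transformation of FIND can go where the counter is used here.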

NOTE: If you have several start-end blocks, you will most probably need a tempered greedy token (?:(?!-start|end-|FIND).)* instead of .*? and .*.
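
To make the note above more concrete, here is a rough sketch of what the tempered greedy token variant might look like. The input string is hypothetical, and I deliberately left FIND out of the token used inside the lookahead (my assumption: a later FIND in the same block should not prevent the match). With plain .*? a match could otherwise run past end- into the text between two blocks.

```php
<?php
// Tempered greedy tokens: each one consumes a character only if no marker starts there.
$gap  = '(?:(?!-start|end-|FIND).)*'; // stands in for .*? before FIND
$tail = '(?:(?!-start|end-).)*';      // stands in for .* inside the lookahead
$re   = '#(-start|(?!^)\G)(' . $gap . ')FIND(?=' . $tail . 'end-)#s';

// Hypothetical input: two blocks, plus FINDs outside any block that must stay untouched.
$str = 'A FIND -start x FIND y FIND z end- B FIND -start FIND end- FIND';

$result = preg_replace($re, '$1$2NEW_WORD', $str);

echo $result, PHP_EOL;
// A FIND -start x NEW_WORD y NEW_WORD z end- B FIND -start NEW_WORD end- FIND
```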

The regex breakdown:

(-start|(?!^)\G) - This group contains two alternatives:
- -start - matches the literal string -start
- (?!^)\G - asserts the position in the original input string right after the last successful match. \G can also assert the beginning of the string, but we exclude it with the negative look-ahead.
(.*?) - Match any number of characters but as few as possible
FIND - literal string FIND
(?=.*end-) - only if there is literal string end- after the FIND.
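
One way to see the breakdown in action is to dump the groups of every match with preg_match_all. This is just an illustrative sketch using the sample string from the question; the printf output format is my own choice.

```php
<?php
$re  = '#(-start|(?!^)\G)(.*?)FIND(?=.*end-)#';
$str = 'somes text -start bla bla FIND bla bla bla FIND bla FIND bla end-';

preg_match_all($re, $str, $m, PREG_SET_ORDER);

foreach ($m as $i => $match) {
    // Group 1 is '-start' for the first match and '' once \G takes over.
    printf("match %d: group 1 = '%s', group 2 = '%s'\n", $i + 1, $match[1], $match[2]);
}
// match 1: group 1 = '-start', group 2 = ' bla bla '
// match 2: group 1 = '', group 2 = ' bla bla bla '
// match 3: group 1 = '', group 2 = ' bla '
```

Only the first match goes through the -start alternative; the following ones start exactly where the previous match ended, which is what \G asserts.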

For more information on \G operator, see When is \G useful application in a regex? and What good is \G in a regular expression?.

edited May 23 '17 at 12:03

Community

1
1

answered Sep 12 '15 at 12:37

Wiktor Stribiżew

Could you please explain to me how this "|(?!^)\G)" works? – Mickael_42 Sep 14 '15 at 17:44
`\G` asserts the position at the beginning of the string (as `^`) or the end of the last successful match. That means, it enforces all multiple matches to be consecutive in the input string. The negative lookahead before `\G` restricts it to match just consecutive matches. Since `-start` is the first alternative in the first group, it is checked for first. If a match is found (i.e. `somethingFIND`) then all consecutive matches (anything after the first `FIND` and again `FIND`) are matched. – Wiktor Stribiżew Sep 14 '15 at 18:07
I haven't managed to adapt it to my work... I'm missing something :x – Mickael_42 Sep 14 '15 at 19:24
Just as a [bonus](http://ideone.com/PXSYYU) for you. But do not use it: with large pages this will lead to catastrophic backtracking one day. – Wiktor Stribiżew Sep 14 '15 at 20:17
Please check this updated code with [`'/()/i'`](http://ideone.com/PXSYYU). I will delete it, and you must delete the "school" comment. – Wiktor Stribiżew Sep 14 '15 at 20:25
OK, I'll look into it and try to understand it. Thank you very much. – Mickael_42 Sep 14 '15 at 20:45
Your code doesn't match all the (title="") occurrences, only the first one, like mine does. – Mickael_42 Sep 15 '15 at 19:04
What is your requirement? You yourself said you needed to replace `title` attribute values with uppercase texts in `<a>` tags only, and now you wish to also modify `title` attributes in `img`, `div` and any other tags? Please be specific. If you need a regex that will match the `title` attribute in all tags, just remove `a\s+` from the one above. – Wiktor Stribiżew Sep 16 '15 at 09:18

score 0 · Answer 2 · edited May 23 '17 at 10:27

If you are using preg_replace_callback, why wouldn't the reluctant .*? be convenient?

$my_pattern = "/-start(.*?)end-/s";

$str = preg_replace_callback($my_pattern, function($matches) {
  return str_replace("FIND", "<b>FIND</b>", $matches[0]);
}, $str1.$str2);

Or do something else in callback. What are you trying to achieve?
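
As one possible "something else", the same two-step idea could probably be applied to the title exercise from the question: an outer callback receives each <a>...</a> block, and an inner one uppercases the title values inside it. This is a sketch under assumptions: the HTML is a shortened, hypothetical version of the sample, titles are assumed to be double-quoted, <a> elements are assumed not to be nested, and strtoupper assumes ASCII text.

```php
<?php
// Shortened, hypothetical version of the HTML from the question.
$html = '<a href=http://cyan.com title="a link">this is a link</a><br />'
      . '<a href=http://www.riven.com> Here too <img src=wrong.image title="and again">'
      . '<span>Even that<div title="same">all the same</div></span></a>'
      . '<p title="outside">not inside a link</p>';

// Outer callback: receives each <a ...>...</a> block (reluctant .*? stops at the first </a>).
$result = preg_replace_callback('#<a\b.*?</a>#is', function (array $block) {
    // Inner callback: uppercases every double-quoted title value inside that block.
    return preg_replace_callback('#(\btitle=")([^"]*)(")#i', function (array $attr) {
        return $attr[1] . strtoupper($attr[2]) . $attr[3];
    }, $block[0]);
}, $html);

echo $result, PHP_EOL;
```

The title on the <p> element stays lowercase because it never reaches the inner callback. As noted elsewhere in this thread, a DOM parser is still the safer choice for real HTML.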

edited May 23 '17 at 10:27

Community

answered Sep 12 '15 at 13:27

bobble bubble
