
I am using regex in PHP to remove orphan italic HTML tags from a line of text (leaving the text untouched if the tags are paired), using regex assertions. This works when I want to remove an <i> that is not followed by </i> later in the same line, using this statement:

$buffer = preg_replace('/(.*)<i>(?!.*<\/i>)(.*)/', "$1$2$3", $buffer);

But when I try to do the same for </i> tags not preceded by <i> in the same line, using:

$buffer = preg_replace('/(.*)(?<!<i>.*)<\/i>(.*)/', "$1$2$3", $buffer);

it returns null (which I take to mean a syntax error). I know that the problem is the "*" after "?<!<i>." because if I use:

$buffer = preg_replace('/(.*)(?<!<i>.)<\/i>(.*)/', "$1$2$3", $buffer);

and I test with a string that has only one character between <i> and </i> (like "TEST<i>1</i>WORKS"), it works fine. Of course, this is not useful, as I do not know how many characters there will be between <i> and </i> in practice. I assumed that the negative lookahead assertion in the first statement would behave symmetrically to the negative lookbehind assertion in the second, but that does not seem to be the case.
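For illustration, here is a rough sketch of one way to express "</i> not preceded by <i>" without any lookbehind. It anchors at the start of the line and uses a lookahead to capture a prefix that contains no <i>. The function name `remove_orphan_closers` is made up for this example, and the loop assumes `$buffer` holds a single line. It follows the rule exactly as stated above (it does not count pairs), so treat it as a starting point rather than a finished solution.

```php
<?php
// Hypothetical helper, not part of the original question.
// Removes every "</i>" that has no "<i>" anywhere before it on the line.
function remove_orphan_closers(string $line): string
{
    // ^((?:(?!<i>).)*) captures a prefix that contains no "<i>";
    // a "</i>" right after that prefix therefore has no opener before it.
    $pattern = '/^((?:(?!<i>).)*)<\/i>/';

    do {
        // Remove one orphan closer per pass; $count reports whether one was found.
        $result = preg_replace($pattern, '$1', $line, 1, $count);
        if ($result === null) {
            break; // regex error: keep the line as it was
        }
        $line = $result;
    } while ($count > 0);

    return $line;
}

echo remove_orphan_closers('TEST</i>1<i>2</i>WORKS'), "\n"; // TEST1<i>2</i>WORKS
```

Only a lookahead is used here, and lookaheads have no fixed-length requirement, so the pattern compiles.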

Can someone wise in regex tell me how to work around this issue?

Thanks to all and best regards.

  • [Famously](https://stackoverflow.com/a/1732454/740553): don't use regex to parse HTML. Use an actual DOM parser instead, like php-html-parser or simplehtmldom or the like. – Mike 'Pomax' Kamermans Apr 29 '21 at 19:10
  • Assertions are of no use here. You'd typically use a [`(*SKIP)(*FAIL)`](https://stackoverflow.com/questions/24534782/how-do-skip-or-f-work-on-regex) *alternative* for cases to ignore. – mario Apr 29 '21 at 19:27
  • Well, the text to be parsed is not HTML. I used "<i>" and "</i>", but I could have used #start# and #end#. I only use those two tags to delimit italic styling. Also, I think (*SKIP) and (*FAIL) are not available in my environment. – Alberto Suárez Apr 29 '21 at 20:40
  • According to https://www.php.net/manual/en/regexp.reference.assertions.php: "The contents of a lookbehind assertion are restricted such that all the strings it matches must have a fixed length." Since your pattern does not have a fixed length, this approach can't work. You may have to just iterate your string and look for matching symbol pairs, or if they may be nested, parse it into a symbol stack and look for mismatches? – Don R Apr 29 '21 at 21:12
  • Thank you Don R. That reference is what I needed – Alberto Suárez Apr 29 '21 at 23:19
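
The two suggestions in the comments above could be sketched roughly as follows. Both function names are invented for this example, and both assume one line of input with non-nested tags; nested or repeated openers are handled differently by the two variants and are outside what the question describes. The first variant relies on the PCRE backtracking verbs `(*SKIP)` and `(*FAIL)`, which the asker believes may not be available in their environment, so check that before relying on it.

```php
<?php
// Variant 1 (hypothetical name): the (*SKIP)(*FAIL) idea.
// The first alternative consumes a whole "<i>...</i>" pair and then discards
// the match, so the second alternative only matches tags outside a pair.
function strip_orphans_skip_fail(string $line): string
{
    return preg_replace('/<i>.*?<\/i>(*SKIP)(*FAIL)|<\/?i>/', '', $line);
}

// Variant 2 (hypothetical name): iterate over the tags and pair them up.
function strip_orphans_scan(string $line): string
{
    // Split into text and tags, keeping the tags as separate array elements.
    $parts = preg_split('/(<\/?i>)/', $line, -1, PREG_SPLIT_DELIM_CAPTURE);
    $open = null; // index in $parts of an "<i>" still waiting for its "</i>"

    foreach ($parts as $k => $part) {
        if ($part === '<i>') {
            if ($open === null) {
                $open = $k;      // remember this opener
            } else {
                $parts[$k] = ''; // a second opener before any closer: drop it
            }
        } elseif ($part === '</i>') {
            if ($open === null) {
                $parts[$k] = ''; // closer with no pending opener: orphan
            } else {
                $open = null;    // pair completed
            }
        }
    }
    if ($open !== null) {
        $parts[$open] = '';      // opener that was never closed: orphan
    }
    return implode('', $parts);
}

echo strip_orphans_skip_fail('a</i>b<i>c</i>d<i>e'), "\n"; // ab<i>c</i>de
echo strip_orphans_scan('a</i>b<i>c</i>d<i>e'), "\n";      // ab<i>c</i>de
```

Either variant handles orphan openers and orphan closers in one pass, so the two separate `preg_replace` calls from the question would not be needed.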

0 Answers