
In the string:

<ut Type="start" Style="external" RightEdge="angle" DisplayText="P id=&quot;2&quot;">&lt;tr&gt;&lt;td width="10%" bgcolor="#C0C0C0" valign="top"&gt;&lt;p align="right"&gt;2&lt;/td&gt;&lt;td width="90%"&gt;</ut><Tu MatchPercent="100"><Tuv Lang="EN-US"><ut Type="start" RightEdge="angle" DisplayText="csf style=&quot;Italic CH&quot; italic=&quot;on&quot;">&lt;!-- 1 --&gt;&lt;FONT COLOR="#FF0000"&gt;&amp;lt;csf style=&quot;Italic CH&quot; italic=&quot;on&quot;&amp;gt;&lt;/FONT&gt;</ut>Battlefield™ V<ut Type="end" LeftEdge="angle" DisplayText="1">&lt;!-- 1 --&gt;&lt;FONT COLOR="#FF0000"&gt;&amp;lt;/1&amp;gt;&lt;/FONT&gt;</ut> (Xbox One)</Tuv><Tuv Lang="NL-NL"><ut Type="start" RightEdge="angle" DisplayText="csf style=&quot;Italic CH&quot; italic=&quot;on&quot;">&lt;!-- 1 --&gt;&lt;FONT COLOR="#FF0000"&gt;&amp;lt;csf style=&quot;Italic CH&quot; italic=&quot;on&quot;&amp;gt;&lt;/FONT&gt;</ut>Battlefield™ V<ut Type="end" LeftEdge="angle" DisplayText="1">&lt;!-- 1 --&gt;&lt;FONT COLOR="#FF0000"&gt;&amp;lt;/1&amp;gt;&lt;/FONT&gt;</ut> (Xbox One)</Tuv></Tu><ut Type="end" Style="external" LeftEdge="angle" DisplayText="P">&lt;/td&gt;&lt;/tr&gt;</ut>`

I want to replace &quot; with &amp;quot;

This should only happen if the string is surrounded by FONT tags, like in this case.

I'm using PHP:

$postproc = preg_replace('#(FONT|\G(?!\A))((?!/FONT).*?)&quot;(?!/FONT)#', '$1$2&amp;quot;', $postproc);

This however does not work.

Here we have a similar situation:

$postproc = preg_replace('#(DisplayText="|\G(?!\A))([^">]*)"(?!\s*>)#', '$1$2&quot;', $postproc);

This replaces all " quotes inside DisplayText attributes with &quot;. The main difference is that a DisplayText value ends with a single character ("), while the FONT section above ends with a sequence of several characters, so I need a negative lookahead instead of the simple [^">] negation.
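
To illustrate how the \G-anchored DisplayText pattern chains its replacements, here is a minimal sketch. The sample string and the variable names ($displaySample, $displayResult, $displayCount) are made up for the demonstration and are not taken from the real file.

```php
<?php
// Hypothetical sample: an attribute value that contains unescaped quotes.
$displaySample = '<ut DisplayText="P id="2"">text</ut>';

// Same pattern and replacement as above; the fifth argument receives
// the number of replacements that were made.
$displayResult = preg_replace(
    '#(DisplayText="|\G(?!\A))([^">]*)"(?!\s*>)#',
    '$1$2&quot;',
    $displaySample,
    -1,
    $displayCount
);

echo $displayResult, PHP_EOL; // <ut DisplayText="P id=&quot;2&quot;">text</ut>
echo $displayCount, PHP_EOL;  // 2
```

The first match starts at DisplayText=", the second continues from where the first ended (\G), and the closing quote is skipped because it is followed by >.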

I've really tried. For eight hours to be precise. I'm stuck.

$postproc is used on an entire file containing all kinds of tags, amongst which multiple FONT and DisplayText tags as mentioned above, and each tag can contain multiple replacements.

miken32
Loek van Kooten
  • Stop using regex to parse HTML and you'll have a much happier time. [ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ](https://stackoverflow.com/a/1732454/1902010) – ceejayoz Oct 15 '18 at 21:17
  • Trust me. I wish I could. Unfortunately, there is no way around this, mainly because this is all in very old Unicode that is no longer supported by PHP. – Loek van Kooten Oct 15 '18 at 21:24

2 Answers


You could use

(?:\G(?!\A)|FONT)
(?:(?!FONT).)+?\K
(?<!&amp;)&quot;

This needs to be replaced by &amp;quot;; see a demo on regex101.com.


Broken down, this reads:
(?:\G(?!\A)|FONT) # match FONT, or continue at the end of the last match
(?:(?!FONT).)+?\K # lazily match any characters without running over
                  # another FONT, then forget what has been matched
                  # thus far (\K)
(?<!&amp;)&quot;  # match &quot; only when it is not preceded by &amp;
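
The breakdown above can be tried out with a small script. This is only a sketch: the sample is a single FONT section cut from the string in the question, the replacement is assumed to be &amp;quot; as the question asks, and the variable names are hypothetical.

```php
<?php
// One FONT section taken from the string in the question.
$fontSample = '&lt;FONT COLOR="#FF0000"&gt;&amp;lt;csf style=&quot;Italic CH&quot; '
            . 'italic=&quot;on&quot;&amp;gt;&lt;/FONT&gt;';

// The three pattern parts from the breakdown, joined into one expression.
$fontPattern = '~(?:\G(?!\A)|FONT)(?:(?!FONT).)+?\K(?<!&amp;)&quot;~';

// Because of \K only the &quot; itself is replaced.
$fontResult = preg_replace($fontPattern, '&amp;quot;', $fontSample, -1, $fontCount);

echo $fontResult, PHP_EOL;
echo $fontCount, PHP_EOL; // 4
```

Each &quot; between the opening and the closing FONT is found by continuing from the previous match, and the chain stops once the next FONT (the closing one) is reached.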


Better yet: where does this string come from? Can you manipulate the origin? Also note that the pattern above won't work with nested FONT "tags".
Jan
  • Thank you very much. Thing is, this doesn't work either. You can see if you try it on the new text I'm about to add to my original question right now. Your code also fetches the &quots in the DisplayText="blablabla" parts, but those need to stay as is. The code to be processed can't be changed: it comes from very old software my client is still using, and there's nothing I can do to change its output. – Loek van Kooten Oct 15 '18 at 21:13

This works though!

$postproc = preg_replace('#(?:\G(?!\A)|&lt;FONT)(?:(?!FONT).)+?\K(?<!&amp;)&quot;#', '&amp;quot;', $postproc);

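It's the extra &lt; in the first non-capturing group that does the trick: the closing &lt;/FONT no longer counts as a starting point, so a match cannot run on into the DisplayText attributes that follow it.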
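
The difference can be checked with a shortened, made-up sample that has a DisplayText attribute after a closing FONT. This is a sketch for comparison only; the sample and the variable names are assumptions, not the real file.

```php
<?php
// Hypothetical, shortened sample: two <ut> elements, each with a
// DisplayText attribute followed by an escaped FONT section.
$sample = '<ut DisplayText="csf style=&quot;Italic CH&quot;">'
        . '&lt;FONT COLOR="#FF0000"&gt;&amp;lt;csf style=&quot;Italic CH&quot;&amp;gt;&lt;/FONT&gt;</ut>'
        . 'Battlefield V'
        . '<ut DisplayText="csf italic=&quot;on&quot;">'
        . '&lt;FONT COLOR="#FF0000"&gt;&amp;lt;csf italic=&quot;on&quot;&amp;gt;&lt;/FONT&gt;</ut>';

// Pattern from the first answer: any FONT, including the one in /FONT, starts a match.
$loose  = '#(?:\G(?!\A)|FONT)(?:(?!FONT).)+?\K(?<!&amp;)&quot;#';
// Pattern from this answer: only the opening &lt;FONT starts a match.
$strict = '#(?:\G(?!\A)|&lt;FONT)(?:(?!FONT).)+?\K(?<!&amp;)&quot;#';

$looseResult  = preg_replace($loose, '&amp;quot;', $sample, -1, $looseCount);
$strictResult = preg_replace($strict, '&amp;quot;', $sample, -1, $strictCount);

echo $looseCount, PHP_EOL;  // 6: also rewrites the second DisplayText value
echo $strictCount, PHP_EOL; // 4: only the quotes inside the FONT sections
echo $strictResult, PHP_EOL;
```

With the looser pattern, the FONT inside the first &lt;/FONT starts a new match that reaches the &quot; entities of the second DisplayText attribute; with &lt;FONT as the starting point both attributes stay as they are.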

Loek van Kooten