I'm developing a Telegram bot in PHP where I have to handle strings in which only some basic HTML tags are allowed. All <, > and & symbols that are not part of a tag or an HTML entity must be replaced with the corresponding HTML entities (< with &lt;, > with &gt; and & with &amp;).
Example string:
<b>bold</b>, <strong>bold</strong>
<i>italic</i>, <em>italic</em>
<a href="http://www.example.com/" >inline URL</a>
<code>inline fixed-width code</code>
<pre>pre-formatted fixed-width code block</pre>
yes<b bad<>b> <bad& hi>;<strong >b<a<
I managed to replace & and < using regex. For example, I used a negative lookahead in this pattern to get rid of stray < symbols:
<(?!(?:(?:\/?)(?:(?:b>)|(?:strong>)|(?:i>)|(?:em>)|(?:code>)|(?:pre>)|(?:a(?:[^>]+?)?>))))
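For illustration, the lookahead pattern above could be applied with preg_replace roughly as in the sketch below. The helper name escapeStrayLt, the constant name and the ~ delimiter are assumptions made for this example, not part of the actual bot code.

```php
<?php
// The lookahead pattern from above, wrapped in "~" delimiters (an assumed choice).
const STRAY_LT_PATTERN = '~<(?!(?:(?:\/?)(?:(?:b>)|(?:strong>)|(?:i>)|(?:em>)|(?:code>)|(?:pre>)|(?:a(?:[^>]+?)?>))))~';

// Hypothetical helper: replaces every "<" that does not start an allowed
// opening or closing tag with the &lt; entity.
function escapeStrayLt(string $text): string
{
    return preg_replace(STRAY_LT_PATTERN, '&lt;', $text);
}

echo escapeStrayLt('<b>bold</b> 1 < 2'), "\n"; // <b>bold</b> 1 &lt; 2
echo escapeStrayLt('yes<b bad<>b> <bad& hi>;<strong >b<a<'), "\n";
```

This only handles the < case. The > symbols are left untouched, which is the part still missing.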
But I'm unable to build a pattern to replace > symbols that are not part of any tag. PCRE does not support indefinite quantifiers in lookbehinds. It does allow the alternatives inside a lookbehind to have different lengths, but it requires each alternative to have a fixed length.
So I tried this pattern (still incomplete), in which all the alternatives have fixed lengths:
(?<!(?:(?:<b)|(?:<strong)|(?:<i)|(?:<em)|(?:<code)|(?:<pre>)|(?:<a)))>
But it still fails with: Compilation failed: lookbehind assertion is not fixed length
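To see which lookbehind variants a given PHP build accepts, a small probe like the following sketch can help. The function name patternCompiles is hypothetical. It relies on preg_match returning false when a pattern fails to compile. The first sample pattern uses a single fixed-length lookbehind and should compile. The second puts an unbounded quantifier inside the lookbehind and should be rejected.

```php
<?php
// Hypothetical probe: true if PCRE accepts the pattern, false if compilation fails.
function patternCompiles(string $pattern): bool
{
    // preg_match() returns false on a compile error; the warning it raises is silenced with @.
    return @preg_match($pattern, '') !== false;
}

var_dump(patternCompiles('~(?<!<b)>~'));      // fixed-length lookbehind
var_dump(patternCompiles('~(?<!<a[^>]*)>~')); // unbounded quantifier inside the lookbehind
```

The pattern I tried can be passed to the same probe, wrapped in delimiters, to reproduce the compilation error.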