I'm developing a Telegram bot in PHP that has to handle strings in which only a few basic HTML tags are allowed. All <, > and & symbols that are not part of a tag or an HTML entity must be replaced with the corresponding HTML entities (< with &lt;, > with &gt; and & with &amp;).
Example string:

<b>bold</b>, <strong>bold</strong>
<i>italic</i>, <em>italic</em>
<a href="http://www.example.com/" >inline URL</a>
<code>inline fixed-width code</code>
<pre>pre-formatted fixed-width code block</pre>
yes<b bad<>b> <bad& hi>;<strong >b<a<

I managed to replace & and < using regex. For example, to handle the < symbol I used a negative lookahead in this pattern: <(?!(?:(?:\/?)(?:(?:b>)|(?:strong>)|(?:i>)|(?:em>)|(?:code>)|(?:pre>)|(?:a(?:[^>]+?)?>))))

But I'm unable to build a pattern that replaces a > symbol which is not part of any tag. PCRE does not support unbounded quantifiers in lookbehinds. It does allow the alternatives inside a lookbehind to have different lengths, but each alternative must itself have a fixed length.

So I tried this pattern (still incomplete), in which every alternative has a fixed length: (?<!(?:(?:<b)|(?:<strong)|(?:<i)|(?:<em)|(?:<code)|(?:<pre>)|(?:<a)))> but it still fails with "Compilation failed: lookbehind assertion is not fixed length".
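A likely cause, going by the PCRE documentation: alternatives of different lengths are only accepted at the top level of a lookbehind, and wrapping them all in an outer (?:...) group turns them into a single branch of variable length. The following is a minimal sketch of the flattened form, not a full solution. The helper name escapeStrayGt and the shortened tag list are assumptions made for illustration, and tags with attributes (such as <a href="...">) cannot be covered this way because their length varies.

```php
<?php
// Hypothetical helper: escapes every ">" that does not close one of a few
// attribute-free tags. The tag list is shortened for illustration.
function escapeStrayGt(string $text): string
{
    // The alternatives sit at the top level of the lookbehind, so each one
    // may have its own fixed length (2 characters for "<b", 3 for "</b").
    $pattern = '~(?<!<b|</b|<i|</i)>~';

    return preg_replace($pattern, '&gt;', $text);
}

echo escapeStrayGt('<b>x</b> 1 > 0'), "\n";
```

Closing tags need their own alternatives here, because the text in front of the > in </b> is "</b", not "<b".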

ManzoorWani
  • Was gonna do a good answer for ya buddy. Leave, comeback with a good regex solution, but see you've already marked a short regex solution that will never work. Unfortunately, I can't erase my answer. I'll know better next time when I see your name. –  Mar 09 '17 at 18:10

2 Answers

The correct answer would be to use a DOM parser instead. For a quick and dirty (and sometimes faster) way though, you could use the (*SKIP)(*FAIL) mechanism which PCRE implements:

<[^<>&]+>(*SKIP)(*FAIL)|[<>&]+

See a demo on regex101.com.


A complete PHP walk-through would be:
<?php
$string = <<<DATA
<b>bold</b>, <strong>bold</strong>
<i>italic</i>, <em>italic</em>
<a href="http://www.example.com/" >inline URL</a>
<code>inline fixed-width code</code>
<pre>pre-formatted fixed-width code block</pre>
yes<b bad<>b> <bad& hi>;<strong >b<a<
DATA;

$regex = '~<[^<>&]+>(*SKIP)(*FAIL)|[<>&]+~';
$string = preg_replace_callback($regex,
    function($match) {
        return htmlentities($match[0]);
    },
    $string);

echo $string;
?>

Which yields:

<b>bold</b>, <strong>bold</strong>
<i>italic</i>, <em>italic</em>
<a href="http://www.example.com/" >inline URL</a>
<code>inline fixed-width code</code>
<pre>pre-formatted fixed-width code block</pre>
yes&lt;b bad&lt;&gt;b&gt; &lt;bad&amp; hi&gt;;<strong >b&lt;a&lt;
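
Note that the pattern above skips anything shaped like <...>, whatever the tag name, and it also re-escapes the & of entities that are already present (&amp; would become &amp;amp;). Below is a minimal sketch of a stricter variant that only skips the tags listed in the question and leaves existing entities alone. The function name escapeOutsideAllowedTags and the exact entity syntax it accepts are assumptions of this sketch, not part of the original pattern.

```php
<?php
// Hypothetical helper; the whitelist mirrors the tags listed in the question.
function escapeOutsideAllowedTags(string $text): string
{
    $pattern = '~
        (?:
            </?(?:b|strong|i|em|code|pre|a)\s*>         # attribute-free open/close tags
          | <a\s[^<>]*>                                 # <a ...> with attributes
          | &(?:[a-z][a-z0-9]*|\#[0-9]+|\#x[0-9a-f]+);  # entities already present
        )(*SKIP)(*FAIL)                                 # leave all of the above untouched
      | [<>&]                                           # escape every other occurrence
    ~xi';

    return preg_replace_callback(
        $pattern,
        function (array $match): string {
            // Each match is a single <, > or & character.
            return htmlspecialchars($match[0], ENT_NOQUOTES);
        },
        $text
    );
}

echo escapeOutsideAllowedTags('yes<b bad<>b> <bad& hi>;<strong >b<a< &amp; <b>ok</b>'), "\n";
```

With this variant a tag such as <script> is escaped rather than passed through, since it is not on the whitelist.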

However, as stated many times on Stack Overflow before, consider using a parser instead; after all, that is what parsers are made for.


A parser way could be:
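$dom = new DOMDocument();
// LIBXML_HTML_NOIMPLIED: do not wrap the fragment in <html>/<body>;
// LIBXML_NOERROR: suppress libxml error reports for malformed markup.
$dom->loadHTML($string, LIBXML_HTML_NOIMPLIED | LIBXML_NOERROR);

echo $dom->saveHTML();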

However, the snippet you presented is malformed, so regular expressions might be the only way to handle it.

Jan

You can find the special symbols that legitimately need converting to entities like this.

The key is parsing tags properly.
Disclaimer: if you don't do it the way shown below, there is no reason to even use regex; it will not work.

On each match, group 0 will contain either <, > or &.
You can add more alternatives; see the "Add more here" comment at the bottom of the regex.

The regex
(?:(?><(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?</\1\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>)(*SKIP)(*FAIL)|[<>]|[&](?!(?i:[a-z]+|(?:\#(?:[0-9]+|x[0-9a-f]+)));))

Explained

 (?:
      (?>                           # Atomic group
           <                             # Match tag forms and fail them with skip / fail verbs ( see below )
           (?:
                (?:
                     (?:
                                                        # Invisible content; end tag req'd
                          (                             # (1 start)
                               script
                            |  style
                               #|  head
                            |  object
                            |  embed
                            |  applet
                            |  noframes
                            |  noscript
                            |  noembed 
                          )                             # (1 end)
                          (?:
                               \s+ 
                               (?>
                                    " [\S\s]*? "
                                 |  ' [\S\s]*? '
                                 |  (?:
                                         (?! /> )
                                         [^>] 
                                    )?
                               )+
                          )?
                          \s* >
                     )

                     [\S\s]*? </ \1 \s* 
                     (?= > )
                )

             |  (?: /? [\w:]+ \s* /? )
             |  (?:
                     [\w:]+ 
                     \s+ 
                     (?:
                          " [\S\s]*? " 
                       |  ' [\S\s]*? ' 
                       |  [^>]? 
                     )+
                     \s* /?
                )
             |  \? [\S\s]*? \?
             |  (?:
                     !
                     (?:
                          (?: DOCTYPE [\S\s]*? )
                       |  (?: \[CDATA\[ [\S\s]*? \]\] )
                       |  (?: -- [\S\s]*? -- )
                       |  (?: ATTLIST [\S\s]*? )
                       |  (?: ENTITY [\S\s]*? )
                       |  (?: ELEMENT [\S\s]*? )
                     )
                )
           )
           >
      )                             # End atomic group
      (*SKIP)(*FAIL)

   |                              #or, 
      [<>]                          # Angle brackets

   |                              #or, 
      [&]                           # Ampersand 
      (?!                           # Only if not an entity
           (?i:
                [a-z]+ 
             |  (?:
                     \#
                     (?:
                          [0-9]+ 
                       |  x [0-9a-f]+ 
                     )
                )
           )
           ;     
      )

      # Add more here
 )
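
To see the entity check in isolation, here is a minimal PHP sketch that reuses only the ampersand branch of the regex above. The helper name escapeBareAmpersands and the sample input are assumptions made for illustration; the tag-skipping part is left out on purpose.

```php
<?php
// Hypothetical helper: escapes only the ampersands that do not start an
// entity, using the negative lookahead from the last branch of the regex.
function escapeBareAmpersands(string $text): string
{
    $pattern = '~&(?!(?i:[a-z]+|\#(?:[0-9]+|x[0-9a-f]+));)~';

    return preg_replace($pattern, '&amp;', $text);
}

echo escapeBareAmpersands('a & b &amp; &#38; &#x26; &bogus'), "\n";
```

Named, decimal and hexadecimal entities are left alone, while a bare & or an unterminated name such as &bogus is escaped.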