0

Assuming you have a string `<div>some text with symbol < inside </div>`, how can I replace the `<` inside with `&lt;` without touching the less-than sign of the div tag?

This is just an example; the string could be larger and have more than one `<` occurrence.

Expected result: `<div>some text with symbol &lt; inside </div>`

Thanh Trung

3 Answers

2

Before you go any further:

Quoting from "RegEx match open tags except XHTML self-contained tags":

You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. [...] Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts.

Here's a step-by-step solution to your issue:

  1. Use an XML parser, if you only have the full HTML;
  2. Use htmlspecialchars() or htmlentities() on the content.

I won't explain how to do this, since there are already loads of articles on Google about this subject.
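For reference, a minimal sketch of step 2, assuming the text content is still available separately from the markup that wraps it. The variable names `$content` and `$html` are illustrative and not from the question. It does not cover step 1 (extracting the content from full HTML with a parser).

```php
<?php
// Assumption: the text content is available on its own, before it is
// wrapped in markup. $content and $html are illustrative names.
$content = 'some text with symbol < inside ';

// htmlspecialchars() converts <, >, & and quotes into HTML entities,
// so only the content is escaped and the surrounding tag stays intact.
$html = '<div>' . htmlspecialchars($content, ENT_QUOTES, 'UTF-8') . '</div>';

echo $html, PHP_EOL;
// <div>some text with symbol &lt; inside </div>
```

Escaping the content before it is mixed with tags avoids having to decide afterwards which `<` belongs to a tag.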

And, please, STOP using regular expressions to handle HTML!

Ismael Miguel
  • Ty, I'll try to use a parser if that's the only option – Thanh Trung Mar 10 '16 at 19:12
  • @ThanhTrung It isn't the only option. You have plenty of options. You can make a nearly perfect regular expression that took you 10 hours to perfect, but then you come across the most bizarre piece of **valid** HTML and you just watch your code sink itself. Or you can use a rock-solid XML parser that was tested over the years by many and perfected over all those years. Which one do you prefer? You have the option to pick whichever you want, but do you want to take risks? Regular expressions are meant for pattern matching, not parsing. – Ismael Miguel Mar 10 '16 at 19:16
  • I don't know. But if an XML parser knows which one is a tag, why can't a regex? – Thanh Trung Mar 10 '16 at 19:23
  • It can, if you are able to define **what** is a tag. And don't forget: XML/HTML is structured. The meaning of something may change depending on where it is. `<` cannot be parsed in **any** way with a regular expression. Also, is `< text>` a tag? Isn't it? What about self-closing tags? What about ``? I can give more examples of XML/HTML that can't be handled with regular expressions. – Ismael Miguel Mar 10 '16 at 19:27
1

This should work:

$html = preg_replace('/(?!<[a-zA-Z=\"\':; ]*[^ ]>|<\\/[a-zA-Z="\':; ]*>)(<)/', "&lt;", $html);

Edit: Though I would recommend doing what @Ismael Miguel suggested, if you want to do this purely with regexes, I've modified the above to work.
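As a quick check, the pattern can be run against the string from the question. The second input below is a made-up string (not from the question) with a digit inside an attribute value, which the character classes in the pattern do not allow for; it is only meant as a sketch of how such a pattern can miss valid markup.

```php
<?php
// Pattern copied unchanged from the answer.
$pattern = '/(?!<[a-zA-Z=\"\':; ]*[^ ]>|<\\/[a-zA-Z="\':; ]*>)(<)/';

// The string from the question: only the bare "<" is replaced.
$simple = '<div>some text with symbol < inside </div>';
$simpleResult = preg_replace($pattern, '&lt;', $simple);
echo $simpleResult, PHP_EOL;
// <div>some text with symbol &lt; inside </div>

// Hypothetical input: the digit in id="a1" is outside the character
// class, so the opening tag is no longer recognised and gets escaped too.
$withDigit = '<div id="a1">x < y</div>';
$withDigitResult = preg_replace($pattern, '&lt;', $withDigit);
echo $withDigitResult, PHP_EOL;
// &lt;div id="a1">x &lt; y</div>
```

Widening the character classes would fix this particular input, but each new kind of attribute or tag name needs the same treatment.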

C.Liddell
0

If you know for certain that there are no other tags inside the divs, you can use this snippet:

$html = '<div class="toto">some <div>text</div> with symbol < inside. Possible to have math expression < and > . </div><div> 4 < 5 > 2</div>';

$html = preg_replace_callback( '#(<div[^>]*>)(.*)(<\/div>)#Ui',
        function ($matches) { return $matches[1] . htmlentities($matches[2]) . $matches[3]; },
        $html);

echo $html;

// <div class="toto">some &lt;div&gt;text</div> with symbol < inside. Possible to have math expression < and > . </div><div> 4 &lt; 5 &gt; 2</div>
Greg Smirnov