PHP replace less than inside a HTML string

Question

Assuming you have a string <div>some text with symbol < inside </div>, how can I replace the < inside with &lt; without touching the less-than signs of the div tags?

This is just an example; the string could be larger and have more than one < occurrence.

Expected result: <div>some text with symbol &lt; inside </div>

What is the purpose? Are you going to store these in database? — revo, Mar 10 '16 at 18:56
You need to provide more detail: example inputs, the expected results for each input, the code you've tried and what you're actually getting as a result of your code. — Nathaniel Ford, Mar 10 '16 at 18:57
I need it to fix a bug. The other program thinks that < is the beginning of a tag — Thanh Trung, Mar 10 '16 at 18:58
You need to come up with a set of rules for when the `<` should be replaced and when it shouldn't. What should happen if the text is `
some text with symbol `? What about if it is `
some text with symbol
` — Patrick Q, Mar 10 '16 at 19:00
@PatrickQ this is not the case. The user is writing mathematical expressions inside a string, etc. — Thanh Trung, Mar 10 '16 at 19:02
@noob I don't want to replace "all" `<` only those inside a tag. DOM parser is killing performance; I thought we could use preg_replace? — Thanh Trung, Mar 10 '16 at 19:06

score 2 · Answer 1 · edited May 23 '17 at 12:23


Before you go any further:

Quoting from "RegEx match open tags except XHTML self-contained tags":

You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. [...] Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts.

Here's a step-by-step solution to solve your issue:

1. Use an XML parser, if you only have the full HTML;
2. Use htmlspecialchars() or htmlentities() on the content.

I won't explain how to do this, since there's already loads of articles on Google about this subject.
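For readers who want a starting point anyway, here is a minimal sketch of the htmlspecialchars() step. It assumes the text is still available on its own, before it gets wrapped in markup; the variable names and the sample text are illustrative only.

```php
<?php
// Hypothetical user-supplied text containing a bare "<".
$userText = 'some text with symbol < inside ';

// Escape the text first, then wrap it in the markup.
// ENT_QUOTES also escapes single and double quotes.
$html = '<div>' . htmlspecialchars($userText, ENT_QUOTES, 'UTF-8') . '</div>';

echo $html, PHP_EOL;
// Prints: <div>some text with symbol &lt; inside </div>
```

Escaping at the point where the text enters the markup avoids having to guess later which `<` belongs to a tag.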

And, please, STOP using regular expressions to handle HTML!

edited May 23 '17 at 12:23

Community


answered Mar 10 '16 at 19:07

Ismael Miguel


Ty, I'll try to use a parser if that's the only option – Thanh Trung Mar 10 '16 at 19:12

@ThanhTrung It isn't the only option. You have plenty of options. You can make a nearly perfect regular expression that took you 10 hours to perfect, but then you come across the most bizarre piece of **valid** HTML and you just watch your code sink itself. Or you can use a rock-solid XML parser that was tested over the years by many and perfected over all those years. Which one do you prefer? You have the option to pick whichever you want, but do you want to take risks? Regular expressions are meant for pattern matching, not parsing. – Ismael Miguel Mar 10 '16 at 19:16
I don't know. But if an XML parser knows which one is a tag, why can't a regex? – Thanh Trung Mar 10 '16 at 19:23
It can, if you are able to define **what** is a tag. And don't forget: XML/HTML is structured. The meaning of something may change depending on where it is. `<` cannot be parsed in **any** way with a regular expression. Also, is `< text>` a tag? Isn't it? What about self-closing tags? What about ``? I can give more examples of XML/HTML that can't be handled with regular expressions. – Ismael Miguel Mar 10 '16 at 19:27

C.Liddell · Answer 2 · 2016-03-10T20:47:28.977

1

This should work:

$html = preg_replace('/(?!<[a-zA-Z=\"\':; ]*[^ ]>|<\\/[a-zA-Z="\':; ]*>)(<)/', "&lt;", $html);

Edit: Though I would recommend doing what @Ismael Miguel suggested, if you want to do this purely with regexes, I've modified the above to work.
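A quick check of the pattern, as a sketch: the first input is the string from the question, and the second is a hypothetical input chosen to show where the character class runs out. The variable names are illustrative.

```php
<?php
// Pattern copied verbatim from the answer above.
$pattern = '/(?!<[a-zA-Z=\"\':; ]*[^ ]>|<\\/[a-zA-Z="\':; ]*>)(<)/';

// Input from the question: the bare "<" is escaped, the div tags are kept.
$simple = preg_replace($pattern, '&lt;', '<div>some text with symbol < inside </div>');
echo $simple, PHP_EOL;
// Prints: <div>some text with symbol &lt; inside </div>

// Hypothetical input: "." is not in the character class, so the
// lookahead does not recognise the opening <a ...> tag and escapes it too.
$link = preg_replace($pattern, '&lt;', '<a href="x.html">4 < 5</a>');
echo $link, PHP_EOL;
// Prints: &lt;a href="x.html">4 &lt; 5</a>
```

The second case is the kind of valid HTML that the comments under the first answer warn about: any attribute value with a character outside the class makes the tag look like text.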

edited Mar 10 '16 at 20:47

answered Mar 10 '16 at 19:00

C.Liddell



`<[a-zA-Z]*>` will match
but not `
`
– Greg Smirnov Mar 10 '16 at 19:06
@GregSmirnov You're right. I modified it as required. – C.Liddell Mar 10 '16 at 19:17
But it doesn't replace whatever after "Possible to have math expression", please test on the link I gave above – Thanh Trung Mar 10 '16 at 19:20
@ThanhTrung Your requirements in the question were to replace, in a piece of HTML, the character `<` with its HTML entity. This does exactly that. – Ismael Miguel Mar 10 '16 at 19:22
If you read carefully: this is an example, "the string could be larger and have more than one < occurrence." – Thanh Trung Mar 10 '16 at 19:24
@ThanhTrung I made a final revision; see if it works now. – C.Liddell Mar 10 '16 at 20:36

Greg Smirnov · Answer 3 · 2016-03-10T19:50:31.217

When you definitely know that there are no other tags inside the divs, you may use this snippet:

$html = '<div class="toto">some <div>text</div> with symbol < inside. Possible to have math expression < and > . </div><div> 4 < 5 > 2</div>';

$html = preg_replace_callback( '#(<div[^>]*>)(.*)(<\/div>)#Ui',
        function ($matches) { return $matches[1] . htmlentities($matches[2]) . $matches[3]; },
        $html);

echo $html;

// <div class="toto">some &lt;div&gt;text</div> with symbol < inside. Possible to have math expression < and > . </div><div> 4 &lt; 5 &gt; 2</div>
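The output above shows what happens when the precondition does not hold: the nested <div> is escaped and the outer text is left alone. For contrast, here is a sketch of the same callback on flat input, where no div contains another tag. The helper name `escape_div_contents` is hypothetical and not part of the original snippet.

```php
<?php
// Hypothetical helper wrapping the snippet above; the name is illustrative only.
function escape_div_contents(string $html): string
{
    // U makes the quantifiers lazy, so each div is matched up to
    // its nearest closing tag; i makes the match case-insensitive.
    return preg_replace_callback(
        '#(<div[^>]*>)(.*)(</div>)#Ui',
        function (array $m): string {
            return $m[1] . htmlentities($m[2]) . $m[3];
        },
        $html
    );
}

$flat = escape_div_contents('<div>some text with symbol < inside </div><div> 4 < 5 > 2</div>');
echo $flat, PHP_EOL;
// Prints: <div>some text with symbol &lt; inside </div><div> 4 &lt; 5 &gt; 2</div>
```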
