Given a DOMDocument constructed with a stylesheet that contains an emoji character like so:

$dom = new DOMDocument();
$dom->loadHTML( "<!DOCTYPE html><html><head><meta charset=utf-8><style>span::before{ content: \"⚡️\"; }</style></head><body><span></span></body></html>" );

I've found some strange behavior when serializing the DOM back out to HTML.

If I do $dom->saveHTML( $dom->documentElement ) then I get (as desired):

<html><head><meta charset="utf-8">
<style>span::before{ content: "⚡️"; }</style>
</head><body><span></span></body></html>

However, if I instead do $dom->saveHTML() to save the entire document I get (erroneously):

<html><head><meta charset="utf-8">
<style>span::before{ content: "&#9889;&#65039;"; }</style>
</head><body><span></span></body></html>

Notice how the emoji “⚡️” is encoded as the HTML entities &#9889;&#65039; inside the stylesheet. Browsers do not decode HTML entities inside a style element, so the value is treated as a literal string; in CSS, an escape such as \26A1 would have to be used instead.

I tried setting $dom->substituteEntities = false but without any effect.

The same HTML entity conversion is also happening inside of script tags, which causes similar problems in browsers.
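
Since serializing the root element keeps the emoji in the output shown above, one possible workaround is to serialize that node and prepend the doctype by hand. The following is only a sketch under that assumption: the helper name serializeWithDoctype is hypothetical, and the behavior has not been verified here across PHP or libxml versions.

```php
<?php
// Hypothetical helper (not part of DOMDocument): serialize the root element
// instead of the whole document, then prepend the doctype manually.
function serializeWithDoctype(DOMDocument $dom): string
{
    // With a node argument, saveHTML() returns the markup for that node only.
    $html = $dom->saveHTML($dom->documentElement);
    return "<!DOCTYPE html>\n" . $html;
}

libxml_use_internal_errors(true); // collect parser warnings instead of printing them

$dom = new DOMDocument();
// "\u{26A1}\u{FE0F}" is the emoji from the example above, written as code points.
$dom->loadHTML("<!DOCTYPE html><html><head><meta charset=utf-8><style>span::before{ content: \"\u{26A1}\u{FE0F}\"; }</style></head><body><span></span></body></html>");

$out = serializeWithDoctype($dom);
echo $out, "\n";
```

The doctype is hardcoded here because the example document uses the HTML5 doctype; a document with a different doctype would need its own string.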

Test via online PHP shell: https://3v4l.org/jMfDd

Weston Ruter
  • One piece to the puzzle is that libxml doesn't seem to recognize ``. If I replace it with `` then the entity encoding is not performed. However, this still doesn't explain why the behavior differs between `$dom->saveHTML()` and `$dom->saveHTML( $dom->documentElement )`. – Weston Ruter Aug 02 '18 at 18:45
  • If you provide a node the result is a serialized fragment, not a whole document. That might be a reason for the different behavior. – ThW Aug 04 '18 at 17:16
  • I have [an answer](https://stackoverflow.com/a/59940487/1255289) for what's happening here. I can only guess at _why_ it's happening though. – miken32 Jan 28 '20 at 00:17

1 Answer

You should convert the encoding before loading HTML that contains emoji into DOMDocument:

$dom->loadHTML(mb_convert_encoding($htmlCode, 'HTML-ENTITIES', 'UTF-8'));

EDIT: As mentioned by the post owner, using the 'HTML-ENTITIES' encoding with mb_convert_encoding is deprecated as of PHP 8.2 (tested on 8.2.5, where it still works but emits a deprecation notice). For later versions of PHP, take a look at https://php.watch/versions/8.2/mbstring-qprint-base64-uuencode-html-entities-deprecated#html
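
The deprecation notice itself names mb_encode_numericentity as a replacement. A minimal sketch of that substitution follows, assuming $htmlCode holds UTF-8 markup; the conversion map is an illustrative choice that covers every code point above ASCII. Like the original call, it still produces numeric entities rather than retaining the emoji.

```php
<?php
// Assumption: $htmlCode is UTF-8 markup, as in the loadHTML() call above.
// "\u{26A1}\u{FE0F}" is the emoji from the question, written as code points.
$htmlCode = "<style>span::before{ content: \"\u{26A1}\u{FE0F}\"; }</style>";

// Conversion map: [first code point, last code point, offset, mask].
// Only characters from U+0080 upward are encoded, so markup characters
// such as < and > pass through unchanged.
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];

$encoded = mb_encode_numericentity($htmlCode, $convmap, 'UTF-8');
echo $encoded, "\n";
// prints: <style>span::before{ content: "&#9889;&#65039;"; }</style>
```

The result can then be passed to $dom->loadHTML() in place of the mb_convert_encoding() call.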

Mariano Argañaraz
  • That doesn't seem to work: https://3v4l.org/KZLZi I also get a notice: `Deprecated: mb_convert_encoding(): Handling HTML entities via mbstring is deprecated; use htmlspecialchars, htmlentities, or mb_encode_numericentity/mb_decode_numericentity instead`. Again, the goal is to _retain_ the emoji in the output, not convert to entities. – Weston Ruter May 04 '23 at 18:28
  • @WestonRuter Updated the answer based on your comment. I'm currently using this on a production server running PHP 8.2.5. Thanks for pointing out the deprecation; I'll take it into account for future updates. – Mariano Argañaraz May 04 '23 at 23:06
  • Ok, but it still doesn't seem the answer gets me the desired result. The goal is to have _no_ entities in the output. – Weston Ruter May 05 '23 at 04:14