260

I'm trying to parse some HTML using DOMDocument, but when I do, I suddenly lose my encoding (at least that is how it appears to me).

$profile = "<div><p>various japanese characters</p></div>";
$dom = new DOMDocument();
$dom->loadHTML($profile); 

$divs = $dom->getElementsByTagName('div');

foreach ($divs as $div) {
    echo $dom->saveHTML($div);
}

The result of this code is that I get a bunch of characters that are not Japanese. However, if I do:

echo $profile;

it displays correctly. I've tried saveHTML and saveXML, and neither displays correctly. I am using PHP 5.3.

What I see:

ã¤ãªãã¤å·ã·ã«ã´ã«ã¦ãã¢ã¤ã«ã©ã³ãç³»ã®å®¶åº­ã«ã9人åå¼ã®5çªç®ã¨ãã¦çã¾ãããå½¼ãå«ãã¦4人ã俳åªã«ãªã£ããç¶è¦ªã¯æ¨æã®ã»ã¼ã«ã¹ãã³ã§ãæ¯è¦ªã¯éµä¾¿å±ã®å®¢å®¤ä¿ã ã£ããé«æ ¡æ代ã¯ã­ã£ãã£ã®ã¢ã«ãã¤ãã«å¤ãã¿ãæè²è³éãåããªããã«ããªãã¯ç³»ã®é«æ ¡ã¸é²å­¦ã

What should be shown:

イリノイ州シカゴにて、アイルランド系の家庭に、9人兄弟の5番目として生まれる。彼を含めて4人が俳優になった。父親は木材のセールスマンで、母親は郵便局の客室係だった。高校時代はキャディのアルバイトに勤しみ、教育資金を受けながらカトリック系の高校へ進学

EDIT: I've simplified the code down to five lines so you can test it yourself.

$profile = "<div lang=ja><p>イリノイ州シカゴにて、アイルランド系の家庭に、</p></div>";
$dom = new DOMDocument();
$dom->loadHTML($profile);
echo $dom->saveHTML();
echo $profile;

Here is the html that is returned:

<div lang="ja"><p>イリノイ州シカゴã«ã¦ã€ã‚¢ã‚¤ãƒ«ãƒ©ãƒ³ãƒ‰ç³»ã®å®¶åº­ã«ã€</p></div>
<div lang="ja"><p>イリノイ州シカゴにて、アイルランド系の家庭に、</p></div>
cmbuckley
Slightly A.

11 Answers

680

DOMDocument::loadHTML will treat your string as being in ISO-8859-1 (the HTTP/1.1 default character set) unless you tell it otherwise. This results in UTF-8 strings being interpreted incorrectly.
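To see where the garbled output comes from, the misreading can be reproduced outside DOMDocument. This is only a sketch: it assumes the mbstring extension is available, `misreadAsLatin1` is a hypothetical name chosen for illustration, and it mimics the effect (UTF-8 bytes taken as ISO-8859-1 and then re-encoded) rather than libxml's actual code path.

```php
<?php
// Hypothetical helper: pretend the UTF-8 bytes are ISO-8859-1 and
// re-encode each byte as its own character in UTF-8.
function misreadAsLatin1(string $utf8): string
{
    return mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
}

// "é" is two bytes in UTF-8 (0xC3 0xA9); read as ISO-8859-1 they
// become the two characters "Ã" and "©".
echo misreadAsLatin1("\xC3\xA9"), "\n";
```

Multi-byte Japanese text goes through the same process, only with three bytes per character, which is why the output in the question is roughly three times as long as the input.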

If your string doesn't contain an XML encoding declaration, you can prepend one to cause the string to be treated as UTF-8:

$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $profile);
echo $dom->saveHTML();
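If the declaration should not stay in the document after loading, the two steps can be wrapped together. The following is a minimal sketch, not a definitive implementation: `loadUtf8Html` is a hypothetical helper name, and it assumes the input really is UTF-8 and carries no conflicting declaration of its own.

```php
<?php
// Hypothetical helper: load a UTF-8 HTML string by prepending an XML
// encoding declaration, then remove the declaration node again.
function loadUtf8Html(string $html): DOMDocument
{
    $dom = new DOMDocument();

    // Collect libxml warnings about imperfect markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML('<?xml encoding="utf-8" ?>' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the child list first: removing nodes from a live list
    // while iterating over it can skip entries.
    foreach (iterator_to_array($dom->childNodes) as $node) {
        if ($node instanceof DOMProcessingInstruction) {
            $dom->removeChild($node);
        }
    }

    return $dom;
}

$dom = loadUtf8Html('<p>イリノイ州シカゴにて</p>');
echo $dom->getElementsByTagName('p')->item(0)->textContent, "\n";
```

Reading `textContent` is a convenient check here, because PHP's DOM extension always hands node values back as UTF-8 regardless of how the document is later serialized.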

If you cannot know if the string will contain such a declaration already, there's a workaround in SmartDOMDocument which should help you:

$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML(mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8'));
echo $dom->saveHTML();

This is not a great workaround, but since not all characters can be represented in ISO-8859-1 (like these katakana), it's the safest alternative.
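Note that the 'HTML-ENTITIES' pseudo-encoding is deprecated as of PHP 8.2. A similar effect can be sketched with mb_encode_numericentity(), which is a regular mbstring function. The helper name `toNumericEntities` and the conversion map below are assumptions made for illustration; unlike 'HTML-ENTITIES', this variant always emits numeric references and never named ones.

```php
<?php
// Hypothetical helper: replace every non-ASCII character with a numeric
// character reference, so the parser only ever sees ASCII input.
function toNumericEntities(string $utf8): string
{
    // Map entry: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($utf8, $map, 'UTF-8');
}

$profile = '<p>イリノイ州シカゴにて</p>';
$dom = new DOMDocument();
$dom->loadHTML(toNumericEntities($profile));

// The parser decodes the references again, so the node text is plain UTF-8.
echo $dom->getElementsByTagName('p')->item(0)->textContent, "\n";
```

Because the markup itself is ASCII, tags and attributes pass through unchanged; only characters above U+007F are rewritten.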

If you're using DOMDocument to load HTML5, you might want to look at alternative solutions: How to make HTML5 work with DOMDocument?

cmbuckley
  • Tried with no success. Also, UTF-8 is the default for the constructor in PHP5, so should not be necessary, anyway. – Slightly A. Nov 21 '11 at 21:22
  • 3
    Yes, that did it. Thank you for your help. I tried saveHTML, saveXML, didn't think that the problem may have been coming during the load. – Slightly A. Nov 21 '11 at 21:34
  • 6
    The mb_convert_encoding call worked for me, whereas prepending the encoding declaration didn't. Likely because the document already had a conflicting declaration. Many thanks - saved me a lot of time chasing this down. – Peter Bagnall Jul 04 '13 at 12:43
  • 4
    `$dom->loadHTML('' . $content);` fixed it for me in PHP7 (so it is still an issue) - this is a really annoying problem, because I defined utf8 in the HTML document (with ``) but that has no effect, it seems to need the – iquito Apr 20 '16 at 14:00
  • 13
    Still in 2017 this answer is relevant and worked for me too. I had my database, multibyte, html meta tag and DOM encoding all set to utf8 and still had bad encoding on importing node from one DOC to another. http://php.net/manual/en/function.mb-convert-encoding.php was the fix. – Louis Loudog Trottier Mar 06 '17 at 21:43
  • 1
    Using 'HTML-ENTITIES' is a horrible hack, but in october 2017, that is the only trick that is suggested on this page that works on RHEL7! – Free Radical Oct 18 '17 at 14:38
  • 12
    `$dom->loadHTML(mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8'));` works great! Thank you, – vee Mar 08 '18 at 06:05
  • Prepending the xml encoding declaration to the string worked great. Opted for this solution vs the mb_convert_encoding. Thanks. – Mike Purcell May 17 '18 at 19:43
  • 1
    So what is HTML-ENTITIES? Is this some kind of constant? What does this have to do with the aforementioned ISO-8859-1 (ASCII) encoding? I know about the htmlentities function. BTW I still have the error. – Adamantus Jan 11 '19 at 23:04
  • It's one of the [supported encodings](http://php.net/manual/en/mbstring.supported-encodings.php) used by mbstring. The role of the conversion is to change the byte representation of the characters — e.g. "イ" will be represented with different bytes depending on the encoding — and the HTML-ENTITIES "encoding" is just a special example of that. [Here’s a gist](https://gist.github.com/cmbuckley/495da67b60e0453a7f18eb5060243566) that shows what's happening. ISO-8859-1 is the [default encoding for HTTP text documents according to the standard](https://en.wikipedia.org/wiki/ISO/IEC_8859-1#History). – cmbuckley Jan 12 '19 at 18:56
  • If my data has already been converted to strange characters, does that mean I have lost it? – Sboniso Marcus Nzimande Sep 17 '19 at 09:39
  • Depends what you mean - if you have stored some characters and they're reading as the wrong encoding, and you're seeing something like OP's example, then it's most likely recoverable, by looking at the hex encoding of the strange characters and using a conversion chart [such as this one for é](http://www.fileformat.info/info/unicode/char/00e9/charset_support.htm). If the strange characters have been replaced with `?` or �, then you won't be able to recover the original. – cmbuckley Sep 18 '19 at 11:06
  • With this, saveHTML displays correctly in the browser, but the actual string is not correct. @Greeso's answer is better. – neobie May 25 '20 at 04:18
  • thank you so much! the mb_convert_encoding solution worked great for me... – bhu Boue vidya Dec 31 '20 at 02:23
  • 10 years ago you gave the solution. I guess that is the problem of using libraries made for html4, any better recent alternatives? – billybadass Apr 06 '21 at 13:49
  • 1
    Be careful! Using `mb_convert_encoding()` will convert non-English text to entities. For example, `ส` becomes `&#3626;`. – vee Jan 07 '22 at 22:56
  • Yes, that’s the point, since those characters can’t be represented in ISO-8859-1. – cmbuckley Jan 07 '22 at 23:02
  • 1
    If string doesn't contain `` with character set, I recommended use `mb_convert_encoding()` as in this **answer** but also use `$dom->saveHTML($dom->documentElement)` to make the output as unicode text **NOT** HTML entities. Thank you @cmbuckley . – vee Jan 08 '22 at 00:13
  • This: `$dom->loadHTML(mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8'));` worked for me. Saved my life! – Designly May 14 '23 at 19:44
94

The problem is with saveHTML() and saveXML(): neither works correctly on Unix. They do not save UTF-8 characters correctly there, although they work on Windows.

The workaround is very simple:

If you try the default, you will get the error you described:

$str = $dom->saveHTML(); // saves incorrectly

All you have to do is save as follows:

$str = $dom->saveHTML($dom->documentElement); // saves correctly

This line of code will get your UTF-8 characters to be saved correctly. Use the same workaround if you are using saveXML().
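As a small self-contained sketch of this workaround: `saveFromRoot` below is a hypothetical wrapper name, and the example assumes the input is UTF-8 and was loaded with an encoding hint, so that the document holds the right characters before it is saved.

```php
<?php
// Hypothetical helper: serialize starting at the root element instead of
// calling saveHTML() with no argument.
function saveFromRoot(DOMDocument $dom): string
{
    return (string) $dom->saveHTML($dom->documentElement);
}

$dom = new DOMDocument();
// Assumption: the string is UTF-8, declared here via the XML encoding hint.
$dom->loadHTML('<?xml encoding="utf-8" ?><p>イリノイ州シカゴにて</p>');
echo saveFromRoot($dom), "\n";
```

Starting at `documentElement` also leaves out anything that sits before the root element, such as a doctype or a prepended declaration.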


Update

As suggested by "Jack M" in the comments section below, and verified by "Pamela" and "Marco Aurélio Deleu", the following variation might work in your case:

$str = utf8_decode($dom->saveHTML($dom->documentElement));

Note

  1. English characters do not cause any problem when you use saveHTML() without parameters (because English characters are saved as single byte characters in UTF-8)

  2. The problem happens when you have multi-byte characters (such as Chinese, Russian, Arabic, Hebrew, etc.)

I recommend reading this article: http://coding.smashingmagazine.com/2012/06/06/all-about-unicode-utf8-character-sets/. You will understand how UTF-8 works and why you have this problem. It will take you about 30 minutes, but it is time well spent.

Greeso
  • 8
    I had to utf8_decode while using this solution. Thanks! – Jack M. Sep 08 '14 at 23:48
  • What do you mean by utf8_decode? I apologize but I did not understand what you mean. – Greeso Sep 10 '14 at 13:15
  • 14
    This had to become utf8_decode($dom->saveHTML($dom->documentElement)) to preserve my special characters. Otherwise, they just became something else. Just mentioning it in case it helps someone else. – Jack M. Sep 10 '14 at 13:52
  • 6
    Thanks @MrJack. I also had to do the same to make it display without the strange characters `$str = utf8_decode($dom->saveHTML($dom->documentElement));` – Pamela Jan 15 '16 at 11:34
  • 3
    `utf8_decode($dom->saveHTML($dom->documentElement));` did it perfectly for me. – Marco Aurélio Deleu Oct 20 '16 at 22:48
  • This is not working for me. My Thai language still becomes **สวัสà¸à¸µà¸ าษาà¹à¸à¸¢**. – vee Jan 18 '19 at 06:08
  • @vee - Did you try what "Jack M" suggested in his comment above? Both "Pamela" and "Marco Aurélio Deleu" followed his suggestion and it worked for them. – Greeso Jan 18 '19 at 13:55
  • If that so, could you please update your answer to this one that they confirmed work. :) – vee Jan 18 '19 at 15:53
  • not work correctly - better is `mb_convert_encoding()` – Bruno Jun 03 '19 at 23:52
  • Great work, @Greeso, thanks! I had problems with `$dom -> saveXML()` where all extended Latin characters were being converted into HTML escapes and those HTML escapes were _then_ converted into XML escapes. Needless to say, strings like `Stani&#x10D;n&#xE9; n&#xE1;mestie` aren't much use in XML (or any data) Thanks to your suggestion, I replaced `$dom -> saveXML();` with `$dom -> saveXML($dom -> documentElement);`. Everything is now fixed and I now see `Staničné námestie` correctly encoded in UTF-8. Thank you. – Rounin Jan 26 '23 at 14:50
  • 1
    @Rounin-StandingwithUkraine Well wow, It is going to be 10 years since I wrote this answer, glad it is still relevant. – Greeso Jan 31 '23 at 01:39
21

Make sure the real source file is saved as UTF-8 (You may even want to try the non-recommended BOM Chars with UTF-8 to make sure).

Also in case of HTML, make sure you have declared the correct encoding using meta tags:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

If it's a CMS (as you've tagged your question with Joomla) you may need to configure appropriate settings for the encoding.

Hossein
  • I understand what you're saying, but I have no problems displaying the characters. if I do "echo $profile;" it works fine. it's when the DomDocument gets ahold of it that it starts failing. – Slightly A. Nov 21 '11 at 21:08
  • 2
    Your meta prevents saveHTML from encoding everything above ASCII into entities. The solution I was looking for :) – sod Jun 28 '13 at 13:32
  • 3
    As a side note, the newer `` tag doesn't work with DOMDocument. – Taylan Oct 16 '15 at 15:23
  • 1
    @Taylan: no problem at all with ``: see https://3v4l.org/AATjh – Casimir et Hippolyte Oct 17 '20 at 19:50
18

This took me a while to figure out but here's my answer.

Before using DomDocument I would use file_get_contents to retrieve URLs and then process them with string functions. Perhaps not the best way but quick. After being convinced Dom was just as quick I first tried the following:

$dom = new DomDocument('1.0', 'UTF-8');
if ($dom->loadHTMLFile($url) == false) { // read the url
    // error message
}
else {
    // process
}

This failed spectacularly in preserving UTF-8 encoding despite the proper meta tags, PHP settings, and all the rest of the remedies offered here and elsewhere. Here's what works:

$dom = new DomDocument('1.0', 'UTF-8');
$str = file_get_contents($url);
if ($dom->loadHTML(mb_convert_encoding($str, 'HTML-ENTITIES', 'UTF-8')) == false) {
}

etc. Now everything's right with the world.

Dharman
  • Just wanted to add to my answer above that another way to address this is with the following, suggested elsewhere as well: if ($dom->loadHTML('' . $str) == false). After posting my answer I found an occasion where my first suggestion failed but the second worked. –  Nov 20 '17 at 16:14
  • Works for me even without the params in `DomDocument('1.0', 'UTF-8')`. But in my case only partial html is loaded. – JKB Jun 17 '20 at 13:59
  • thanks a lot man, worked for me dealing with hebrew – Sagive Dec 28 '21 at 10:29
13

Use correct header for UTF-8

Don't be satisfied with "it works".

@cmbuckley, in his accepted answer, advised prepending <?xml encoding="utf-8" ?> to the document. However, using an XML declaration in an HTML document is a bit odd. HTML is not XML (unless it is XHTML), and the declaration can confuse browsers and other software on the way to the client (which may be the source of the failures reported by others).

I successfully used HTML5 declaration:

$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML('<!DOCTYPE html><meta charset="UTF-8">' . $profile);
echo $dom->saveHTML();

If you use another standard, use the matching header. DOMDocument follows the standards quite pedantically and seems to support HTML5 too (if it does not in your case, try updating the libxml extension).

Jan Turoň
  • 2
    There is no support for HTML5 in PHP, unfortunately, because libxml doesn't support it. You'd get the same results with ` `, i.e. it would just output whatever you typed. – miken32 Dec 23 '21 at 18:00
  • I'm running PHP 8.1.0 on Windows and adding only the tag works fine for me. No need to use neither – MMJ Mar 25 '22 at 22:24
12

You could prefix a line enforcing utf-8 encoding, like this:

@$doc->loadHTML('<?xml version="1.0" encoding="UTF-8"?>' . "\n" . $profile);

And you can then continue with the code you already have, like:

$doc->saveXML()
trincot
Ivan
5

You must feed DOMDocument a version of your HTML with a header that makes sense, just like HTML5.

$profile ='<?xml version="1.0" encoding="'.$_encoding.'"?>'. $html;

It may be a good idea to keep your HTML as valid as you can, so you don't run into issues when you start querying it. And stay away from htmlentities! That's an unnecessary back and forth that wastes resources. Keep your code sane!

cmbuckley
5

Use this to get the correct result:

$dom = new DOMDocument();
$dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $profile);
echo $dom->saveHTML();
echo $profile;

This operation

mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8');

is a bad approach, because $profile may already contain escaped symbols such as &lt; and &gt;, and mb_convert_encoding will not escape them a second time. That opens a hole for XSS and can produce incorrect HTML.

Alexander Goncharov
4

Works fine for me:

$dom = new \DOMDocument;
$dom->loadHTML(utf8_decode($html));
...
return  utf8_encode( $dom->saveHTML());
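One caveat worth checking before relying on this: utf8_decode() and utf8_encode() only convert between UTF-8 and ISO-8859-1, and both are deprecated as of PHP 8.2. The sketch below mirrors the same round trip with mbstring so its effect can be tried in isolation; `latin1RoundTrip` is a hypothetical name, and the snippet says nothing about how DOMDocument behaves in between.

```php
<?php
// Hypothetical helper mirroring the decode/encode round trip with mbstring:
// UTF-8 -> ISO-8859-1 -> UTF-8.
function latin1RoundTrip(string $utf8): string
{
    $latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
    return mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
}

// Characters that exist in ISO-8859-1.
var_dump(latin1RoundTrip('café') === 'café');
// Characters that do not exist in ISO-8859-1.
var_dump(latin1RoundTrip('イリノイ') === 'イリノイ');
```

Text that fits in ISO-8859-1 should survive the trip, while characters outside it, such as the Japanese text in the question, cannot be represented in the intermediate step.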
mMo
3

The only thing that worked for me was the accepted answer of

$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $profile);
echo $dom->saveHTML();

HOWEVER

This brought about a new issue: <?xml encoding="utf-8" ?> now appears in the output of the document.

The solution for me was then to do

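// Copy the child list first; removing nodes from the live list while iterating can skip entries.
foreach (iterator_to_array($dom->childNodes) as $xx) {
    if ($xx instanceof \DOMProcessingInstruction) {
        $xx->parentNode->removeChild($xx);
    }
}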

Some solutions said that, to remove the XML header, I had to call

$dom->saveXML($dom->documentElement);

This didn't work for me because, for a partial document (e.g. one with two <p> tags), only one of the <p> tags was returned.
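For partial documents like that, one option is to serialize each child of the body element in turn. This is a sketch under the assumption that the parser wrapped the fragment in html and body elements, which is its default behavior; `innerBodyHtml` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: serialize every child of <body>, so a fragment with
// several top-level elements is returned in full.
function innerBodyHtml(DOMDocument $dom): string
{
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    $html = '';
    foreach ($body->childNodes as $child) {
        $html .= $dom->saveHTML($child);
    }
    return $html;
}

$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="utf-8" ?><p>一</p><p>二</p>');
echo innerBodyHtml($dom), "\n";
```

Since only the children of the body are written out, the prepended declaration does not show up in the result, so no separate clean-up pass is needed for this use.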

Luke Madhanga
-1

The problem is that when you pass a parameter to DOMDocument::saveHTML(), you lose the encoding. In a few cases, you'll need to avoid the parameter and use plain string functions to find what you are looking for.

I think the previous answer works for you, but since that workaround didn't work for me, I'm adding this answer to help people who may be in the same situation.

xKobalt
copndz