$string = file_get_contents('http://example.com');

if ('UTF-8' === mb_detect_encoding($string)) {
    $dom = new DOMDocument();
    // hack to preserve UTF-8 characters
    $dom->loadHTML('<?xml encoding="UTF-8">' . $string);
    $dom->preserveWhiteSpace = false;
    $dom->encoding = 'UTF-8';
    $body = $dom->getElementsByTagName('body');
    echo htmlspecialchars($body->item(0)->nodeValue);
}

This turns all non-ASCII UTF-8 characters into Å, ¾, ¤ and other rubbish. Is there another way to preserve UTF-8 characters?

Please don't post answers telling me to make sure I am outputting it as UTF-8; I have made sure I am.

Thanks in advance :)

Charles
Richard Knop
  • Where does the data (`$string`) come from? – Pekka Feb 10 '10 at 13:01
  • Can you provide a link to the URL you fetch using file_get_contents()? As I said in the other question, I suspect you are getting ISO-8859-1 or some other data, which *has* to get garbled when output in UTF-8. I wouldn't rely on mb_detect_encoding(). – Pekka Feb 10 '10 at 13:09
  • Sure, here is the link: http://www.futbalvsfz.sk/sutaze/sezona-2009-2010/dospeli/5.liga-jz – Richard Knop Feb 10 '10 at 13:12
  • Okay, I am convinced :) this is really strange. However, the default encoding for `htmlspecialchars()` is `iso-8859-1`. Can you change that to UTF-8? It shouldn't change anything, but just to make sure. http://de3.php.net/htmlspecialchars – Pekka Feb 10 '10 at 13:16
  • Is your browser set to UTF-8? :) – Petr Peller Feb 10 '10 at 13:33
  • @Pekka That's surely not the problem. I also tried displaying it without htmlspecialchars() or saving it to a file. – Richard Knop Feb 10 '10 at 15:02
  • So if you output `$string` without any DOM processing, it comes out fine? It's definitely the DOM screwing it up? – Pekka Feb 10 '10 at 15:03
  • @Pekka Btw, it works on my local pc with WampServer on Windows 7. It doesn't work on the server online though. – Richard Knop Feb 10 '10 at 15:05
  • @Pekka Yes. If I put the echo() before the DOM processing, it's OK with all UTF-8 chars. If I put it after the DOM parsing, it's all messed up. – Richard Knop Feb 10 '10 at 15:07
  • Really, really strange. DOMDocument is supposed to be native utf-8... Try my answer below, maybe it helps. – Pekka Feb 10 '10 at 15:08
  • I still think the problem is the lack of a charset declaration. PHP is probably sending the default Content-Type of text/html, without a charset. This makes the browser guess what the charset is. If the HTML contains a meta tag, it will use it. The HTML from the remote URL has a meta tag, so `echo $string;` is going to output the meta tag. The browser sees UTF-8 and uses it, and all is well. But with `echo $dombody`, no meta tag is output, so the browser guesses the wrong charset and misinterprets the characters. – goat Feb 10 '10 at 17:55
  • The page actually contains a meta tag with a UTF-8 content type. – Richard Knop Feb 10 '10 at 20:37
  • And the browser will ignore the meta tag if an HTTP header was sent that specified an encoding. Like I said, you need to send an HTTP header declaring the encoding. HTTP headers take precedence. – goat Feb 17 '10 at 23:17
  • possible duplicate of [PHP DomDocument saveHTML not encoding Japanese correctly](http://stackoverflow.com/questions/8218230/php-domdocument-savehtml-not-encoding-japanese-correctly) – cmbuckley Dec 24 '12 at 00:08
  • possible duplicate of [DOMDocument breaks encoding?](http://stackoverflow.com/questions/12676983/domdocument-breaks-encoding) – Ja͢ck Dec 24 '12 at 01:47

4 Answers


I had similar problems recently, and eventually found this workaround: convert all the non-ASCII characters to HTML entities before loading the HTML.

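// $string holds the fetched UTF-8 HTML; non-ASCII characters become entities
$string = mb_convert_encoding($string, 'HTML-ENTITIES', 'UTF-8');
$dom = new DOMDocument();
$dom->loadHTML($string);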
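
For anyone who wants to try the idea in isolation, here is a minimal, self-contained sketch. It is not the exact code above: it swaps in `mb_encode_numericentity()` for the `'HTML-ENTITIES'` conversion, since that pseudo-encoding is deprecated as of PHP 8.2. The function name `bodyTextFromUtf8Html` and the sample markup are made up for illustration, and the sketch assumes the mbstring and dom extensions are loaded.

```php
<?php
// Hypothetical sample input standing in for the fetched page.
$html = '<html><body><p>Žltý kôň, futbal</p></body></html>';

/**
 * Hypothetical helper: parse a UTF-8 HTML string and return the body text.
 */
function bodyTextFromUtf8Html(string $html): string
{
    // Rewrite every code point above ASCII as a numeric entity (&#NNN;),
    // so the HTML parser's encoding guess no longer matters.
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $ascii = mb_encode_numericentity($html, $convmap, 'UTF-8');

    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $dom->getElementsByTagName('body')->item(0);
    // The DOM extension hands strings back as UTF-8.
    return $body === null ? '' : $body->textContent;
}

echo bodyTextFromUtf8Html($html), PHP_EOL;
```

Because the string handed to `loadHTML()` is pure ASCII, the result should not depend on which encoding the underlying libxml build assumes by default.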
andrewmabbott
  • This is a great workaround but it would still be interesting to find out why your production server's DOM screws up the UTF-8 in the first place. Maybe something to ask the administrator, if there is one. – Pekka Feb 10 '10 at 16:27
  • I am the administrator :D and I have no idea. I am using a very common set up of Debian 5.0 Lenny. Maybe it's some security "feature" that does this? – Richard Knop Feb 10 '10 at 16:30
  • Furthermore, I'm using the default php5 package for Debian from the official repositories, so it's the default installation with default settings. I haven't changed any default settings; I just added a few extensions I need for my applications, like ioncube, imagick, gd, curl (I think that's all of them). – Richard Knop Feb 10 '10 at 16:40
  • @Pekka loadHTML() doesn't work with UTF-8 for me either (only loadXML() does; however, it doesn't work well with document fragments, because loadXML() needs well-formed documents, unlike loadHTML()). My libxml version is 2.6.32 (Hungarian Windows XP SP3). – István Ujj-Mészáros Nov 17 '10 at 21:45
  • This also works for passing MySQL UTF-8 content to the extract PHP function. I was having problems with MySQL data passed to dompdf and this resolved it. Many thanks! – machineaddict Mar 27 '12 at 08:51

In case it is definitely the DOM screwing up the encoding, this trick did it for me a while back the other way round (accepting ISO-8859-1 data). DOMDocument should be UTF-8 by default in any case but you can still try:

    $dom = new DOMDocument('1.0', 'utf-8');
Pekka

I had to add a UTF-8 header to get the correct view:

header('Content-Type: text/html; charset=utf-8');
fty4

At the top of the script where your PHP code lives (the code you posted here), make sure you send a UTF-8 header. I bet your encoding is some variant of Latin-1 right now. Yes, I know the remote web page is UTF-8, but this PHP script isn't.
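
A rough sketch of what that could look like, assuming nothing has been output before this point. The helper name `escapeUtf8` and the sample text are hypothetical; passing the encoding to `htmlspecialchars()` explicitly is an extra precaution, since its default charset depends on the PHP version.

```php
<?php
/**
 * Hypothetical helper: escape text for HTML output, naming the encoding
 * explicitly instead of relying on the PHP default.
 */
function escapeUtf8(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// Headers must go out before any body output, so guard the call.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// Sample text standing in for the string extracted from the DOM.
echo escapeUtf8('Žltý kôň & futbal'), PHP_EOL;
```

With the header in place the browser no longer has to guess the charset when the output contains no meta tag.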

goat