
I'm using file_get_contents() to fetch HTML and scrape some data from a website. The source is not always UTF-8, so I am using the ForceUTF8 class to fix it, but it doesn't work properly. What am I doing wrong?

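/* Load UTF8 HTML */
require_once('/ForceUTF8/Encoding.php');
use \ForceUTF8\Encoding;

function loadHTMLInUtf8($url) {
    // Fetch the raw bytes; they may be UTF-8, Latin-1, or a mix of both.
    $utf8_or_latin1_or_mixed_string = file_get_contents($url);
    return Encoding::toUTF8($utf8_or_latin1_or_mixed_string);
}

$html = loadHTMLInUtf8('http://www.example.com/');
$dom = new DOMDocument(); // not shown in the original snippet; assumed to be created like this
// The meta tag is prepended so that DOMDocument treats the input as UTF-8.
$dom->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">'.$html);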

Is there an alternative way of doing this?

Álvaro N. Franz
  • possible duplicate of [PHP DomDocument failing to handle utf-8 characters (☆)](http://stackoverflow.com/questions/11309194/php-domdocument-failing-to-handle-utf-8-characters-%e2%98%86) – jabbink Dec 08 '14 at 20:22

2 Answers


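You can use the built-in function utf8_encode(). Note that it always treats its input as ISO-8859-1, so it only does the same as the function above when the source really is Latin-1; text that is already UTF-8 gets encoded a second time. It is also deprecated as of PHP 8.2.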
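To illustrate the double-encoding caveat, here is a minimal sketch. It assumes the mbstring extension is available and uses mb_convert_encoding() with a fixed ISO-8859-1 source, which performs the same Latin-1 to UTF-8 conversion as utf8_encode(). The helper name latin1ToUtf8 is made up for this example.

```php
<?php
// Hypothetical helper: converts a string that is assumed to be ISO-8859-1
// into UTF-8, the same conversion utf8_encode() performs.
function latin1ToUtf8(string $s): string {
    return mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "\xE9";     // "é" as a single ISO-8859-1 byte
$utf8   = "\xC3\xA9"; // "é" as two UTF-8 bytes

// Latin-1 input is converted correctly.
var_dump(latin1ToUtf8($latin1) === $utf8); // bool(true)

// Input that is already UTF-8 is encoded a second time:
// each of the two bytes is treated as its own Latin-1 character.
var_dump(bin2hex(latin1ToUtf8($utf8)));    // string(8) "c383c2a9"
```

So a blind conversion is only safe when every page is known to be Latin-1; for mixed sources the encoding has to be detected first.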

jan

file_get_contents() does not convert encodings: it returns the bytes exactly as the server sent them, so a page that is not UTF-8 stays that way.

Try something like this:

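<?php
// Read a file or URL and return its content converted to UTF-8.
function file_get_contents_utf8($fn) {
    $content = file_get_contents($fn);
    // Strict detection: UTF-8 is tried first, ISO-8859-1 is the fallback.
    return mb_convert_encoding($content, 'UTF-8',
        mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
}
?>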

If this does not work, could you please give an example URL where it fails? (I checked the source of the ForceUTF8 library; it does not look very efficient, and I guess this small function could do the same using only native PHP functions.)
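Combined with the DOMDocument call from the question, this could look roughly like the following. It is only a sketch under a few assumptions: mbstring and the DOM extension are loaded, the input is either UTF-8 or ISO-8859-1, and the function name htmlToUtf8Dom is invented for the example. It takes the HTML as a string so it can be tried without a network request.

```php
<?php
// Hypothetical helper: converts raw HTML (UTF-8 or ISO-8859-1) to UTF-8
// and loads it into a DOMDocument.
function htmlToUtf8Dom(string $rawHtml): DOMDocument {
    // Strict detection; fall back to ISO-8859-1 if nothing matches.
    $encoding = mb_detect_encoding($rawHtml, ['UTF-8', 'ISO-8859-1'], true);
    if ($encoding === false) {
        $encoding = 'ISO-8859-1';
    }
    $html = mb_convert_encoding($rawHtml, 'UTF-8', $encoding);

    // Collect parser warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    // Same trick as in the question: tell the parser the input is UTF-8.
    $dom->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

// Usage with a Latin-1 encoded snippet ("café" with a single-byte é).
$dom = htmlToUtf8Dom("<p>caf\xE9</p>");
echo $dom->getElementsByTagName('p')->item(0)->textContent, "\n";
```

For a live page you would pass in the result of file_get_contents($url). Pages in other encodings (for example Windows-1252) would need a longer detection list.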

jabbink