UTF8 with file_get_contents()

Question

I'm using file_get_contents() to fetch HTML and scrape some data from a website. The source is not always UTF-8, so I am using the ForceUTF8 class to fix it. It doesn't work correctly, though. What am I doing wrong?

/* Load UTF8 HTML */
require_once('/ForceUTF8/Encoding.php');
use \ForceUTF8\Encoding;
function loadHTMLInUtf8($url) {
    $utf8_or_latin1_or_mixed_string = file_get_contents($url);
    return Encoding::toUTF8($utf8_or_latin1_or_mixed_string);
}

$html=loadHTMLInUtf8('http://www.example.com/');
$dom->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">'.$html);

Is there an alternative way of doing this?

possible duplicate of [PHP DomDocument failing to handle utf-8 characters (☆)](http://stackoverflow.com/questions/11309194/php-domdocument-failing-to-handle-utf-8-characters-%e2%98%86) — jabbink, Dec 08 '14 at 20:22

score 2 · Answer 1 · answered Dec 08 '14 at 18:57

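You can use the built-in function utf8_encode() (deprecated as of PHP 8.2). It should do the same as the function in the other answer, as long as the source is ISO-8859-1: it always treats its input as that encoding, so text that is already UTF-8 gets encoded a second time.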
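
The double encoding mentioned here can be reproduced in a few lines. This is only a minimal sketch, assuming the mbstring extension is loaded and PHP 7 or later; the helper name toUtf8IfNeeded is hypothetical, and it assumes that anything that is not valid UTF-8 is ISO-8859-1. It uses mb_convert_encoding() with an explicit source encoding in place of utf8_encode().

```php
<?php
// Hypothetical helper: convert to UTF-8 only when the bytes are not already valid UTF-8.
// Assumption: input that is not valid UTF-8 is ISO-8859-1.
function toUtf8IfNeeded(string $bytes): string {
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}

$utf8   = "peque\xC3\xB1o"; // "pequeño" encoded as UTF-8
$latin1 = "peque\xF1o";     // "pequeño" encoded as ISO-8859-1

// Blindly converting input that is already UTF-8 encodes it twice.
// The result is the byte sequence that displays as "pequeÃ±o".
$doubleEncoded = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

var_dump($doubleEncoded === "peque\xC3\x83\xC2\xB1o"); // bool(true)
var_dump(toUtf8IfNeeded($utf8) === $utf8);             // bool(true)
var_dump(toUtf8IfNeeded($latin1) === $utf8);           // bool(true)
```

Checking validity first is a heuristic, not a guarantee: some ISO-8859-1 byte sequences also happen to be valid UTF-8, so a charset taken from the HTTP headers or the page's meta tag is more reliable when one is available.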

answered Dec 08 '14 at 18:57

jan


Thanks a lot for your answer, but what exactly is the difference between this and the answer above? – Álvaro N. Franz Dec 08 '14 at 19:14
@Alberich this was first. – Forien Dec 08 '14 at 19:18
Thank you a lot, the solution is below. Enjoy your day and thanks for helping :) – Álvaro N. Franz Dec 08 '14 at 20:13

score 1 · Accepted Answer · answered Dec 08 '14 at 18:58

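file_get_contents() does not convert encodings: it returns the bytes exactly as the server sent them, so a page that is not UTF-8 stays in its original encoding.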

Try something like this:

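<?php
// Fetch a file or URL and return its contents converted to UTF-8.
function file_get_contents_utf8($fn) {
    $content = file_get_contents($fn);
    if ($content === false) {
        return false; // the fetch failed, so there is nothing to convert
    }
    // Strict detection: try UTF-8 first, then fall back to ISO-8859-1.
    $encoding = mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true);
    return mb_convert_encoding($content, 'UTF-8', $encoding ?: 'ISO-8859-1');
}
?>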

If this does not work, could you please give an example URL where it fails? I checked the source of the ForceUTF8 library; it does not look very efficient, and I guess this small function could do the same using functions that are built into PHP.
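
If the bytes returned by such a function are already correct UTF-8 and the text is still garbled after parsing, the DOM parser is another place to look, since DOMDocument::loadHTML() makes its own guess about the charset of its input. The following is only a sketch of one possible workaround, assuming the dom and mbstring extensions and PHP 7.1 or later; the helper name titleFromUtf8Html is hypothetical. It rewrites every non-ASCII character as a numeric entity before parsing, so the parser's guess no longer matters.

```php
<?php
// Hypothetical helper: parse UTF-8 HTML and return the text of its <title>, or null.
function titleFromUtf8Html(string $html): ?string {
    // Rewrite every non-ASCII character as a numeric entity (ñ becomes &#241;),
    // leaving pure ASCII for the parser to read.
    $ascii = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $dom->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $node = $dom->getElementsByTagName('title')->item(0);
    return $node === null ? null : $node->textContent;
}

// "El pequeño Nicolás" written with explicit UTF-8 bytes.
$html = "<html><head><title>El peque\xC3\xB1o Nicol\xC3\xA1s</title></head><body></body></html>";
var_dump(titleFromUtf8Html($html) === "El peque\xC3\xB1o Nicol\xC3\xA1s"); // bool(true)
```

The input must already be UTF-8 for this to work, so it would be applied to the output of the conversion step rather than to the raw download.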

answered Dec 08 '14 at 18:58

jabbink


Thank you very much for your nice and complete answer. It is not working now though, with this example: http://www.zoomnews.es/468680/al-dente/pequeno-nicolas-quiso-montar-las-juventudes-faes – Álvaro N. Franz Dec 08 '14 at 19:10
It keeps saving the title like this: "El 'pequeÃ±o NicolÃ¡s' q..." :) – Álvaro N. Franz Dec 08 '14 at 19:11
@Alberich it displays correctly for me; be sure to clear your browser cache or use "incognito" mode. – Forien Dec 08 '14 at 19:12
Ok, which DOM HTML parser is `$dom`? Because possibly that is the problem (if I just echo the data of the `..._utf8` function the bytes are correct). – jabbink Dec 08 '14 at 19:14
I used $dom = new DOMDocument("4.01", "utf-8"); :) – Álvaro N. Franz Dec 08 '14 at 19:15
Thank you very much, it is perfect :) Have a wonderful day :) – Álvaro N. Franz Dec 08 '14 at 20:11
