I have some data that I collect with DomCrawler and store in an array, but it seems to fail on special characters such as è, à, ï, etc.

As an example, I get Ã¨ instead of è when I echo the result.

When I store my results in a .json file I get this: \u00c3\u00a8. My goal is to save the special character itself in the .json file.

I've tried encoding it, but that doesn't seem to give the result I want.

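use Symfony\Component\DomCrawler\Crawler;

$html = file_get_contents($url);
$crawler = new Crawler($html);

$h1 = $crawler->filter('h1');
$title = $h1->text();
// Attempted fix: convert the UTF-8 text to HTML entities
$title = mb_convert_encoding($title, "HTML-ENTITIES", "UTF-8");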

Is there any way I can have my special characters shown?

Thanks a lot!

Frank Lucas

1 Answer

When you pass the HTML to the constructor, the crawler assumes that it is in ISO-8859-1. You have to tell it explicitly that your DOM is in UTF-8 with the addHTMLContent method:

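use Symfony\Component\DomCrawler\Crawler;

$html = file_get_contents($url);
$crawler = new Crawler;
// Pass the charset explicitly instead of relying on the default
$crawler->addHTMLContent($html, 'UTF-8');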
Tmb
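To illustrate where `\u00c3\u00a8` can come from, here is a minimal sketch using only `json_encode` and the mbstring extension (no DomCrawler involved). It assumes the page really is UTF-8 and that something along the way reads those bytes as ISO-8859-1. The variable names are made up for the example.

```php
<?php
// "è" written as its two UTF-8 bytes, so the example does not depend on
// how this file itself is saved.
$utf8 = "\xC3\xA8";

// Simulate the misreading: treat each UTF-8 byte as one ISO-8859-1
// character and re-encode the result as UTF-8.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo json_encode($utf8), PHP_EOL;    // "\u00e8"
echo json_encode($misread), PHP_EOL; // "\u00c3\u00a8"

// Reversing the conversion recovers the original two bytes.
$repaired = mb_convert_encoding($misread, 'ISO-8859-1', 'UTF-8');
var_dump($repaired === $utf8);       // bool(true)
```

If your JSON contains `\u00c3\u00a8`, the text was probably decoded with the wrong charset once. Reversing the conversion as above is only a workaround; it is better to read the input with the right charset in the first place.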
  • I've tried your answer and I still get `\u00e8` in my json, unfortunately. – Frank Lucas Mar 25 '16 at 08:47
  • @FrankLucas Try changing the second argument of `addHTMLContent`, maybe to ISO-8859-1? – Tmb Mar 25 '16 at 08:50
  • @ThomsMauduit-Blin everything stays the same :( – Frank Lucas Mar 25 '16 at 08:54
  • @FrankLucas If you print `$html` before using the Crawler, are the chars good ? – Tmb Mar 25 '16 at 08:57
  • @ThomsMauduit-Blin no, they are not good. I get `Ã¨` in my browser instead of `è` – Frank Lucas Mar 25 '16 at 09:00
  • @FrankLucas So the problem does not reside in the Crawler, but in how you fetch the source. Try doing some operations on your `$html` before using it. Maybe you can find an answer here: http://stackoverflow.com/questions/2236668/file-get-contents-breaks-up-utf-8-characters#answer-15183803 – Tmb Mar 25 '16 at 09:25
  • @ThomsMauduit-Blin I've already tried doing this: `$html = mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8");` but it doesn't work either... – Frank Lucas Mar 25 '16 at 09:35
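
A note on the `\u00e8` mentioned in the comments: that is the standard JSON escape for è, and it decodes back to the same character, so the data is intact at that point. If the goal is to have the literal character in the .json file, `json_encode` accepts the `JSON_UNESCAPED_UNICODE` flag. Here is a minimal sketch, assuming the string is already valid UTF-8. The sample title is made up for the example.

```php
<?php
// "Café crème" written as UTF-8 bytes, independent of this file's encoding.
$title = "Caf\xC3\xA9 cr\xC3\xA8me";

// Default behaviour: non-ASCII characters are written as \uXXXX escapes.
$escaped = json_encode(['title' => $title]);

// With the flag, the characters are written literally as UTF-8.
$literal = json_encode(['title' => $title], JSON_UNESCAPED_UNICODE);

echo $escaped, PHP_EOL; // {"title":"Caf\u00e9 cr\u00e8me"}
echo $literal, PHP_EOL; // {"title":"Café crème"}

// Both forms decode to exactly the same data.
$same = json_decode($escaped, true) === json_decode($literal, true);
var_dump($same);        // bool(true)
```

Either output is valid JSON. The flag only changes how the file looks in an editor, not what a JSON parser reads from it.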