
I have a PHP script that uses CURL to fetch the title and description of a user-entered URL and displays them on the page (which includes a utf-8 charset meta tag), and I'm having problems with characters not displaying correctly.

I read in this answer that PHP's cURL functions encode strings as UTF-8 and that I need to decode them with utf8_decode. But I'm finding that utf8_decode is hit or miss: sometimes it helps, and sometimes it introduces unknown characters that weren't in the string before decoding.

I've included some examples below.

What's the proper way to handle encoding in this case?


Examples:

Here's the content fetched from a NY Times article with an emdash in the description. In this case, the decoded version displays the character properly:

[screenshot not included]

Here's content from another NY Times article with an emdash in the description, and here, decoding made the character display improperly:

[screenshot not included]

I'm finding that decoding causes problems with foreign language sites like this one in Spanish:

[screenshot not included]

I know I can detect the language of the page at the URL and decode or not based on that, but I'm finding plenty of English-language sites where decoding causes problems, like this one:

[screenshot not included]

Dave
  • I don't know the `file_get_contents_curl` function and neither does php.net; it may be helpful to add it to your question for clarity's sake. – Dale Jan 25 '18 at 16:11
  • Sorry, I inadvertently included the name of a function I created in my script. I've edited the question accordingly. Thanks for pointing it out. – Dave Jan 25 '18 at 16:22

2 Answers


After doing a lot more experimenting I stumbled on this solution, which fixed everything.

My script fetched the URL contents and loaded them into a DOM document like this:

$html = file_get_contents_curl($link_url);
$doc = new DOMDocument();
@$doc->loadHTML($html);

Per the linked article, I changed it to this:

$html = file_get_contents_curl($link_url);
$doc = new DOMDocument();
@$doc->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));

I also eliminated the use of utf8_decode.

And everything displayed properly.
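For readers on newer PHP versions, here is a minimal sketch of the same idea, not the exact code from my script. It assumes PHP 8 with the mbstring and dom extensions, and it assumes the fetched HTML is already UTF-8. The function name `extract_title_and_description` is hypothetical. It uses `mb_encode_numericentity` in place of the `'HTML-ENTITIES'` conversion, because that conversion (like `utf8_decode`) is deprecated as of PHP 8.2.

```php
<?php
// Hypothetical helper (not from the original script): extracts the title and
// meta description from HTML that is ASSUMED to be UTF-8 already.
function extract_title_and_description(string $html): array
{
    // Turn every non-ASCII character into a numeric entity so that
    // DOMDocument::loadHTML() cannot misread the bytes as ISO-8859-1.
    $ascii = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $doc = new DOMDocument();
    // Collect parser warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titleNode = $doc->getElementsByTagName('title')->item(0);
    $title = $titleNode ? trim($titleNode->textContent) : '';

    $description = '';
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('name')) === 'description') {
            $description = trim($meta->getAttribute('content'));
            break;
        }
    }

    // DOM hands back UTF-8 strings, so no utf8_decode() before output.
    return ['title' => $title, 'description' => $description];
}
```

The returned strings can be echoed directly on a page that declares a utf-8 charset, ideally through `htmlspecialchars()`.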

Dave

The server declares the page encoding, and you have to decode according to that. You can find out the encoding in advance by issuing a HEAD request and looking for the charset in the Content-Type header:

curl --head https://www.nytimes.com/
HTTP/1.1 200 OK
Server: Apache
Cache-Control: no-cache
X-ESI: 1
X-App-Response-Time: 0.70
Content-Type: text/html; charset=utf-8
X-PageType: homepage
...
...
Vary: Accept-Encoding, Fastly-SSL
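The same check can be done from PHP on the response you already fetch, without a separate HEAD request. The following is a rough sketch, assuming PHP 8 with the curl and mbstring extensions. The function names (`charset_from_content_type`, `body_to_utf8`, `fetch_as_utf8`) are made up for illustration, and treating a missing charset as UTF-8 is an assumption, not a rule.

```php
<?php
// Hypothetical helpers; the names are illustrative only.

// Pulls the charset parameter out of a Content-Type value such as
// "text/html; charset=ISO-8859-1". Returns null when none is declared.
function charset_from_content_type(?string $contentType): ?string
{
    if ($contentType !== null
        && preg_match('/charset\s*=\s*"?([\w.:-]+)"?/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

// Converts a response body to UTF-8 using the declared charset.
function body_to_utf8(string $body, ?string $charset): string
{
    if ($charset === null || $charset === 'UTF-8') {
        return $body; // ASSUMPTION: an undeclared charset is treated as UTF-8
    }
    try {
        return mb_convert_encoding($body, 'UTF-8', $charset);
    } catch (ValueError $e) {
        return $body; // PHP 8 throws on encoding names mbstring does not know
    }
}

// Fetches a URL and returns its body as UTF-8, based on the response header.
function fetch_as_utf8(string $url): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $body = curl_exec($ch);
    $contentType = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
    if ($body === false) {
        throw new RuntimeException('Request failed: ' . curl_error($ch));
    }
    return body_to_utf8($body, charset_from_content_type($contentType ?: null));
}
```

Keep in mind that the header can be missing or wrong, and some pages declare their charset only in a meta tag, so this covers the common case rather than every page.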

LMC