Corrupted UTF-8 encoding when reading Google feed / alerts

Question

Whenever I try to read a Google alert via PHP using something like:

$feed = file_get_contents("http://www.google.com/alerts/feeds/01445174399729103044/950192755411504138");

Regardless of whether I save $feed to a file or echo it to the output, all non-ASCII UTF-8 characters (i.e. those with diacritics) are replaced by whitespace. I have tried, without success, various combinations of:

utf8_encode
utf8_decode
iconv
mb_convert_encoding

I suspect the characters are already wrong when they arrive from the stream, but I'm lost because the same URI displays correctly in a browser. Can anyone shed some light on the issue?

The feed is already `utf-8` encoded. What [character set are you specifying in your response / meta](http://stackoverflow.com/questions/4279282/set-http-header-to-utf-8-using-php)? — Emissary, Aug 05 '14 at 20:20
The stream comes from Google. I save the string ($feed) directly to disk as a plain text file. There are no utf8 chars left. I tried it on different servers. Please try it too. Thanks. — René Teinze, Aug 05 '14 at 23:01
It's not clear what you are trying to do. If you are simply copying the feed verbatim and dumping the result into a file, then you shouldn't need to do anything with the string. *PHP* won't care about data that is simply *"passing through"*. It sounds more like you are having an issue with the application that you are using to view that text file afterwards. — Emissary, Aug 06 '14 at 06:17
The encodings and decodings were the desperate attempt to solve the problem. I use a coding text editor to view the file. Did you try it too? I would be very grateful. — René Teinze, Aug 06 '14 at 06:58

score 0 · Accepted Answer · answered Aug 06 '14 at 10:48

Sorry, you are absolutely correct - there is something untoward happening! Though it is not what you would first suspect... For reference, given that:

echo mb_detect_encoding($feed); // prints: ASCII

The Unicode data is lost before the remote server even sends it. It appears that Google inspects the User-Agent string in the request header, and file_get_contents sends none by default unless you supply a stream context.

Because it cannot identify the client making the request, it falls back to ASCII and forces that encoding. This is presumably a necessary fallback in the event of some kind of cataclysmic cock-up. [citation needed]

Simply naming your application is not enough, however: you need to include a known vendor. I'm unsure of the full extent of this, but I believe most folks include "Mozilla [version]" to work around the issue, for example:

$url = 'http://www.google.com/...';

// Send an explicit User-Agent that names a known vendor ("Mozilla");
// without one, the response appears to come back as ASCII only.
$feed = file_get_contents($url, false, stream_context_create([
    'http' => [
        'method' => 'GET',
        'header' => 'Accept-Charset: UTF-8' ."\r\n"
                   .'User-Agent: (Mozilla/5.0 compatible) MyFeedReader/1.0'
    ]
]));

file_put_contents('test.txt', $feed); // should now work as expected
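
To confirm that the User-Agent header actually made a difference, you could check the fetched string for valid, non-ASCII UTF-8 instead of eyeballing the file in an editor. The following is only a rough sketch: the helper names `buildFeedContextOptions` and `containsNonAsciiUtf8` are made up for illustration, and it assumes the mbstring extension is enabled (it must be if `mb_detect_encoding` works for you).

```php
<?php
// Hypothetical helper: builds the same stream-context options as the
// snippet above, with the User-Agent string passed in as a parameter.
function buildFeedContextOptions(string $userAgent): array
{
    return [
        'http' => [
            'method' => 'GET',
            'header' => "Accept-Charset: UTF-8\r\n"
                      . "User-Agent: {$userAgent}\r\n",
        ],
    ];
}

// Hypothetical helper: returns true only if $data is valid UTF-8 AND
// contains at least one byte outside the 7-bit ASCII range.
function containsNonAsciiUtf8(string $data): bool
{
    return mb_check_encoding($data, 'UTF-8')
        && preg_match('/[^\x00-\x7F]/', $data) === 1;
}
```

You would pass `buildFeedContextOptions('(Mozilla/5.0 compatible) MyFeedReader/1.0')` to `stream_context_create()` and then call `containsNonAsciiUtf8($feed)` on the result. If it returns false for a feed that you know contains diacritics, the characters were most likely stripped before the response reached PHP.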

You are a hero! Thank you very much. :) — René Teinze, Aug 06 '14 at 17:28
