
Encoding is hell on earth for me. I must be really dumb.

I'm extracting hashtags from Twitter to build my own bookmark library.

$url = 'https://api.twitter.com/1.1/statuses/mentions_timeline.json';
$requestMethod = 'GET';
$getfield = '?count=200&include_rts=1&max_id=397109847755210753';
$twitterGET = new TwitterAPIExchange($settingsGET);
$twitterPOST = new TwitterAPIExchange($settingsPOST);
$jsonString = $twitterGET->setGetfield($getfield)
         ->buildOauth($url, $requestMethod)
         ->performRequest();
$json_arr = json_decode($jsonString, true);

Since many tweets are in Spanish, they contain characters such as á.

From what I have read, Twitter is supposed to encode in UTF-8, but when I convert the hashtag strings to lower case I get Unicode escape sequences. See the code below:

foreach ($json_arr as $mytwit) {
    $twitText=$mytwit["text"];
    $twitHashTags=$mytwit["entities"]["hashtags"];
    foreach($twitHashTags as $tag){
        $tag=mb_strtolower($tag, 'UTF-8');
        $twitKeyWords[]=$tag;
        echo $tag;
    }
    #==>outputs: tecnolog\u00edas
 }

So next I tried to guess which encoding I was dealing with, running this code with every encoding available on this lovely planet (below is just one of many attempts):

foreach($twitHashTags as $tag){
    $tag = iconv("ISO-8859-1", "UTF-8//IGNORE", $tag);
    $tag=mb_strtolower($tag, 'UTF-8');
    $twitKeyWords[]=$tag;
    echo $tag;
}
==>outputs: tecnolog\u00e3\u00adas (even worse, thanks)
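
A minimal sketch of why this attempt makes things worse, assuming the tag was already valid UTF-8 before the `iconv` call. It also assumes that the `\u....` notation in the output comes from a `json_encode` call somewhere in the print path, which the snippets above do not show. The `"\u{...}"` string syntax needs PHP 7 or later.

```php
<?php
// Assumed input: the tag as correct UTF-8 ("í" is the two bytes C3 AD).
$tag = "tecnolog\u{00ed}as";

// Reading those UTF-8 bytes as ISO-8859-1 turns each byte into its own
// character: C3 becomes "Ã" (U+00C3) and AD becomes a soft hyphen (U+00AD).
$doubleEncoded = iconv('ISO-8859-1', 'UTF-8', $tag);

// Lowercasing maps "Ã" to "ã" (U+00E3); the soft hyphen is unchanged.
$lowered = mb_strtolower($doubleEncoded, 'UTF-8');

// json_encode escapes non-ASCII characters by default.
echo json_encode($tag), "\n";     // "tecnolog\u00edas"
echo json_encode($lowered), "\n"; // "tecnolog\u00e3\u00adas"
```

If that assumption holds, the string was never ISO-8859-1, and no conversion is needed before `mb_strtolower`.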

I have 2 questions.

  1. If it's conceptually impossible to guess the encoding of a string, why doesn't Twitter specify the encoding of a tweet in some field, for example $twit["entities"]["bloody_encoding"]?

  2. Does anybody have PHP-Twitter encoding advice for dummies?

Oh, I also tried this magic trick, but unfortunately it didn't work: How to decode Unicode escape sequences like "\u00ed" to proper UTF-8 encoded characters?

fartagaintuxedo
  • Is the message JSON? – Peter Bailey Aug 02 '16 at 18:11
  • Originally yes, it's the JSON from the Twitter API, but I used a regex to extract some keywords because I use custom tags in addition to the regular hashtags. For example, I can have a tag like `"_technology"` or even `"_computer technology;"`. Note the `';'` at the end, which lets me catch tags containing spaces... – fartagaintuxedo Aug 02 '16 at 18:16
  • I will edit my question later today to show the code through which I get the JSON and the text of the tweet – fartagaintuxedo Aug 02 '16 at 18:21
  • This is not a problem of character encoding as in UTF-8 vs ISO-something, but of the JSON itself not being decoded properly. The \u... notation should not "survive" JSON decoding in the first place, you should have proper UTF-8 characters all the way through. – CBroe Aug 02 '16 at 18:39
  • @CBroe So is there anything I can do apart from a manual workaround? – fartagaintuxedo Aug 02 '16 at 23:19
  • This does not need any workaround, if you just decode the JSON properly. The fact that you are messing around with it using regex as you said in a previous comment is most likely the root of your problem. Don’t do that. Decode it, and then iterate through the resulting data structure to find what you need. – CBroe Aug 03 '16 at 11:09
  • @CBroe I am decoding it; the Unicode escapes come straight out of the decoding. See this line in the code above: `$json_arr = json_decode($jsonString, true);`. I have tried without messing around with it and the Unicode escapes still show up. – fartagaintuxedo Aug 03 '16 at 13:13
  • Can you show some example JSON in its raw form as you get it from Twitter? – CBroe Aug 03 '16 at 16:42
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/119043/discussion-between-fartagaintuxedo-and-cbroe). – fartagaintuxedo Aug 03 '16 at 19:00
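
The approach suggested in the comments above (decode the JSON, then walk the resulting data structure instead of running a regex over the raw text) could look roughly like this. It is a sketch under assumptions: the sample JSON is a hypothetical, heavily trimmed stand-in for a real timeline response, and it relies on each hashtag entity being an array whose tag string sits under the `text` key.

```php
<?php
// Hypothetical, trimmed sample response; real tweets carry many more fields.
// Single quotes keep \u00ed as a literal JSON escape sequence.
$jsonString = '[{"text":"Nuevas #Tecnolog\u00edas",'
    . '"entities":{"hashtags":[{"text":"Tecnolog\u00edas","indices":[7,19]}]}}]';

$tweets = json_decode($jsonString, true);
if (!is_array($tweets)) {
    throw new RuntimeException('JSON decoding failed: ' . json_last_error_msg());
}

$keywords = [];
foreach ($tweets as $tweet) {
    foreach ($tweet['entities']['hashtags'] as $tag) {
        // $tag is an array, not a string; the hashtag itself is $tag['text'].
        $keywords[] = mb_strtolower($tag['text'], 'UTF-8');
    }
}

echo implode(', ', $keywords), "\n"; // tecnologías
```

After `json_decode`, the strings hold real UTF-8 characters, so `mb_strtolower` works on them directly and no `iconv` step is involved.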

1 Answer

I think this is because Twitter is not sending you raw UTF-8 characters; it's sending ASCII-only (or similar) JSON text with Unicode escape sequences:

https://twittercommunity.com/t/is-it-normal-to-have-u-escaped-unicode-text-in-text-field-of-json-response-or-you-actually-retrieves-utf-8-code/13047
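
As a rough illustration (not taken from the Twitter response itself, just a hand-written JSON string), the escapes are part of JSON's text syntax: `json_decode` turns them into UTF-8 bytes, and `json_encode` puts them back unless `JSON_UNESCAPED_UNICODE` is passed. So seeing `\u00ed` in output may simply mean the value was re-encoded as JSON before being printed.

```php
<?php
// A JSON string literal containing an escape; single quotes keep \u00ed literal.
$json = '"tecnolog\u00edas"';

$decoded = json_decode($json);

// After decoding, "í" is stored as the UTF-8 bytes C3 AD.
echo bin2hex($decoded), "\n";   // 7465636e6f6c6f67c3ad6173
echo mb_strlen($decoded, 'UTF-8'), " characters, ", strlen($decoded), " bytes\n";

// Default json_encode re-escapes every non-ASCII character.
echo json_encode($decoded), "\n";                         // "tecnolog\u00edas"

// With the flag, the UTF-8 character is written as-is.
echo json_encode($decoded, JSON_UNESCAPED_UNICODE), "\n"; // "tecnologías"
```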

Can you give me some more details about what you're doing, such as which API call you're making and whether you're using an existing Twitter client or SDK or rolled your own?

Peter Bailey