Encoding is hell on earth for me. I must be really dumb.
I'm extracting hashtags from Twitter to build my own bookmark library.
$url = 'https://api.twitter.com/1.1/statuses/mentions_timeline.json';
$requestMethod = 'GET';
$getfield = '?count=200&include_rts=1&max_id=397109847755210753';
$twitterGET = new TwitterAPIExchange($settingsGET);
$twitterPOST = new TwitterAPIExchange($settingsPOST);
$jsonString = $twitterGET->setGetfield($getfield)
    ->buildOauth($url, $requestMethod)
    ->performRequest();
$json_arr = json_decode($jsonString, true);
Since many tweets are in Spanish, they contain characters such as á. From what I have read, Twitter is supposed to encode everything in UTF-8, but when I convert the hashtag strings to lowercase I get Unicode escape sequences instead of the characters. See the code below:
foreach ($json_arr as $mytwit) {
    $twitText = $mytwit["text"];
    $twitHashTags = $mytwit["entities"]["hashtags"];
    foreach ($twitHashTags as $tag) {
        $tag = mb_strtolower($tag, 'UTF-8');
        $twitKeyWords[] = $tag;
        echo $tag;
    }
    #==>outputs: tecnolog\u00edas
}
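One possible reading of that output, sketched under assumptions: the payload below is a hand-written sample (not real API output) shaped like a v1.1 tweet, where each hashtag entity is an array with a `text` field and `indices`. In this sketch `json_decode` already turns `\u00ed` into the UTF-8 bytes for í, `mb_strtolower` handles it fine, and the escape only comes back when the string goes through `json_encode` without `JSON_UNESCAPED_UNICODE`. Whether that is what happens in my real script is an assumption.

```php
<?php
// Hypothetical sample payload (not real API output), shaped like a v1.1
// tweet whose hashtag entity carries a "text" field and "indices".
// Single quotes keep \u00ed as a literal JSON escape sequence.
$sampleJson = '[{"text":"Nuevas #Tecnolog\u00edas",'
    . '"entities":{"hashtags":[{"text":"Tecnolog\u00edas","indices":[7,19]}]}}]';

$tweets = json_decode($sampleJson, true);

$keywords = [];
foreach ($tweets as $tweet) {
    foreach ($tweet['entities']['hashtags'] as $hashtag) {
        // json_decode has already turned \u00ed into the UTF-8 bytes for í.
        $keywords[] = mb_strtolower($hashtag['text'], 'UTF-8');
    }
}

// json_encode escapes non-ASCII characters by default ...
$escaped = json_encode($keywords[0]);
// ... unless JSON_UNESCAPED_UNICODE is passed.
$unescaped = json_encode($keywords[0], JSON_UNESCAPED_UNICODE);

echo $keywords[0], PHP_EOL; // the lowercased UTF-8 string
echo $escaped, PHP_EOL;     // the same string, quoted, with \u escapes
echo $unescaped, PHP_EOL;   // the same string, quoted, without escapes
```

If the first `echo` prints the accented character correctly in a UTF-8 terminal or page, the string itself is fine and the `\u00ed` form is coming from a later encoding step.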
Next I tried to guess the actual encoding, running this code with every encoding I could find (below is just one of many attempts):
foreach ($twitHashTags as $tag) {
    $tag = iconv("ISO-8859-1", "UTF-8//IGNORE", $tag);
    $tag = mb_strtolower($tag, 'UTF-8');
    $twitKeyWords[] = $tag;
    echo $tag;
}
==>outputs: tecnolog\u00e3\u00adas (even worse, thanks)
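That second output is consistent with double encoding. A minimal sketch, assuming the tag was already valid UTF-8 before the `iconv` call: the two UTF-8 bytes of í (C3 AD) get read as the two ISO-8859-1 characters Ã and a soft hyphen, and lowercasing Ã gives ã, which is U+00E3. The sample string and variable names are only for illustration, and `//IGNORE` is left out because it is not needed here.

```php
<?php
// Assumed input: a tag that is already UTF-8 (í stored as bytes C3 AD).
$tag = "Tecnolog\u{00ed}as";

// Treating those bytes as ISO-8859-1 turns C3 into U+00C3 (Ã)
// and AD into U+00AD (soft hyphen), each re-encoded as UTF-8.
$doubleEncoded = iconv('ISO-8859-1', 'UTF-8', $tag);

// Lowercasing maps Ã (U+00C3) to ã (U+00E3); the soft hyphen is unchanged.
$lower = mb_strtolower($doubleEncoded, 'UTF-8');

// json_encode makes the code points visible as \u escapes.
echo json_encode($lower), PHP_EOL;

// Converting back the other way undoes the damage in this sketch.
$restored = iconv('UTF-8', 'ISO-8859-1', $doubleEncoded);
echo bin2hex($restored) === bin2hex($tag) ? 'restored' : 'different', PHP_EOL;
```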
I have two questions.
If it's conceptually impossible to guess the encoding of a string, why doesn't Twitter specify the encoding of a tweet in some field, for example
$twit["entities"]["bloody_encoding"]?
Does anybody have PHP/Twitter encoding advice for dummies?
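Guessing aside, a string can at least be checked for being well-formed UTF-8. A minimal sketch, assuming the mbstring extension is loaded; `isValidUtf8` is a hypothetical helper name and both sample strings are made up. Note that passing this check does not prove the text was meant as UTF-8 (double-encoded text passes too), it only rules out byte sequences that cannot be UTF-8.

```php
<?php
// Hypothetical helper; returns true when $s is well-formed UTF-8.
function isValidUtf8(string $s): bool
{
    return mb_check_encoding($s, 'UTF-8');
}

$utf8   = "tecnolog\u{00ed}as"; // í as the two UTF-8 bytes C3 AD
$latin1 = "tecnolog\xEDas";     // í as the single ISO-8859-1 byte ED

var_dump(isValidUtf8($utf8));
var_dump(isValidUtf8($latin1));

// json_decode rejects input that is not valid UTF-8 and reports why.
$decoded = json_decode('"' . $latin1 . '"');
$error = json_last_error();
var_dump($decoded, $error === JSON_ERROR_UTF8);
```

Since `json_decode` only accepts UTF-8 input, a response that decodes without `JSON_ERROR_UTF8` was at least well-formed UTF-8 when it arrived.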
Oh, I also tried this magic trick, but unfortunately it didn't work: How to decode Unicode escape sequences like "\u00ed" to proper UTF-8 encoded characters?