Dealing with character-encoded Twitter responses

Question

I'm building an application that interacts with the Twitter API.

So far my code handles the responses correctly, and I am happy with the way I am interacting with the search API. However, I am stuck when it comes to the actual content of the Twitter API responses.

Right now I search for tweets with specific hashtags using the Atom feed, i.e.:

// Fetch the Atom search feed for the given hashtag.
$url = 'http://search.twitter.com/search.atom?q=' . urlencode($hash_tag);
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$xml = curl_exec($ch);
curl_close($ch);

$twelement = new SimpleXMLElement($xml);

echo "<pre>";
foreach ($twelement->entry as $entry) {
    // Print each author's name, then the encoding PHP detects for it.
    echo $entry->author->name;
    echo '<br />';
    echo mb_detect_encoding($entry->author->name);
    echo '<br />';
}
echo "</pre>";

I have been trying different PHP functions to decode/convert to the correct character encoding, but no matter what I do, I always end up with the wrong output.

My output from this code is (crossed out for privacy):

xxxxxx (xxxxx xxxxxxx)
ASCII

xxxx_xxxxx (Chinny â™¥_â™¥)
UTF-8

kunlemyk ((Ë˜Ì¯Ë˜ ) hardekhunleyâ„¢)
UTF-8

xxxx_xxxxx (â™¥ify okwuosaâ™¥)
UTF-8

xxx_xxxx (Call me DRO)
ASCII

Why are some ASCII and some UTF-8? How can I ensure they are consistent? Can I convert them to ASCII? I'm pretty lost here. I have been stuck on this for ages and would really appreciate some help.

Regards,

Andrew

The "steps" are called encoding. Just take care you preserve and signal it properly. That's all. — hakre, Jun 03 '12 at 13:39
What output encoding are you using on your page? If it's UTF-8, it should work without any additional functions (especially remove the `utf8_decode()` call). — Pekka, Jun 03 '12 at 13:39
I'm using Magento Community Edition 1.7, so I guess it would be UTF-8. If I remove the above functions and simply echo the output, it still contains characters I do not recognise. — activeDev, Jun 03 '12 at 13:47

goat · Accepted Answer · 2012-06-03T15:26:28.797

UTF-8 was specifically designed so that ASCII is a proper subset of it. This was done for backwards compatibility.

A function that detects an encoding usually does so by educated guessing after inspecting the byte values. If the string in question contains nothing but ASCII characters, it could be called either ASCII or UTF-8. Again, this is because an ASCII string is a valid UTF-8 string by design.

It makes more sense to call a pure ASCII string "ASCII", because that is more specific: when guessing, you only know for sure that it is ASCII if every character you have encountered is an ASCII character. If the string contains at least one non-ASCII (multibyte) UTF-8 character and the rest is ASCII, the function should detect it as UTF-8. But without seeing at least one such character, it would be wrong to call the string UTF-8.
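A minimal sketch of the subset relationship described above, using `mb_check_encoding()` from the mbstring extension. The two sample strings are made up for illustration, and the `\u{...}` escape assumes PHP 7 or later:

```php
<?php
// Hypothetical sample strings; the second contains U+2665 (a heart symbol).
$plain = 'Call me DRO';
$fancy = "\u{2665}ify okwuosa\u{2665}";

// Both strings are valid UTF-8 ...
var_dump(mb_check_encoding($plain, 'UTF-8')); // bool(true)
var_dump(mb_check_encoding($fancy, 'UTF-8')); // bool(true)

// ... but only the pure ASCII one is also valid ASCII.
var_dump(mb_check_encoding($plain, 'ASCII')); // bool(true)
var_dump(mb_check_encoding($fancy, 'ASCII')); // bool(false)
```

So a mix of "ASCII" and "UTF-8" results from a detector does not mean the feed is inconsistent: every one of those strings can be handled as UTF-8.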

Edit: as for what to do about it, an ASCII string is a valid UTF-8 string, so you should just use UTF-8, as that works for both types. Make sure to declare this via a real HTTP header, not just a <meta> tag:

header('content-type:text/html;charset=utf-8');
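As a rough sketch of how this fits together, and of why the output in the question looks wrong without it: the snippet below sends the header, escapes a name for HTML without changing its encoding, and then simulates a client that misreads the same bytes as Windows-1252. The helper `render_name()` and the sample string are hypothetical, and the `\u{...}` escape assumes PHP 7 or later.

```php
<?php
// Declare the encoding before any output is sent. The guard skips the call
// if output has already started (for example in some CLI runs).
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// Hypothetical helper: escape a name for HTML, leaving its UTF-8 bytes intact.
function render_name(string $name): string
{
    return htmlspecialchars($name, ENT_QUOTES, 'UTF-8');
}

$name = "Chinny \u{2665}_\u{2665}"; // sample UTF-8 string
echo render_name($name), "\n";

// Simulate a client that wrongly decodes the UTF-8 bytes as Windows-1252:
// each 3-byte heart becomes three separate characters.
$misread = mb_convert_encoding($name, 'UTF-8', 'Windows-1252');
echo $misread, "\n";
```

The second line printed shows the same kind of garbling as the output in the question, which suggests the data itself is fine and only the declared page encoding is off.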

score 0 · Answer 2 · edited May 23 '17 at 12:28

Take a look at this post.

You might want to search for methods to detect encoding.

Bamdad Dashtban · answered Jun 03 '12 at 14:20
