
I'm experimenting with Telegram bot development. The only thing I haven't managed to get working is sending Unicode characters.

I call the "sendMessage" API from PHP with curl:

curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, array("chat_id" => $chat_id, "text" => "\u2b50"));

The code above should post a star icon in the chat, but instead it shows the literal text:

\u2b50

  • Escaping the text ("\\u2b50") doesn't work.
  • If the bot acts as an echo (replies with the received text) and I type "\u2b50" in the client, it replies with the star icon.
  • The keyboard keys (reply_markup.keyboard) show the same behavior.

Thanks in advance

EDIT: Solved with the solution from bobince (thanks!).

I used an inline callback like this:

$text = preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/', function ($match) {
    return iconv('UCS-4LE', 'UTF-8', pack('V', hexdec($match[1])));
}, $text);

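or, on PHP versions that still support the /e modifier (it was deprecated in PHP 5.5 and removed in PHP 7.0):

$text = preg_replace("/\\\\u([0-9a-fA-F]{4})/e", "iconv('UCS-4LE','UTF-8',pack('V', hexdec('$1')))", $text);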
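The four-digit pattern above handles one \uXXXX escape at a time, so an escape pair such as "\ud83d\udc4e" would be processed as two separate halves. One possible alternative, sketched here under the assumption of PHP 7 or later, is to hand each run of escapes to json_decode(), which understands both single escapes and surrogate pairs. The function name decode_u_escapes is hypothetical and not part of the original solution.

```php
<?php
// Hypothetical helper: decode literal \uXXXX escapes by letting json_decode() do the work.
function decode_u_escapes(string $text): string
{
    // Match runs of consecutive \uXXXX escapes so that surrogate pairs stay together.
    return preg_replace_callback(
        '/(?:\\\\u[0-9a-fA-F]{4})+/',
        function (array $match): string {
            // Wrap the run in quotes to form a JSON string literal.
            $decoded = json_decode('"' . $match[0] . '"');
            // If decoding fails (for example a lone surrogate), keep the original text.
            return is_string($decoded) ? $decoded : $match[0];
        },
        $text
    );
}

// Single quotes keep the backslashes literal, as they would arrive from user input.
echo bin2hex(decode_u_escapes('\u2b50')), "\n";
echo bin2hex(decode_u_escapes('\ud83d\udc4e')), "\n";
```

Text that contains no \u escapes passes through unchanged, since the callback only runs on matched runs.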
Charles Okwuagwu
mirko

2 Answers


"\u2b50"

PHP string literal syntax doesn't have \uXXXX escapes, primarily because PHP strings are not Unicode-based; they're just a sequence of bytes. (PHP 7.0 later added a braced form, "\u{2B50}", which produces the UTF-8 bytes, but the unbraced "\u2b50" is still left as literal text.)

Consequently, if you want to include a non-ASCII character in a string, you need to encode the character to bytes using whatever encoding the consumer of your output will be expecting.

If the Telegram web service is expecting to receive UTF-8 (and I've no idea if it is, but it's a good guess for any modern web app), then the UTF-8-encoded bytes for U+2B50 are 0xE2, 0xAD and 0x90, and so the string literal you should use is:

"\xE2\xAD\x90"
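As a quick sanity check, the following sketch inspects that literal. The variable names are only for illustration, and the "\u{2B50}" line assumes PHP 7.0 or later. It also shows that the same escapes inside single quotes are not interpreted at all.

```php
<?php
$star = "\xE2\xAD\x90";            // double quotes: three raw bytes
$literal = '\xE2\xAD\x90';         // single quotes: twelve literal characters

echo bin2hex($star), "\n";         // hex dump of the bytes that will be sent
echo strlen($star), "\n";          // byte count of the double-quoted string
echo strlen($literal), "\n";       // byte count of the single-quoted string

// The /u modifier makes PCRE reject subjects that are not valid UTF-8.
$isUtf8 = preg_match('//u', $star) === 1;
var_dump($isUtf8);

// PHP 7.0+ only: the braced code point escape yields the same bytes.
var_dump("\u{2B50}" === $star);
```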

If you want to convert a Unicode codepoint to a UTF-8 string more generally:

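function unichr($i) {
    // pack('V', ...) writes the code point as a 32-bit little-endian integer (UCS-4LE),
    // which iconv then converts to UTF-8.
    return iconv('UCS-4LE', 'UTF-8', pack('V', $i));
}

unichr(0x2B50);   // "\xE2\xAD\x90"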
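To make the conversion less opaque, here is a minimal sketch of the same encoding done by hand with bit operations, for environments where iconv is unavailable. The name codepoint_to_utf8 is hypothetical, and the scalar type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper: encode one Unicode code point as UTF-8 without iconv.
function codepoint_to_utf8(int $cp): string
{
    // Reject values outside the Unicode range and the surrogate block.
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException('Not a Unicode scalar value');
    }
    if ($cp < 0x80) {            // 1 byte: 0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {           // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {         // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(codepoint_to_utf8(0x2B50)), "\n";
```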
bobince
  • Now, how do I convert Unicode characters like "\ud83d\udc4e" (the thumbs-down emoticon)? I found this thread (with your response) and tried every combination, without success: http://stackoverflow.com/questions/2748956/how-would-you-create-a-string-of-all-utf-8-characters – mirko Jul 08 '15 at 14:37
  • 0xd83d, 0xdc4e are UTF-16 surrogate code units representing U+1F44E Thumbs Down, so `unichr(0x1F44E)`, which gives UTF-8 byte string `"\xF0\x9F\x91\x8E"`. – bobince Jul 09 '15 at 16:22
  • I'm impressed, but I didn't understand how to convert (0xd83d, 0xdc4e) into 0x1F44E... This post: http://stackoverflow.com/a/24763655/5091220 has functions I can use. Thank you – mirko Jul 09 '15 at 20:10
  • The accepted answer is correct, but make sure you put the code between double quotes, not single quotes! (I don't have enough reputation to comment and I thought this was important.) `$rocket .= "\xF0\x9F\x9A\x80"; // works` `$rocket .= '\xF0\x9F\x9A\x80'; // does not work` – godsaway Oct 07 '15 at 12:00
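For the surrogate question raised in the comments, the step from (0xd83d, 0xdc4e) to 0x1F44E is plain arithmetic defined by UTF-16: subtract the surrogate bases, combine the two 10-bit halves, and add 0x10000. A minimal sketch follows; the function name surrogates_to_codepoint is hypothetical, and the final iconv call assumes the iconv extension is enabled.

```php
<?php
// Hypothetical helper: combine a UTF-16 surrogate pair into one code point.
function surrogates_to_codepoint(int $high, int $low): int
{
    if ($high < 0xD800 || $high > 0xDBFF || $low < 0xDC00 || $low > 0xDFFF) {
        throw new InvalidArgumentException('Not a valid surrogate pair');
    }
    // The high surrogate carries the upper 10 bits, the low surrogate the lower 10 bits.
    return 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
}

$cp = surrogates_to_codepoint(0xD83D, 0xDC4E);
printf("U+%X\n", $cp);

// Same conversion as the answer's unichr(), inlined so this snippet stands alone.
$utf8 = iconv('UCS-4LE', 'UTF-8', pack('V', $cp));
echo bin2hex($utf8), "\n";
```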

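Set the charset to UTF-8 in the Content-Type header:

$headers = array(
    "Content-Type: application/x-www-form-urlencoded; charset=UTF-8"
);
// Request headers go in CURLOPT_HTTPHEADER; the form fields go in CURLOPT_POSTFIELDS.
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_POST, true);
// A query string (not an array) keeps the body form-urlencoded, matching the header.
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array("chat_id" => $chat_id, "text" => "\u2b50")));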
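Note that a charset parameter only describes the body; it does not turn the literal text \u2b50 into a star. The sketch below builds a form-urlencoded body with http_build_query() to show what is actually put on the wire once the text already contains UTF-8 bytes. The chat_id value 12345 is a placeholder, and whether the server accepts this content type is a separate question.

```php
<?php
// Placeholder values for illustration only; 12345 is not a real chat id.
$fields = array(
    'chat_id' => 12345,
    'text'    => "\xE2\xAD\x90",   // UTF-8 bytes of U+2B50, not the literal "\u2b50"
);

// http_build_query() percent-encodes each non-ASCII byte of the value.
$body = http_build_query($fields);
echo $body, "\n";

// The receiving side can reverse the encoding and recover the same bytes.
parse_str($body, $decoded);
echo bin2hex($decoded['text']), "\n";
```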
CommonKnowledge
  • There's no way to use "application/x-www-form-urlencoded": the server returns "400 (Bad Request)". The only accepted content type is "multipart/form-data", but I had no luck with "Content-Type: multipart/form-data; charset: UTF-8" either; I already tried it =( – mirko Jul 07 '15 at 21:14