I have a problem with the Telegram Bot API. I'm trying to extract a URL from a message. The documentation for the MessageEntity type says that offset and length are specified in UTF-16 code units. I've tried many ways to get the substring from the text (mb_convert_encoding, iconv, json_encode, etc.), but I never get the correct link. It works for plain text, but not when the text contains emoji.
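To illustrate why emoji break the offsets, here is a minimal sketch that counts the same string three ways. The sample text is made up, and the mbstring extension is assumed to be available.

```php
<?php
// Hypothetical sample text: one emoji, a space, then a URL.
$text = "😀 https://example.com";

// Code points, as counted by mb_strlen(): the emoji counts as 1.
$codePoints = mb_strlen($text, 'UTF-8');

// UTF-16 code units: the emoji lies outside the BMP, so it needs a surrogate pair (2 units).
$utf16Units = intdiv(strlen(mb_convert_encoding($text, 'UTF-16LE', 'UTF-8')), 2);

// UTF-8 bytes, as counted by strlen(): the emoji takes 4 bytes.
$utf8Bytes = strlen($text);

printf("%d code points, %d UTF-16 units, %d UTF-8 bytes\n", $codePoints, $utf16Units, $utf8Bytes);
// prints "21 code points, 22 UTF-16 units, 24 UTF-8 bytes"
```

For this text the URL starts at UTF-16 offset 3, but at code point index 2, so neither mb_substr() nor substr() on the UTF-8 string can use the entity offset directly.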
1 Answer
$output = json_decode(file_get_contents('php://input'), true);
$message = $output['message']['text'] ?? '';
$entities = $output['message']['entities'] ?? [];

function getURLs($message, $entities) {
    $URLs = [];
    // Entity offsets and lengths are counted in UTF-16 code units (2 bytes each),
    // so convert the text to UTF-16 and slice it by bytes.
    // Alternative: $message_encode = iconv('utf-8', 'utf-16le', $message);
    $message_encode = mb_convert_encoding($message, "UTF-16LE", "UTF-8");
    foreach ($entities as $entity) {
        // Some entities carry the link in a separate 'url' field.
        if (!empty($entity['url'])) {
            $URLs[] = $entity['url'];
        }
        // Entities of type 'url' must be cut out of the message text.
        if ($entity['type'] == 'url') {
            $URL16 = substr($message_encode, $entity['offset'] * 2, $entity['length'] * 2);
            // Alternative: $URLs[] = iconv('utf-16le', 'utf-8', $URL16);
            $URLs[] = mb_convert_encoding($URL16, "UTF-8", "UTF-16LE");
        }
    }
    return $URLs;
}

$URLs = getURLs($message, $entities);
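You can use either iconv or mb_convert_encoding, as long as the same encoding is used in both directions. UTF-16LE is the safer choice: with iconv, plain UTF-16 may prepend a byte-order mark, which would shift every offset by one code unit. See also "PHP - length of string containing emojis/special chars".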
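The slicing step can be checked without a webhook. The following is a minimal sketch, not part of the Bot API: the helper name utf16Substr, the message text, and the entity values are all made up for illustration, and the mbstring extension is assumed.

```php
<?php
// Hypothetical helper: slice $text using an offset and length given in UTF-16 code units.
function utf16Substr(string $text, int $offset, int $length): string {
    $utf16 = mb_convert_encoding($text, 'UTF-16LE', 'UTF-8');
    // Each UTF-16 code unit is 2 bytes, so scale both values before the byte-wise substr().
    $slice = substr($utf16, $offset * 2, $length * 2);
    return mb_convert_encoding($slice, 'UTF-8', 'UTF-16LE');
}

// Made-up message and entity, shaped like the fields of a Telegram update.
// The emoji occupies 2 code units, so the URL starts at offset 7.
$text = "😀 see https://example.com 👍";
$entity = ['type' => 'url', 'offset' => 7, 'length' => 19];

$url = utf16Substr($text, $entity['offset'], $entity['length']);
echo $url, "\n"; // prints "https://example.com"
```

With the same numbers, mb_substr($text, 7, 19) starts one character late and returns "ttps://example.com " instead, because it counts the emoji as a single character.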

Serhii