I have a problem with the Telegram Bot API. I'm trying to extract a URL from a message. The documentation for the MessageEntity type says that offset and length are specified in UTF-16 code units. I've tried many ways to get the substring from the text (mb_convert_encoding, iconv, json_encode, etc.), but I can't get the correct link. It works for plain text without emoji, but not when the text contains emoji.

1 Answer

$output = json_decode(file_get_contents('php://input'), true);
$message = $output['message']['text'] ?? '';
$entities = $output['message']['entities'] ?? [];

function getURLs($message, $entities) {

    $URLs = [];

    // Telegram offsets and lengths count UTF-16 code units, so work on a UTF-16 copy.
    // UTF-16LE is named explicitly: no BOM is prepended, so 1 code unit = 2 bytes from byte 0.
    //$message_encode = iconv('UTF-8', 'UTF-16LE', $message);
    $message_encode = mb_convert_encoding($message, 'UTF-16LE', 'UTF-8');

    foreach ($entities as $entity) {

        // text_link entities carry the target in their 'url' field
        if (!empty($entity['url'])) {
            $URLs[] = $entity['url'];
        }

        // url entities: the link is the entity's slice of the message text
        if ($entity['type'] == 'url') {
            $URL16 = substr($message_encode, $entity['offset'] * 2, $entity['length'] * 2);

            //$URLs[] = iconv('UTF-16LE', 'UTF-8', $URL16);
            $URLs[] = mb_convert_encoding($URL16, 'UTF-8', 'UTF-16LE');
        }

    }

    return $URLs;

}

$URLs = getURLs($message, $entities);

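You can use either iconv or mb_convert_encoding. The key point is to convert the text to UTF-16, where every code unit is exactly 2 bytes, and then cut with substr using offset*2 and length*2. Naming the byte order explicitly (UTF-16LE) is the safer choice, because with plain UTF-16 some iconv implementations may prepend a byte order mark, which shifts every offset by one code unit. See also PHP - length of string containing emojis/special chars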
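
To see why the conversion matters, here is a minimal sketch with a made-up message. The helper name utf16Substr, the sample text, and the offset/length values are assumptions for illustration only, not something the Bot API provides, and the sketch assumes the mbstring extension is loaded. The emoji lies outside the Basic Multilingual Plane, so it occupies two UTF-16 code units but only one code point, which is why code-point-based functions such as mb_substr drift by one position per emoji.

```php
<?php
// Hypothetical helper: cut a substring using UTF-16 code-unit offsets,
// which is how Telegram's MessageEntity expresses offset and length.
function utf16Substr(string $text, int $offset, int $length): string
{
    // UTF-16LE: no BOM, exactly 2 bytes per code unit.
    $utf16 = mb_convert_encoding($text, 'UTF-16LE', 'UTF-8');
    $slice = substr($utf16, $offset * 2, $length * 2);
    return mb_convert_encoding($slice, 'UTF-8', 'UTF-16LE');
}

// Sample text (an assumption): U+1F600 takes 2 UTF-16 code units.
$text = "\u{1F600} see https://example.com now";

// Offsets as Telegram would count them:
// emoji (2) + space (1) + "see" (3) + space (1) = 7; the URL is 19 units long.
$url = utf16Substr($text, 7, 19);

// Counting code points instead starts one character too late.
$wrong = mb_substr($text, 7, 19, 'UTF-8');

echo $url, "\n"; // https://example.com
```

Each additional emoji before the entity adds another code unit of drift to the code-point-based result, which matches the symptom in the question: plain text works, text with emoji does not.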

Serhii