6

We have a bunch of surrogate pair (or 2-byte UTF-8?) characters such as &#55357;&#56911;, which is the prayer hands emoji stored in UTF-8 as 2 characters. When rendered in a browser, this string renders as two ??

example:

I need to convert those to the prayer hands emoji using PHP, but I simply cannot find a combination of iconv, utf8_decode, html_entity_decode, etc. to pull it off.

This site converts the &#55357;&#56911; properly:

http://www.convertstring.com/EncodeDecode/HtmlDecode

Paste in there the following string

Please join me in this prayer. &#55357;&#56911;❤️

You will notice the surrogate pair (&#55357;&#56911;) converts to the prayer hands emoji.

This site claims to use HTMLDecode, but I cannot find anything inside PHP to pull this off. I have tried iconv, html_entity_decode, and a few public libraries.

I admit I am no expert when it comes to converting between character encodings!

Tyler F
  • Something interesting is that the pair ```❤️``` does render properly in HTML. Could be helpful. – Tyler F Nov 08 '17 at 19:38
  • Turns out this is not a pair; UTF-8 actually needs 4 bytes to store it: `F0 9F 99 8F`. As per the [UTF-8 definition](https://en.wikipedia.org/wiki/UTF-8), it should convert to `&#x1F64F;`, or `&#128591;` if you wish to use decimals, and when I test it, it just works. If you are storing this in a MySQL database you need to specify the charset as `utf8mb4`, and not just `utf8`, or it will cause corruption such as this. – Havenard Jul 19 '18 at 00:49
  • Also [jsfiddle](http://jsfiddle.net/ng30j8ua/3/) seems to disagree with the encoding conversion provided by this website you are using. – Havenard Jul 19 '18 at 00:56
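
For reference, the arithmetic behind the comment above can be sketched as follows. This is only an illustration: the function names `surrogatePairToCodePoint` and `codePointToUtf8` are made up for this example, and it assumes the two numbers really are a UTF-16 high/low surrogate pair.

```php
<?php
// Hypothetical helpers for illustration; not part of PHP or any library.

// Combine a UTF-16 high/low surrogate pair into a single Unicode code point.
// Assumes $high is in 0xD800..0xDBFF and $low is in 0xDC00..0xDFFF.
function surrogatePairToCodePoint(int $high, int $low): int
{
    return 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
}

// Encode a supplementary-plane code point (U+10000..U+10FFFF) as 4 UTF-8 bytes.
function codePointToUtf8(int $cp): string
{
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

$cp = surrogatePairToCodePoint(55357, 56911);
printf("U+%X (decimal %d)\n", $cp, $cp);               // U+1F64F (decimal 128591)
echo strtoupper(bin2hex(codePointToUtf8($cp))), "\n";  // F09F998F
```

For the pair in the question this gives U+1F64F and the bytes `F0 9F 99 8F` quoted in the comment.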

2 Answers

3

I was not able to find a function to do this, but this works:

$str = "Please join me in this prayer. &#55357;&#56911;❤️";
$newStr = preg_replace_callback("/&#.....;&#.....;/", function($matches){return convertToEmoji($matches);}, $str);
print_r($newStr);
function convertToEmoji($matches){
    $newStr = $matches[0];
    $newStr = str_replace("&#", '', $newStr);
    $newStr = str_replace(";", '##', $newStr);
    $myEmoji = explode("##", $newStr);
    $newStr = dechex($myEmoji[0]) . dechex($myEmoji[1]);
    $newStr = hex2bin($newStr);
    return iconv("UTF-16BE", "UTF-8", $newStr);
}
Thomas Orlita
Tyler F
  • Works, but hurts my eyes :) $string = preg_replace_callback("/&#([0-9]{5});&#([0-9]{5});/", function($matches) { return iconv("UTF-16BE", "UTF-8", hex2bin(dechex($matches[1]) . dechex($matches[2]))); }, $string); – Peter Knut Aug 16 '19 at 00:01
  • @tyler Thank you so much for this! Question though: what is the purpose of `$newStr = str_replace(";", '##', $newStr); $myEmoji = explode("##", $newStr);` when you could just do `$myEmoji = explode(";", $newStr);`? Also, just wanted to add that I was able to use `/(&#\d{5};){2}/` for my regex instead of `/&#.....;&#.....;/` in case that helps anyone else. – dynamiccookies Mar 25 '20 at 02:45
  • To anyone looking for this solution in another programming language, here's another post [(https://stackoverflow.com/q/48142634/4013327)](https://stackoverflow.com/q/48142634/4013327) with solutions in [R](https://stackoverflow.com/a/48154559/4013327), [JavaScript](https://stackoverflow.com/a/53711213/4013327), and [Go](https://stackoverflow.com/a/58262128/4013327). Plus, an excellent [detailed algorithm](https://stackoverflow.com/a/48143046/4013327) that can be used to build solutions in other languages. I was fortunate enough to have found this PHP solution through a link on the other post. – dynamiccookies Mar 25 '20 at 02:59
2

I'd like to take a moment to clean up TylerF's working code.

Code: (3v4l.org Demo)

$str = "Please join me in this prayer. &#55357;&#56911;❤️";
echo preg_replace_callback(
         "/&#(\d{5});&#(\d{5});/",
         function($m) {
             return iconv("UTF-16BE", "UTF-8", hex2bin(dechex($m[1]) . dechex($m[2])));
         },
         $str
     );

Original Output:

Please join me in this prayer. 🙏❤️

Current Output:

Warning: iconv(): Wrong encoding, conversion from "UTF-16BE" to "UTF-8" is not allowed
  • Changed dots to digit character matching and employed capture groups to simplify subsequent processes.
  • No more str_replace() or explode() calls in the custom function.
  • No single-use variable declarations.

Same technique with PHP7.4 arrow function syntax (Sandbox demo that actually works):

$str = "Please join me in this prayer. &#55357;&#56911;❤️";
var_export(
    preg_replace_callback(
        "/&#(\d{5});&#(\d{5});/",
        fn($m) => iconv("UTF-16BE", "UTF-8", hex2bin(dechex($m[1]) . dechex($m[2]))),
        $str
    )
);
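
If iconv() refuses "UTF-16BE" in your environment, as in the warning shown earlier, one possible workaround is to skip iconv entirely: merge each surrogate pair into a single numeric entity and let html_entity_decode() do the rest. This is a rough sketch rather than a drop-in replacement. The function name `decodeSurrogateEntities` is invented for the example, and it assumes the entities are always written in decimal, as in the question.

```php
<?php
// Hypothetical helper for illustration; assumes decimal entities only.
function decodeSurrogateEntities(string $text): string
{
    // Decimal ranges of UTF-16 surrogates, spelled out as regex alternations.
    $high = '5529[6-9]|55[3-9]\d\d|56[0-2]\d\d|563[01]\d';             // 55296..56319
    $low  = '563[2-9]\d|56[4-9]\d\d|57[0-2]\d\d|573[0-3]\d|5734[0-3]'; // 56320..57343

    // Step 1: rewrite each high/low pair as one entity for the real code point.
    $merged = preg_replace_callback(
        "/&#($high);&#($low);/",
        function (array $m): string {
            $codePoint = 0x10000 + (((int) $m[1] - 0xD800) << 10) + ((int) $m[2] - 0xDC00);
            return '&#' . $codePoint . ';';
        },
        $text
    );

    // Step 2: decode the remaining (now valid) entities to UTF-8.
    return html_entity_decode($merged, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

echo decodeSurrogateEntities("Please join me in this prayer. &#55357;&#56911;❤️");
```

Matching only the surrogate ranges, instead of any two five-digit entities, keeps ordinary entities from being swallowed as the first half of a pair.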
mickmackusa