I have a simple piece of JS code that I can't replicate in PHP when it comes to special characters.

This is the JS code (see JSFiddle for output):

var str = "t↙️"; //char "t" and special characters, emojis, etc..
document.write("Length is: "+str.length); // Length is: 19
for(var i=0; i<str.length; i++) {
  document.write("<br> charCodeAt(" + i + "): " + str.charCodeAt(i));
}

The first problem is that PHP's strlen() and mb_strlen() already give different results from JS (strlen: 39, mb_strlen: 11). However, I managed to get the same length as JS with a custom JS_StringLength function (thanks to this SO answer).

Here is what I have in PHP so far (see phpFiddle for output):

<?php

function JS_StringLength($string) {
    return strlen(iconv('UTF-8', 'UTF-16LE', $string)) / 2;
}

function JS_charCodeAt($str, $index){
    //not working!

    $char = mb_substr($str, $index, 1, 'UTF-8');
    if (mb_check_encoding($char, 'UTF-8'))
    {
        $ret = mb_convert_encoding($char, 'UTF-32BE', 'UTF-8');
        return hexdec(bin2hex($ret));
    } else {
        return null;
    }
}

$str = "t↙️";

echo $str."\n";
//echo "Length is: ".strlen($str)."\n"; //wrong
echo "Length is: ".JS_StringLength($str)."\n"; //OK
for($i=0; $i<JS_StringLength($str); $i++) {
    echo "charCodeAt(".$i."): ".JS_charCodeAt($str, $i)."\n";
}

After a full day of Googling, and trying out everything I found, nothing gave the same results as JS. What should JS_charCodeAt be to get the same output as JS with similar performance?

Experimenting #1:
Enter my string into https://r12a.github.io/app-conversion/ (awesome stuff). Looks like JS works with UTF-16 code units (19) and PHP strlen counts UTF-8 code units (39).
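To make the three different counts concrete, here is a minimal sketch. It assumes PHP 7+ (for the "\u{...}" escape) with the iconv and mbstring extensions, and it uses a hypothetical two-character sample string rather than the string from the question.

```php
<?php
// Hypothetical sample: "t" followed by U+1F618, an emoji outside the BMP.
$sample = "t\u{1F618}";

// UTF-8 code units (bytes): 1 for "t" plus 4 for the emoji.
$utf8Bytes = strlen($sample);

// Unicode code points: one per character.
$codePoints = mb_strlen($sample, 'UTF-8');

// UTF-16 code units, which is what str.length counts in JS:
// 1 for "t" plus a surrogate pair (2 units) for the emoji.
$utf16CodeUnits = intdiv(strlen(iconv('UTF-8', 'UTF-16LE', $sample)), 2);

echo "strlen: $utf8Bytes, mb_strlen: $codePoints, UTF-16 units: $utf16CodeUnits\n";
```

For this sample the three values are 5, 2 and 3, which mirrors the 39 / 11 / 19 split seen with the longer string.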

Experimenting #2:
When using json_encode() on my string, the result is, of course, very close to what JavaScript uses. I even examined the original PHP source code of json_encode and how it escapes strings, but.. well..


Before flagging as a duplicate, please make sure you test a solution with the string in the above examples (or random emojis), as ALL the charCodeAt implementations I found here on Stack Overflow work with most special characters, but NOT with emojis.

frzsombor
  • http://stackoverflow.com/questions/10333098/utf-8-safe-equivelant-of-ord-or-charcodeat-in-php – yuvaraj bathrabagu Nov 28 '16 at 09:43
  • Possible duplicate of [How to convert javascript to PHP?](http://stackoverflow.com/questions/31802180/how-to-convert-javascript-to-php) – Ima Nov 28 '16 at 09:44
  • @yuvarajbathrabagu: Thanks, but I already tried the answers on that question. Unfortunately none of them worked. – frzsombor Nov 28 '16 at 09:54
  • @Ima: All the charCodeAt implementations I've (already) found there are working with most of the special characters, but not with emojis. Please check it for yourself. – frzsombor Nov 28 '16 at 10:02
  • http://php.net/manual/en/function.mb-strlen.php – FDisk Nov 28 '16 at 22:52
  • @FDisk I might miss something, but I couldn't use mb_strlen(), because of the following test: http://sandbox.onlinephpfunctions.com/code/56ca517836d9b5c2bd895dc7d72c40067d78084a – frzsombor Nov 28 '16 at 23:14
  • @frzsombor are you sure that it should be 19 symbols? What if you convert them before checking the length? `mb_strlen(iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $str));` – FDisk Nov 29 '16 at 09:25
  • @FDisk I must treat it as 19 characters, because I have to get the same result as JS. And str.length returns 19 in JavaScript for this string. – frzsombor Nov 29 '16 at 12:06

3 Answers

[UPDATE: See a better solution in the accepted answer]

Ok, so after almost two days, I think I've found an answer myself. The basic idea is that json_encode() escapes non-ASCII Unicode characters in the same form that JS uses them (e.g. 😘 = "\ud83d\ude18") for character counting, for the charCodeAt function, etc. So if we JSON-encode the string, we can extract an array of plain characters and escaped \uXXXX units. This way, we can easily count the characters of the original string as UTF-16 code units (just like JS does). And of course, we can return the "charCodeAt" values (ord() on plain characters, and the \uXXXX hex converted to decimal on escaped ones).

Problem: If I want to get the "JS charCodeAt" value for every character in a for loop (so basically convert a string to charcode list), this code will be slow on long texts, because preg_match_all in getUTF16CodeUnits will run once for every single character.
Workaround: Instead of calling getUTF16CodeUnits every time, store the matches array in a variable, and work with that. More details: FASTER VERSION (backup)

Code and demo:

<?php

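function getUTF16CodeUnits($string) {
    // json_encode() escapes non-ASCII characters as \uXXXX UTF-16 code units;
    // substr() strips the surrounding double quotes.
    // Caveat: json_encode() also backslash-escapes ", \, / and control
    // characters (e.g. \n), so those characters are split into two units here.
    $string = substr(json_encode($string), 1, -1);
    preg_match_all("/\\\\u[0-9a-fA-F]{4}|./mi", $string, $matches);
    return $matches[0];
}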

function JS_StringLength($string) {
    return count(getUTF16CodeUnits($string));
}

function JS_charCodeAt($string, $index) {
    $utf16CodeUnits = getUTF16CodeUnits($string);
    $unit = $utf16CodeUnits[$index];
    
    if(strlen($unit) > 1) {
        $hex = substr($unit, 2);
        return hexdec($hex);
    }
    else {
        return ord($unit);
    }
}

$str = "t↙️";

echo "Length is: ".JS_StringLength($str)."\n";
for($i=0; $i<JS_StringLength($str); $i++) {
    echo "charCodeAt(".$i."): ".JS_charCodeAt($str, $i)."\n";
}

Improvements, fixes, comments are highly appreciated!

frzsombor

The way that JS handles UTF-16 is not ideal; charCodeAt is picking out code units for you, including surrogates in the emoji cases. If you want the real code point for each character, String.prototype.codePointAt() would be a better choice. That said, since your use case wasn't explained, this achieves what you were originally asking for without the need for JSON-related functions:

<?php

$original = 't↙️';
$converted = iconv('UTF-8', 'UTF-16LE', $original);

for ($i = 0; $i < iconv_strlen($converted, 'UTF-16LE'); $i++) {
    $character = iconv_substr($converted, $i, 1, 'UTF-16LE');
    $codeUnits = unpack('v*', $character);

    foreach ($codeUnits as $codeUnit) {
        echo $codeUnit . PHP_EOL;
    }
}

This converts the (assumed) UTF-8 string into UTF-16LE, then loops over each character. In UTF-16, each character is 2 or 4 bytes in size. unpack() with the repeating v* format returns one value in the former case and two in the latter (v is the unsigned 16-bit little-endian short format).

It could also be implemented by looping over the UTF-8 and converting each character one-by-one; it doesn't make a great deal of difference though. Also the same could be achieved with the mb_* functions.
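As a rough illustration of the per-character variant with the mb_* functions, here is a minimal sketch. It assumes PHP 7.4+ (for mb_str_split) with the mbstring extension, and the function name utf16CodeUnitsPerCharacter is hypothetical, not part of any library.

```php
<?php
// Hypothetical helper: walk the UTF-8 string one character at a time and
// collect the UTF-16 code units of each character.
function utf16CodeUnitsPerCharacter(string $utf8): array
{
    $units = [];
    foreach (mb_str_split($utf8, 1, 'UTF-8') as $char) {
        // Each character becomes 2 bytes (BMP) or 4 bytes (surrogate pair).
        $utf16 = mb_convert_encoding($char, 'UTF-16LE', 'UTF-8');
        // 'v*' reads every unsigned 16-bit little-endian value in the string.
        foreach (unpack('v*', $utf16) as $unit) {
            $units[] = $unit;
        }
    }
    return $units;
}

// "t" followed by U+1F618, whose surrogate pair is 0xD83D 0xDE18.
print_r(utf16CodeUnitsPerCharacter("t\u{1F618}"));
```

Converting character by character makes many small conversion calls, so it is unlikely to beat converting the whole string once.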


Edit

Since you've inquired about a quicker way of doing this, combining the above with the solution offered by nwellnhof gives better performance:

<?php

$original = 't↙️';
$converted = iconv('UTF-8', 'UTF-16LE', $original);

for ($i = 0; $i < strlen($converted); $i += 2) {
        $codeUnit = ord($converted[$i]) + (ord($converted[$i+1]) << 8);
        echo $codeUnit . PHP_EOL;
}

First off, this converts the UTF-8 string into UTF-16LE. We're interested in writing out UTF-16 code units (as per the behaviour of charCodeAt()), and each of these is 16 bits wide. The loop simply jumps 2 bytes at a time. On each iteration, it takes the numeric value of the byte at that position and adds the value of the next byte, left-shifted by 8. The second byte is the one that gets shifted because we're dealing with little-endian UTF-16.

By way of example, consider the character BENGALI DIGIT ONE (১). This is represented by a single UTF-16 code unit, 2535. It is easier to first describe how this is encoded as UTF-16BE. The single code unit for this character would consume 16 bits:

0000100111100111 (2535)

In PHP, strings are effectively byte arrays. So, PHP sees this as:

$converted[0] = 00001001 (9)
$converted[1] = 11100111 (231)

Given the 2 above bytes, how do we obtain the code unit? What we really want to do is something like:

   0000100100000000 (2304)
+          11100111 (231)
=  0000100111100111 (2535)

But we can't do that, since we only have single bytes to play with. One way to deal with this is to use integers instead, giving us a full 64 bits (8 bytes) on a 64-bit build of PHP. We want to represent the code unit in integer form anyway, so that seems like a reasonable route. We can obtain the numeric value of each byte via ord():

ord($converted[0]) == 0000000000000000000000000000000000000000000000000000000000001001 == 9
ord($converted[1]) == 0000000000000000000000000000000000000000000000000000000011100111 == 231

And left shift the first value by 8:

   0000000000000000000000000000000000000000000000000000000000001001 (9) 
<< 0000000000000000000000000000000000000000000000000000000000001000 (8)
=  0000000000000000000000000000000000000000000000000000100100000000 (2304)

And then sum together, as before:

   0000000000000000000000000000000000000000000000000000100100000000 (2304)
+  0000000000000000000000000000000000000000000000000000000011100111 (231)
=  0000000000000000000000000000000000000000000000000000100111100111 (2535)

So we now have the correct code unit value of 2535. The only difference with UTF-16LE is the order of the bytes is reversed. So instead of left shifting the first byte by 8, we need to left shift the second byte.
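The walk-through above can be checked with a short sketch. It assumes PHP 7+ (for the "\u{...}" escape) and the iconv extension; the variable names are only for illustration.

```php
<?php
// BENGALI DIGIT ONE (U+09E7), a single UTF-16 code unit with value 2535.
$char = "\u{09E7}";

$be = iconv('UTF-8', 'UTF-16BE', $char); // bytes 0x09, 0xE7
$le = iconv('UTF-8', 'UTF-16LE', $char); // bytes 0xE7, 0x09

// Big endian: the first byte is the high byte, so it gets shifted.
$fromBe = (ord($be[0]) << 8) + ord($be[1]);

// Little endian: the second byte is the high byte, so it gets shifted.
$fromLe = ord($le[0]) + (ord($le[1]) << 8);

echo $fromBe . PHP_EOL;
echo $fromLe . PHP_EOL;
```

Both lines print 2535; only the position of the shifted byte changes with the byte order.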

P.S: An equivalent way of performing this step would be to do

for ($i = 0; $i < strlen($converted); $i += 2) {
        // unpack() returns an array indexed from 1, so take element 1
        $codeUnit = unpack('v', $converted[$i] . $converted[$i+1])[1];
        echo $codeUnit . PHP_EOL;
}

The unpack function does exactly what was just described when the v format is supplied, which tells it to expect 16 bits arranged in little-endian order. It may be worth benchmarking the two if you're interested in optimising for speed.
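For anyone who wants to run that comparison, here is a rough benchmarking sketch. It assumes PHP 7.3+ (for hrtime) and the iconv extension. The function names and the sample string are hypothetical, and the unpack variant here decodes the whole string in one 'v*' call instead of two bytes at a time, which is a variation on the loop above rather than the same code.

```php
<?php
// Variant 1: combine each byte pair by hand, as in the ord() loop above.
function codeUnitsViaOrd(string $utf16le): array
{
    $units = [];
    $length = strlen($utf16le); // computed once, outside the loop
    for ($i = 0; $i < $length; $i += 2) {
        $units[] = ord($utf16le[$i]) + (ord($utf16le[$i + 1]) << 8);
    }
    return $units;
}

// Variant 2: let unpack() decode every 16-bit little-endian value at once.
function codeUnitsViaUnpack(string $utf16le): array
{
    // unpack() keys start at 1; array_values() re-indexes from 0.
    return array_values(unpack('v*', $utf16le));
}

// Hypothetical sample: 10,000 repetitions of "t" plus an emoji (3 units each).
$sample = iconv('UTF-8', 'UTF-16LE', str_repeat("t\u{1F618}", 10000));

foreach (['codeUnitsViaOrd', 'codeUnitsViaUnpack'] as $fn) {
    $start = hrtime(true);
    $units = $fn($sample);
    $elapsedMs = (hrtime(true) - $start) / 1e6;
    printf("%s: %d units in %.2f ms\n", $fn, count($units), $elapsedMs);
}
```

Timings depend on the PHP version and machine, so treat a single run as a hint only and repeat the measurement before drawing conclusions.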

nj_
  • Hi! Sorry for the late answer! I was about to accept this as the answer, but somehow this code is slower than my answer on long strings. Can you tell me why, and do you have an idea how to speed it up? – frzsombor Apr 20 '17 at 17:17
  • Unfortunately I just realised that the last (asian 芳) character gives a different result in JS and with this PHP code. (I just reviewed this question again) – frzsombor Jun 21 '18 at 12:12
  • It looks like things went a bit awry when copying things around. The character in your original question is different to the one I had in my answer. In your question it was `` (https://codepoints.net/U+2F994), in my answer `芳` (https://codepoints.net/U+82B3). I've updated my answer. – nj_ Jun 22 '18 at 22:06
  • Thank you very much for updating your answer after such a long time! Before accepting as an answer, I will do some more tests, but it seems like the updated version works pretty well and it's also very good in performance! In the meantime, can you please explain the loop? (just to make sure I understand what I'm doing, not just copy pasting) :) – frzsombor Jun 25 '18 at 08:35
  • Thanks again, you are awesome!! – frzsombor Jun 27 '18 at 08:03

If you really want an equivalent of JavaScript's charCodeAt method, try:

function JS_charCodeAt($str, $index) {
    $utf16 = mb_convert_encoding($str, 'UTF-16LE', 'UTF-8');
    return ord($utf16[$index*2]) + (ord($utf16[$index*2+1]) << 8);
}

But charCodeAt is problematic and should be replaced with codePointAt. Most JavaScript code dealing with characters in the supplementary Unicode planes like Emojis and using charCodeAt is probably wrong. You can find code emulating codePointAt in the answers to the question UTF-8 safe equivalent of ord or charCodeAt() in PHP.
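For comparison, a codePointAt-style lookup can be sketched on top of the same UTF-16 conversion. This is only an illustration and not the code from the linked question: it assumes PHP 7.1+ with the mbstring extension, the name JS_codePointAt is hypothetical, and it returns null where JS would return undefined.

```php
<?php
// Hypothetical sketch of JS codePointAt(): $index counts UTF-16 code units.
function JS_codePointAt(string $str, int $index): ?int
{
    $utf16 = mb_convert_encoding($str, 'UTF-16LE', 'UTF-8');
    if ($utf16 === '') {
        return null;
    }
    // All code units, re-indexed from 0 (unpack() keys start at 1).
    $units = array_values(unpack('v*', $utf16));
    if ($index < 0 || $index >= count($units)) {
        return null; // JS returns undefined for out-of-range indexes
    }

    $first = $units[$index];
    // A high surrogate followed by a low surrogate encodes one code point.
    if ($first >= 0xD800 && $first <= 0xDBFF && isset($units[$index + 1])) {
        $second = $units[$index + 1];
        if ($second >= 0xDC00 && $second <= 0xDFFF) {
            return 0x10000 + (($first - 0xD800) << 10) + ($second - 0xDC00);
        }
    }
    // BMP character or an unpaired surrogate: return the unit itself.
    return $first;
}

echo dechex(JS_codePointAt("t\u{1F618}", 1)) . PHP_EOL;
```

Like the JS method, this returns the lone low surrogate when the index points into the middle of a pair. It converts the whole string on every call, so for loops it would be better to convert once and reuse the unit array.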

nwellnhof