
I want to get the length of UTF-8 strings in PHP, but I don't have access to the host's cPanel to enable the multibyte string functions. Is there any other way?

Meanwhile, I cannot use the strlen() function, because it returns the wrong length for UTF-8 strings.

kamal

1 Answer


Well, then you have to write it yourself.

UTF-8

In short, UTF-8 is encoded as follows:

  • If the leftmost bit of a byte is 0, the byte is a single-byte character.
  • If the leftmost bit of a byte is 1, the byte is part of a multibyte character.
    • If the byte starts with two or more 1-bits followed by a 0-bit, it is the first byte of the character, and the number of leading 1-bits equals the number of bytes the character occupies.
    • If the byte starts with the bits 10, it is a continuation byte: one of the remaining bytes of a multibyte character.

See here for more info.
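
The rules above can be sketched as a small classifier that looks only at the leading bits of one byte. This is a minimal sketch, not part of the original answer: the function name utf8ByteKind and its return labels are hypothetical, and the type hints assume PHP 7 or later.

```php
<?php
// Hypothetical helper: classify a single byte (0-255) by its leading bits.
function utf8ByteKind(int $byte): string
{
    if (($byte & 0x80) === 0x00) return 'single';       // 0xxxxxxx
    if (($byte & 0xC0) === 0x80) return 'continuation'; // 10xxxxxx
    if (($byte & 0xE0) === 0xC0) return 'lead-2';       // 110xxxxx
    if (($byte & 0xF0) === 0xE0) return 'lead-3';       // 1110xxxx
    if (($byte & 0xF8) === 0xF0) return 'lead-4';       // 11110xxx
    return 'invalid';                                   // not a UTF-8 byte pattern
}
```

Each mask keeps the leading bits that matter and compares them with the expected pattern, so the first byte of a character also tells you how many continuation bytes should follow.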

For example, suppose we have the following string:

Hëllo현World
01001000 ═ H   --> Starts with 0, so it's a single-byte character
11000011 ╦ ë   --> Starts with two 1s followed by 0. Char takes up 2 bytes.
         ║         This byte is the first one of the 2 bytes. The remaining 1
         ║         byte MUST start with 10.
10101011 ╝     --> This is a 'continuation' byte, and MUST start with 10.
                   Well, it does, so it's valid.
01101100 ═ l   --> This byte starts with 0, so it's a normal byte again.
01101100 ═ l
01101111 ═ o
11101101 ╗     --> Starts with three 1-bits. So the character takes up 3 bytes.
         ║         The next 3-1=2 bytes must start with 10
10011000 ╬ 현  --> Continuation byte
10000100 ╝     --> Continuation byte
01010111 ═ W   --> Normal byte
01101111 ═ o
01110010 ═ r
01101100 ═ l
01100100 ═ d
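
To reproduce the bit patterns in the table above yourself, something like the following should work. It is only a sketch: utf8BinaryDump is a hypothetical helper name, and the sample string is written with byte escapes so that the result does not depend on how the source file is saved.

```php
<?php
// Hypothetical helper: return each byte of a string as 8 binary digits.
function utf8BinaryDump(string $str): array
{
    $rows = [];
    $byteCount = strlen($str); // strlen() counts bytes
    for ($i = 0; $i < $byteCount; $i++) {
        $rows[] = sprintf('%08b', ord($str[$i]));
    }
    return $rows;
}

// "\xC3\xAB" is ë and "\xED\x98\x84" is 현 in UTF-8.
foreach (utf8BinaryDump("H\xC3\xABllo\xED\x98\x84World") as $bits) {
    echo $bits, "\n";
}
```

The loop prints one line per byte, in the same order as the rows of the table.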

Code

It is sufficient to count all bytes that do not start with the bits 10. In other words, count a byte unless its value is in the range 128-191 inclusive.

$str = "Hëllo현World";

// ë takes up 2 bytes
// 현 takes up 3 bytes
// In a decent browser you see 11 characters (ten Latin, one Korean)

$len = 0;
$byteCount = strlen($str); // strlen() counts bytes, not characters
for ($i = 0; $i < $byteCount; $i++) {
    $byte = ord($str[$i]);
    // Count every byte except continuation bytes (10xxxxxx, i.e. 128-191)
    if ($byte < 128 || $byte >= 192) {
        $len++;
    }
}

echo "Number of bytes: ".$byteCount."\n";
echo "Number of characters: ".$len;

Here is an online demo.
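
If you need this in more than one place, the same idea can be wrapped in a function. The sketch below is one possible shape, not the only one: utf8Length is a hypothetical name, the bit mask is just another way of writing the 128-191 range test, and it assumes the input is valid UTF-8 (stray continuation bytes are simply not counted). It counts code points, so a letter followed by a separate combining accent counts as two.

```php
<?php
// Hypothetical helper: count the bytes that are not continuation bytes.
function utf8Length(string $str): int
{
    $len = 0;
    $byteCount = strlen($str);
    for ($i = 0; $i < $byteCount; $i++) {
        // (byte & 0xC0) === 0x80 matches the bit pattern 10xxxxxx
        if ((ord($str[$i]) & 0xC0) !== 0x80) {
            $len++;
        }
    }
    return $len;
}

// "\xC3\xAB" is ë and "\xED\x98\x84" is 현 in UTF-8.
echo utf8Length("H\xC3\xABllo\xED\x98\x84World"), "\n"; // prints 11
```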


PS: Is there a reason you don't want to enable multibyte strings?

MC Emperor