
How to iterate a UTF-8 string character by character using indexing?

When you access a UTF-8 string with the bracket operator, e.g. $str[0], you get single bytes, so a multibyte character is spread over 2 or more elements.

For example:

$str = "Kąt";
$str[0] = "K";
$str[1] = "�";
$str[2] = "�";
$str[3] = "t";

but I would like to have:

$str[0] = "K";
$str[1] = "ą";
$str[2] = "t";

It is possible with mb_substr, but this is extremely slow, i.e.:

mb_substr($str, 0, 1) = "K"
mb_substr($str, 1, 1) = "ą"
mb_substr($str, 2, 1) = "t"

Is there another way to iterate the string character by character without using mb_substr?

czuk
  • Define "extremely slow". Did you profile your application and find that these mb_substr calls are really the bottleneck? – Your Common Sense Sep 08 '10 at 09:31
  • After reading your question 2nd time I realized you wanted a way to do it without mb_substr. I have deleted my answer. – Richard Knop Sep 08 '10 at 09:40
  • @Col. Shrapnel: Yes, 50% of the processing time was spent in `mb_substr`. – czuk Sep 08 '10 at 09:50
  • 50% of what processing? Of the whole user request to the web server, from connect to disconnect? I can't believe that. Your whole script is parsed the same way on each request, and nobody ever notices that. What share of the whole request time does your mb parsing take? – Your Common Sense Sep 08 '10 at 10:00
  • The script is run in cli so there are no additional delays. – czuk Sep 08 '10 at 10:05
  • Well, my apologies. Such an extremely rare case when string parsing does matter. – Your Common Sense Sep 08 '10 at 10:27
  • Similar: http://stackoverflow.com/q/3999337/209139. – TRiG May 14 '12 at 10:34
  • I'm surprised no one else suggested this, but if you wanted the fastest solution, and can live with up to 4 x memory overhead for the string, [converting to UTF-32](http://kore-nordmann.de/blog/php_charset_encoding_FAQ.html#how-do-i-iterate-characterwise-over-a-string) will give you fixed-width characters of 4 bytes each - if you need random access to any character in a string, this is probably the most efficient solution, and unless you're processing very large files, the memory overhead is likely acceptable. – mindplay.dk Jun 25 '14 at 10:49
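
The UTF-32 suggestion in the last comment could be sketched roughly as follows. This is only an illustration, assuming valid UTF-8 input and the mbstring extension; the helper names utf8_to_utf32() and utf32_char_at() are made up for this example.

```php
<?php
// Sketch: fixed-width random access via UTF-32.
// Assumes valid UTF-8 input and the mbstring extension.
// Both function names are hypothetical.

function utf8_to_utf32(string $utf8): string
{
    // Each code point becomes exactly 4 bytes (big-endian, no BOM).
    return mb_convert_encoding($utf8, 'UTF-32BE', 'UTF-8');
}

function utf32_char_at(string $utf32, int $index): string
{
    // Character number $index starts at byte offset $index * 4.
    $unit = substr($utf32, $index * 4, 4);
    return mb_convert_encoding($unit, 'UTF-8', 'UTF-32BE');
}

$u32 = utf8_to_utf32('Kąt');
$length = intdiv(strlen($u32), 4);   // number of code points

for ($i = 0; $i < $length; $i++) {
    echo $i, ': ', utf32_char_at($u32, $i), PHP_EOL;
}
```

The conversion itself is a single linear pass; after that, any character can be reached by offset arithmetic, at the cost of 4 bytes per character.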

8 Answers


Use preg_split. With the "u" modifier, it treats the pattern and the subject string as UTF-8.

$chrArray = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
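
A minimal sketch of iterating over the result, with a check for failure: preg_split returns false on error, for example when the subject is not valid UTF-8 under the "u" modifier. The wrapper name split_chars() is hypothetical.

```php
<?php
// Sketch: split a UTF-8 string into characters and detect failure.
// split_chars() is a hypothetical helper name.

function split_chars(string $str): array
{
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        // preg_last_error() reports why PCRE gave up.
        throw new InvalidArgumentException(
            'preg_split failed, PCRE error code ' . preg_last_error()
        );
    }
    return $chars;
}

foreach (split_chars('Kąt') as $i => $char) {
    echo $i, ': ', $char, PHP_EOL;
}
```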
Mark Amery
vartec
  • This is very elegant, but I'm having a hard time imagining this *faster* than `mb_substr()`. – Pekka Sep 08 '10 at 09:40
  • @Pekka It probably is. Using `mb_substr` is quadratic on the length of the string; this is linear even though there's the overhead of building an array. Of course, it takes a lot more memory than your method. – Artefacto Sep 08 '10 at 09:53
  • I've just tested it. For string of length 100 characters, the preg_split is 50% *faster*. – vartec Sep 08 '10 at 09:58
  • Even more, I have tested on more than 1000 'long' documents and it is 40 times faster :-) (see my answer). – czuk Sep 08 '10 at 10:06
  • This solution is OK and I have applied it. – czuk Sep 08 '10 at 10:09
  • I try to avoid regex at all cost in PHP. It's fine if you can guarantee that $str is short, but if it gets too long, the recursion can cause PHP to crash. It can also simply return an error without the user realizing it. `preg_match`, for example, returns 0 if there is no match, and FALSE if there's an error. This can result in some insecure code. I would recommend [this fix](http://stackoverflow.com/a/14366023/793036) (With no `mbstring.func_overload`) or [this fix](http://stackoverflow.com/a/17156392/793036) (With `mbstring.func_overload = 7`) – Andrew Jun 17 '13 at 20:57
  • @Andrew: how big of a string are you talking about? Few million characters? – vartec Jun 18 '13 at 10:36
  • I had a bug that was caused by preg_match crashing trying to match over a string about 11000 characters long. If I had a higher `pcre.recursion_limit`, it would have returned `false` instead of crashing. – Andrew Jun 18 '13 at 15:16
  • @Andrew: and what makes you think above code would need recursion? PCRE uses recursion when [looking for longer matches](http://regexkit.sourceforge.net/Documentation/pcre/pcrestack.html). Except there can be no longer matches for an empty regexp. – vartec Jun 18 '13 at 15:34
  • @vartec: Nothing. I said I just avoid regex in PHP. Besides, the answer below says: _Preg split will fail over very large strings with a memory exception_ I would probably only learn that by getting a bug report from someone two years from now. I just like the other answer better. No big deal. – Andrew Jun 18 '13 at 19:06
  • is there any way to split newline characters as single element to the array? I am always getting it prepended to the chars? – rokdd Dec 01 '13 at 15:02
  • @rokdd you could probably change the RegEx to this `'/[\r\n]?/u'` to have newlines split also. Not sure how that affects RegEx performance without testing though. If it is significantly slower, then maybe consider just `trim()`ing items as you see fit. – Ezekiel Victor Aug 10 '14 at 08:04

preg_split will fail on very large strings with a memory error, and mb_substr is indeed slow, so here is some simple and effective code that you could use:

function nextchar($string, &$pointer){
    if(!isset($string[$pointer])) return false;
    $char = ord($string[$pointer]);
    if($char < 128){
        return $string[$pointer++];
    }else{
        if($char < 224){
            $bytes = 2;
        }elseif($char < 240){
            $bytes = 3;
        }else{
            $bytes = 4;
        }
        $str =  substr($string, $pointer, $bytes);
        $pointer += $bytes;
        return $str;
    }
}

I used this to loop through a multibyte string character by character. If I change it to the code below, the performance difference is huge:

function nextchar($string, &$pointer){
    if(!isset($string[$pointer])) return false;
    return mb_substr($string, $pointer++, 1, 'UTF-8');
}

Looping over a string 10000 times with the code below took 3 seconds with the first version and 13 seconds with the second:

function microtime_float(){
    list($usec, $sec) = explode(' ', microtime());
    return ((float)$usec + (float)$sec);
}

$source = 'árvíztűrő tükörfúrógépárvíztűrő tükörfúrógépárvíztűrő tükörfúrógépárvíztűrő tükörfúrógépárvíztűrő tükörfúrógép';

$t = Array(
    0 => microtime_float()
);

for($i = 0; $i < 10000; $i++){
    $pointer = 0;
    while(($chr = nextchar($source, $pointer)) !== false){
        //echo $chr;
    }
}

$t[] = microtime_float();

echo $t[1] - $t[0].PHP_EOL.PHP_EOL;
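
The lead-byte logic of the first nextchar() can also be wrapped in a generator, so the caller does not have to manage the pointer. This is a sketch under the same assumption of well-formed UTF-8 input; utf8_chars() is a hypothetical name, not part of the original code.

```php
<?php
// Sketch: the same lead-byte logic as a generator.
// Assumes well-formed UTF-8; utf8_chars() is a hypothetical name.

function utf8_chars(string $string): Generator
{
    $length = strlen($string);      // length in bytes
    $pos = 0;
    while ($pos < $length) {
        $lead = ord($string[$pos]);
        if ($lead < 0x80) {
            $bytes = 1;             // ASCII
        } elseif ($lead < 0xE0) {
            $bytes = 2;
        } elseif ($lead < 0xF0) {
            $bytes = 3;
        } else {
            $bytes = 4;
        }
        // The key is the byte offset at which the character starts.
        yield $pos => substr($string, $pos, $bytes);
        $pos += $bytes;
    }
}

foreach (utf8_chars('Kąt') as $offset => $char) {
    echo $offset, ': ', $char, PHP_EOL;
}
```

Because characters are produced one at a time, no array of all characters is built, which matters for very long strings.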
Lajos Mészáros
  • Upvoted and added a change I was required to make due to `mbstring.func_overload` being set to 7 in my environment. – Andrew Jun 17 '13 at 20:44
  • elseif($char = 252){ should probably be elseif($char == 252){ – user23127 May 04 '14 at 10:23
  • That missing "=" is the very reason I have shifted to Yoda notation for comparisons of variables. – Kafoso May 30 '16 at 06:43
  • After reading about what Yoda notation is, I would say that it sounds really useful. Thanks for mentioning it. – Lajos Mészáros May 30 '16 at 15:32
  • About code readability this is also a nice case to use the `switch(true)` "trick". e.g: `switch (true) { case ($char < 224): $bytes=2; break; case ($char < 240): $bytes=3; break; case ($char < 248): $bytes=4; break; case ($char == 252): $bytes=5; break; default: $bytes = 6; break; }` – Yuval A. Jun 09 '17 at 23:38
  • Small update: since 2003, a UTF-8 character can only take up up to 4 bytes. The last elseif/else statements can be removed. – user3389196 Jun 13 '19 at 18:25
  • Updated the code, thank you for spotting that! For reference: https://stackoverflow.com/a/9533324/1806628 – Lajos Mészáros Jun 18 '19 at 09:55

In answer to the comments posted by @Pekka and @Col. Shrapnel, I have compared preg_split with mb_substr.

[Image: timing comparison of preg_split and mb_substr]

The image shows that preg_split took 1.2 s, while mb_substr took almost 25 s.

Here is the code of the functions:

function split_preg($str){
    return preg_split('//u', $str, -1);     
}

function split_mb($str){
    $length = mb_strlen($str);
    $chars = array();
    for ($i=0; $i<$length; $i++){
        $chars[] = mb_substr($str, $i, 1);
    }
    $chars[] = "";
    return $chars;
}
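
A comparison like this could be timed with a small harness such as the sketch below. It is not the code that produced the figures above: time_call() and the two closures are hypothetical, hrtime() needs PHP 7.3 or later, and the numbers printed depend on the machine and PHP version.

```php
<?php
// Sketch of a timing harness. Assumes PHP 7.3+ for hrtime() and the
// mbstring extension. time_call() is a hypothetical helper.

function time_call(callable $fn, string $input, int $runs): float
{
    $start = hrtime(true);                  // monotonic clock, nanoseconds
    for ($i = 0; $i < $runs; $i++) {
        $fn($input);
    }
    return (hrtime(true) - $start) / 1e9;   // elapsed seconds
}

// Split with a single regex pass.
$viaPreg = function (string $s): array {
    return preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
};

// Split with one mb_substr() call per character.
$viaMb = function (string $s): array {
    $chars = [];
    $length = mb_strlen($s, 'UTF-8');
    for ($i = 0; $i < $length; $i++) {
        $chars[] = mb_substr($s, $i, 1, 'UTF-8');
    }
    return $chars;
};

$text = str_repeat('Kąt ', 250);            // 1000 characters

printf("preg_split: %.4f s\n", time_call($viaPreg, $text, 100));
printf("mb_substr:  %.4f s\n", time_call($viaMb, $text, 100));
```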
czuk

Using Lajos Meszaros' wonderful function as inspiration I created a multi-byte string iterator class.

// Multi-Byte String iterator class
class MbStrIterator implements Iterator
{
    private $iPos   = 0;
    private $iSize  = 0;
    private $sStr   = null;

    // Constructor
    public function __construct(/*string*/ $str)
    {
        // Save the string
        $this->sStr     = $str;

        // Calculate the size of the current character
        $this->calculateSize();
    }

    // Calculate size
    private function calculateSize() {

        // If we're done already
        if(!isset($this->sStr[$this->iPos])) {
            return;
        }

        // Get the character at the current position
        $iChar  = ord($this->sStr[$this->iPos]);

        // If it's a single byte, set it to one
        if($iChar < 128) {
            $this->iSize    = 1;
        }

        // Else, it's multi-byte
        else {

            // Figure out how long it is (UTF-8 sequences are at most 4 bytes)
            if($iChar < 224) {
                $this->iSize = 2;
            } else if($iChar < 240){
                $this->iSize = 3;
            } else {
                $this->iSize = 4;
            }
        }
    }

    // Current
    public function current() {

        // If we're done
        if(!isset($this->sStr[$this->iPos])) {
            return false;
        }

        // Else if we have one byte
        else if($this->iSize == 1) {
            return $this->sStr[$this->iPos];
        }

        // Else, it's multi-byte
        else {
            return substr($this->sStr, $this->iPos, $this->iSize);
        }
    }

    // Key
    public function key()
    {
        // Return the current position
        return $this->iPos;
    }

    // Next
    public function next()
    {
        // Increment the position by the current size and then recalculate
        $this->iPos += $this->iSize;
        $this->calculateSize();
    }

    // Rewind
    public function rewind()
    {
        // Reset the position and size
        $this->iPos     = 0;
        $this->calculateSize();
    }

    // Valid
    public function valid()
    {
        // Return if the current position is valid
        return isset($this->sStr[$this->iPos]);
    }
}

It can be used like so

foreach(new MbStrIterator("Kąt") as $c) {
    echo "{$c}\n";
}

Which will output

K
ą
t

Or if you really want to know the position of the start byte as well

foreach(new MbStrIterator("Kąt") as $i => $c) {
    echo "{$i}: {$c}\n";
}

Which will output

0: K
1: ą
3: t
Chris Nasr
  • Very nice class! I just want to point out that max. 4 bytes per character are valid UTF-8 (see: [What is the maximum number of bytes for a UTF-8 encoded character?](https://stackoverflow.com/questions/9533258/what-is-the-maximum-number-of-bytes-for-a-utf-8-encoded-character)). Characters with more bytes should be treated as errors (see: [Are 6 octet UTF-8 sequences valid?](https://stackoverflow.com/questions/3559161/are-6-octet-utf-8-sequences-valid)) – Minding Sep 11 '19 at 13:18

You could parse each byte of the string and determine whether it is a single (ASCII) character or the start of a multi-byte character:

The UTF-8 encoding is variable-width, with each character represented by 1 to 4 bytes. Each byte has 0–4 leading consecutive '1' bits followed by a '0' bit to indicate its type. 2 or more '1' bits indicates the first byte in a sequence of that many bytes.

You would walk through the string and, instead of increasing the position by 1, read the current character in full and then increase the position by that character's length.

The Wikipedia article has the interpretation table for each character [retrieved 2010-10-01]:

   0-127 Single-byte encoding (compatible with US-ASCII)
 128-191 Second, third, or fourth byte of a multi-byte sequence
 192-193 Overlong encoding: start of 2-byte sequence, 
         but would encode a code point ≤ 127
  ........
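
The byte classification could be sketched as a small function. Only the first three ranges come from the excerpt above; the remaining ranges in the code are taken from RFC 3629, not from the quoted table, and the function name utf8_sequence_length() is hypothetical.

```php
<?php
// Sketch: classify one byte value (0-255) by its leading bits.
// utf8_sequence_length() is a hypothetical name. It returns the length of
// the sequence the byte starts, 0 for a continuation byte, or -1 for a
// byte that cannot start a valid sequence.
// Ranges above 193 follow RFC 3629, not the excerpt quoted in the answer.

function utf8_sequence_length(int $byte): int
{
    if ($byte <= 0x7F) { return 1; }   // 0xxxxxxx: single-byte (ASCII)
    if ($byte <= 0xBF) { return 0; }   // 10xxxxxx: continuation byte
    if ($byte <= 0xC1) { return -1; }  // 192-193: overlong 2-byte start
    if ($byte <= 0xDF) { return 2; }   // 110xxxxx: start of 2-byte sequence
    if ($byte <= 0xEF) { return 3; }   // 1110xxxx: start of 3-byte sequence
    if ($byte <= 0xF4) { return 4; }   // 11110xxx: start of 4-byte sequence
    return -1;                         // 245-255: never valid in UTF-8
}

$str = 'Kąt';
for ($pos = 0, $len = strlen($str); $pos < $len; ) {
    $size = utf8_sequence_length(ord($str[$pos]));
    if ($size < 1) {
        throw new UnexpectedValueException("Invalid UTF-8 at byte $pos");
    }
    echo substr($str, $pos, $size), PHP_EOL;
    $pos += $size;                     // advance by the character's length
}
```

Unlike a plain "less than 224, less than 240" chain, this version rejects a stray continuation byte at a character boundary instead of silently treating it as the start of a sequence.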
nhahtdh
Pekka

I had the same issue as OP and I try to avoid regex in PHP since it fails or even crashes with long strings. I used Mészáros Lajos' answer with some changes since I have mbstring.func_overload set to 7.

function nextchar($string, &$pointer, &$asciiPointer){
    if(!isset($string[$asciiPointer])) return false;
    $char = ord($string[$asciiPointer]);
    if($char < 128){
        $pointer++;
        return $string[$asciiPointer++];
    }else{
        // UTF-8 sequences are at most 4 bytes long
        if($char < 224){
            $bytes = 2;
        }elseif($char < 240){
            $bytes = 3;
        }else{
            $bytes = 4;
        }
        $str =  substr($string, $pointer++, 1);
        $asciiPointer+= $bytes;
        return $str;
    }
}

With mbstring.func_overload set to 7, substr actually calls mb_substr. So substr gets the right value in this case. I had to add a second pointer. One keeps track of the multi-byte char in the string, the other keeps track of the single-byte char. The multi-byte value is used for substr (since it's actually mb_substr), while the single-byte value is used for retrieving the byte in this fashion: $string[$index].

Obviously if PHP ever decides to fix the [] access to work properly with multi-byte values, this will fail. But also, this fix wouldn't be needed in the first place.

Andrew

I think the most efficient solution would be to work through the string using mb_substr. In each iteration of the loop, mb_substr would be called twice (to find the next character and the remaining string). It would pass only the remaining string to the next iteration. This way, the main overhead in each iteration would be finding the next character (done twice), which takes only one to five or so operations, depending on the byte length of the character.

If this description is not clear, let me know and I'll provide a working PHP function.
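
One possible reading of the approach described above, as a sketch only: it assumes the mbstring extension and valid UTF-8 input, and the function name each_char_by_peeling() is hypothetical.

```php
<?php
// Sketch: take the first character with mb_substr(), then continue with
// the remaining string. Assumes the mbstring extension.
// each_char_by_peeling() is a hypothetical name.

function each_char_by_peeling(string $str): array
{
    $chars = [];
    $rest = $str;
    while ($rest !== '') {
        $chars[] = mb_substr($rest, 0, 1, 'UTF-8');    // next character
        $rest    = mb_substr($rest, 1, null, 'UTF-8'); // remaining string
    }
    return $chars;
}

print_r(each_char_by_peeling('Kąt'));
```

Note that the second mb_substr() call builds a new string holding the remainder on every iteration, so the total amount of copying still grows with the square of the string length.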

David Spector
  • In my testing, using preg_split is faster than this – Chris Nov 24 '20 at 12:17
  • I think this would be correct. One would use "[.]" as the regular expression representing one character, which I hope means one Unicode character (haven't tested). But of course the fastest way to iterate through a string in PHP is to consider the string just an array of bytes. It is rarely needed to isolate each Unicode character in the string. Note that control characters such as newline are always one byte in UTF-8. Also, token separators are frequently the same as in English, and hence also one byte. UTF-8 is safe for such one-byte recognition. – David Spector Nov 25 '20 at 13:34

Since PHP 7.4, you can use mb_str_split.

https://www.php.net/manual/en/function.mb-str-split.php

$str = 'Kąt';
$chars = mb_str_split($str);
var_dump($chars);

array(3) {
  [0] =>
  string(1) "K"
  [1] =>
  string(2) "ą"
  [2] =>
  string(1) "t"
}
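
For code that also has to run on older PHP versions, a fallback could look like the sketch below. The wrapper name str_to_chars() is hypothetical, and the fallback branch assumes the PCRE extension with UTF-8 support.

```php
<?php
// Sketch: prefer mb_str_split() where it exists (PHP 7.4+ with mbstring),
// otherwise fall back to preg_split(). str_to_chars() is a hypothetical name.

function str_to_chars(string $str): array
{
    if (function_exists('mb_str_split')) {
        return mb_str_split($str, 1, 'UTF-8');
    }
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        // preg_split() returns false on error, e.g. for invalid UTF-8.
        throw new RuntimeException('Could not split string into characters');
    }
    return $chars;
}

foreach (str_to_chars('Kąt') as $char) {
    echo $char, PHP_EOL;
}
```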
Michal Przybylowicz