Split utf8 string into array of chars

Question

I'm trying to split a UTF-8 encoded string into an array of chars. The function I use used to work, but for some reason it doesn't work anymore. What could be the reason? And better yet, how can I fix it?

This is my string:

Zelf heb ik maar één vraag: wie ben jij?

This is my function:

function utf8Split($str, $len = 1)
{
  $arr = array();
  $strLen = mb_strlen($str);
  for ($i = 0; $i < $strLen; $i++)
  {
    $arr[] = mb_substr($str, $i, $len);
  }
  return $arr;
}

This is the result:

Array
(
    [0] => Z
    [1] => e
    [2] => l
    [3] => f
    [4] =>  
    [5] => h
    [6] => e
    [7] => b
    [8] =>  
    [9] => i
    [10] => k
    [11] =>  
    [12] => m
    [13] => a
    [14] => a
    [15] => r
    [16] =>  
    [17] => e
    [18] => ́
    [19] => e
    [20] => ́
    [21] => n
    [22] =>  
    [23] => v
    [24] => r
    [25] => a
    [26] => a
    [27] => g
    [28] => :
    [29] =>  
    [30] => w
    [31] => i
    [32] => e
    [33] =>  
    [34] => b
    [35] => e
    [36] => n
    [37] =>  
    [38] => j
    [39] => i
    [40] => j
    [41] => ?
)

Define "doesn't work". What is it doing that it's not supposed to be doing and/or what is it not doing that it's supposed to be doing? — , Feb 24 '12 at 21:20

score 16 · Answer 1 · edited Jan 05 '23 at 16:05

I think this is the best solution. I found it in the PHP manual pages:

preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);

It is fast: in PHP 5.6.18 it split a 6 MB text file in a matter of seconds.

Best of all, it doesn't need multibyte (mb_) support!
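
As a minimal sketch of the call above applied to the string from the question: the sample string is written here with precomposed é (U+00E9), which is an assumption, since the asker's actual input turned out to be encoded differently.

```php
<?php
// Sample string from the question, written with precomposed é (U+00E9).
// That spelling is an assumption made for this illustration.
$str = "Zelf heb ik maar \u{00E9}\u{00E9}n vraag: wie ben jij?";

// An empty pattern with the u modifier matches between every code point;
// PREG_SPLIT_NO_EMPTY drops the empty strings at both ends.
$chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);

echo count($chars), "\n"; // number of code points in the string
echo $chars[17], "\n";    // the first é
```

Note that with the u modifier, preg_split returns false if the input is not valid UTF-8, so checking the return value is worthwhile before iterating.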

Similar answer also here.

edited Jan 05 '23 at 16:05

François

answered May 12 '16 at 16:02

Yani2000

score 12 · Answer 2 · answered Feb 24 '12 at 21:26

For the mb_... functions you should specify the charset encoding.

In your example code these are especially the following two lines:

$strLen = mb_strlen($str, 'UTF-8');
$arr[] = mb_substr($str, $i, $len, 'UTF-8');

The full picture:

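function utf8Split($str, $len = 1)
{
  $arr = array();
  // Pass the encoding explicitly instead of relying on the global default.
  $strLen = mb_strlen($str, 'UTF-8');
  // Step by $len so that chunks do not overlap when $len > 1.
  for ($i = 0; $i < $strLen; $i += $len)
  {
    $arr[] = mb_substr($str, $i, $len, 'UTF-8');
  }
  return $arr;
}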

This matters because you're using UTF-8 here. However, if the input is not properly encoded, this won't work "any longer", simply because the function has not been designed for anything else.

Alternatively, you can process UTF-8 encoded strings with PCRE regular expressions. For example, this returns what you're looking for in less code:

$str = 'Zelf heb ik maar één vraag: wie ben jij?';

$chars = preg_split('/(?!^)(?=.)/u', $str);

Besides preg_split there is also mb_split.

I specify the encoding globally with: mb_internal_encoding('UTF-8'); — tersmitten, Feb 25 '12 at 07:37
That should set it (though it also sets the HTTP input and output encoding). You could analyze the string (e.g. [with a hexdump](http://stackoverflow.com/questions/1057572/how-can-i-get-a-hex-dump-of-a-string-in-php)) and check its encoding firsthand. I suspect either the encoding setting is not right or the charset encoding of the string is something other than UTF-8. — hakre, Feb 25 '12 at 13:50
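
The hexdump check suggested in the comment above could look roughly like this. It is only a sketch: the two spellings of "é" are illustrative inputs chosen to show how strings that look identical on screen can differ at the byte level.

```php
<?php
// Two spellings of "é" that render the same (illustrative inputs):
$precomposed = "\u{00E9}";  // single code point U+00E9
$decomposed  = "e\u{0301}"; // "e" followed by U+0301 COMBINING ACUTE ACCENT

// bin2hex shows the raw UTF-8 bytes of each string.
echo bin2hex($precomposed), "\n"; // c3a9
echo bin2hex($decomposed), "\n";  // 65cc81

// Both are valid UTF-8, but they contain a different number of code points.
var_dump(mb_check_encoding($decomposed, 'UTF-8')); // bool(true)
echo mb_strlen($precomposed, 'UTF-8'), "\n";       // 1
echo mb_strlen($decomposed, 'UTF-8'), "\n";        // 2
```

A string can therefore pass an encoding check and still split into more pieces than expected.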

score 4 · Answer 3 · answered Mar 23 '12 at 15:04

If you are not sure whether the mbstring function library is available, use one of these:

Version 1:

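function utf8_str_split($str='',$len=1){
    // u: treat the subject as UTF-8; s: let "." match newlines so they are not dropped.
    preg_match_all("/./us", $str, $arr);
    // Group the characters into chunks of $len, then join each chunk into a string.
    $arr = array_chunk($arr[0], $len);
    $arr = array_map('implode', $arr);
    return $arr;
}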

Version 2:

function utf8_str_split($str='',$len=1){
    return preg_split('/(?<=\G.{'.$len.'})/u', $str,-1,PREG_SPLIT_NO_EMPTY);
}

Both functions were tested in PHP 5.
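
To illustrate what the $len parameter does, here is a small self-contained sketch in the spirit of Version 1. The name chunk_utf8 is hypothetical, and the s modifier is my addition so that newlines are kept.

```php
<?php
// chunk_utf8 is a hypothetical name used only for this illustration.
function chunk_utf8(string $str, int $len = 1): array
{
    // u treats the subject as UTF-8; s lets "." match newlines as well.
    preg_match_all('/./us', $str, $m);
    // Group the matched characters into runs of at most $len.
    $chunks = array_chunk($m[0] ?? [], max(1, $len));
    // Join each group of characters back into one string.
    return array_map(function (array $chunk): string {
        return implode('', $chunk);
    }, $chunks);
}

print_r(chunk_utf8("\u{00E9}\u{00E9}n vraag", 2));
```

With a length of 2, the nine characters of "één vraag" come back as five strings, the last one holding the single leftover character.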

score 3 · Answer 4 · answered Feb 24 '12 at 21:22

There is a multibyte split function in PHP, mb_split.

answered Feb 24 '12 at 21:22

bfavaretto

Be certain to set the `mb_regex_encoding()`, too! – Anthony Rutledge Dec 07 '19 at 17:49
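
For reference, a minimal sketch of mb_split together with mb_regex_encoding(). mb_split splits on a separator pattern, so this example splits the sentence into words. It is meant to show the call shape, not a per-character split.

```php
<?php
// Tell the mbregex functions that patterns and subjects are UTF-8.
mb_regex_encoding('UTF-8');

// mb_split takes a bare pattern (no delimiters) and splits on its matches.
$words = mb_split('\s+', "\u{00E9}\u{00E9}n vraag: wie ben jij?");

print_r($words);
```

This needs the mbstring extension built with regex support, which is an assumption about the environment.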

score 1 · Accepted Answer · answered Mar 06 '12 at 08:56

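I found out the é was not the character I expected. Apparently there is a difference between né and ńe: as the output in the question shows, each é was stored as a plain e followed by a separate combining acute accent, not as a single character. I got it working by normalizing the string first.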
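
The answer does not say how the string was normalized. One plausible route, sketched here under that assumption, is NFC normalization with the intl extension's Normalizer class. An alternative that needs no normalization is to split on grapheme clusters with \X.

```php
<?php
// Decomposed input as in the question: "e" + U+0301 instead of a single "é".
$str = "e\u{0301}e\u{0301}n";

// Option A (assumes the intl extension is loaded): compose to NFC first,
// then split by code point.
$chars = null;
if (class_exists('Normalizer')) {
    $nfc = Normalizer::normalize($str, Normalizer::FORM_C);
    $chars = preg_split('//u', $nfc, -1, PREG_SPLIT_NO_EMPTY);
}

// Option B: \X matches one grapheme cluster, so a base letter and its
// combining marks stay together without changing the string.
preg_match_all('/\X/u', $str, $m);
$graphemes = $m[0];

echo count($graphemes), "\n"; // 3
```

Option A changes the bytes of the string, while option B leaves them untouched and only changes where the splits fall.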

answered Mar 06 '12 at 08:56

tersmitten

score 0 · Answer 6 · answered Sep 17 '22 at 08:03

Since PHP 7.4, you can use mb_str_split:

$arr = mb_str_split($str);
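
If the code also has to run on older PHP versions, a small wrapper can fall back to the preg_split approach from the other answers. This is a sketch, and split_chars is a hypothetical helper name.

```php
<?php
// split_chars is a hypothetical helper name used for illustration.
function split_chars(string $str): array
{
    // Prefer the built-in function where it exists (PHP 7.4+ with mbstring).
    if (function_exists('mb_str_split')) {
        return mb_str_split($str, 1, 'UTF-8');
    }
    // Fallback: split between code points with PCRE in UTF-8 mode.
    $parts = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    return $parts === false ? [] : $parts;
}

print_r(split_chars("\u{00E9}\u{00E9}n"));
```

Both branches split by code point, so a decomposed é still comes back as two array elements.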

answered Sep 17 '22 at 08:03

yolenoyer

user956584 · Answer 7 · 2012-02-25T00:32:09.983

mb_internal_encoding("UTF-8");

46 arrays - off 41 arrays

edited Feb 25 '12 at 00:32

answered Feb 24 '12 at 21:51

user956584

Could you clarify this answer? – Anthony Rutledge Dec 07 '19 at 17:48
