What is the best way to split a string into an array of Unicode characters in PHP?

Question

In PHP, what is the best way to split a string into an array of Unicode characters? What if the input is not necessarily UTF-8?

I want to know whether the set of Unicode characters in an input string is a subset of another set of Unicode characters.

Why not run straight for the mb_ family of functions, as the first couple of answers didn't?

Do you realize that comparing Unicode characters is non-trivial, depending on the type of compare you want? E.g., you can write ü as either U+00FC or as U+0075 U+0308. — derobert, Sep 08 '09 at 21:34
Yes, I do realize that. If it became a problem then I would need to transform the input to one of the Unicode normal forms before the split. — joeforker, Sep 08 '09 at 22:10
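
A minimal sketch of that normalization step, assuming UTF-8 input and that the intl extension is installed (it provides the Normalizer class). The helper name split_normalized is made up for illustration; without intl it simply skips normalization.

```php
<?php
// Hypothetical helper: normalize to NFC, then split into code points.
// Assumes UTF-8 input; uses Normalizer from the intl extension when present.
function split_normalized(string $str): array
{
    if (class_exists('Normalizer')) {
        $normalized = Normalizer::normalize($str, Normalizer::FORM_C);
        if ($normalized !== false) {
            $str = $normalized;
        }
    }
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    // preg_split returns false when the subject is not valid UTF-8.
    return $chars === false ? array() : $chars;
}

$precomposed = "\u{00FC}";   // ü as a single code point
$decomposed  = "u\u{0308}";  // u followed by a combining diaeresis

// With intl available, both spellings split into the same one-element array.
var_dump(split_normalized($precomposed) === split_normalized($decomposed));
```

NFC is used here only as an example; any one normal form works as long as both sides of the comparison use the same one.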

score 23 · Accepted Answer · edited Jan 10 '12 at 19:02

You could use the 'u' modifier with PCRE regex; see Pattern Modifiers (quoting):

u (PCRE_UTF8)

This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.

For instance, consider this code:

header('Content-type: text/html; charset=UTF-8');  // So the browser doesn't make our lives harder
$str = "abc 文字化け, efg";

$results = array();
preg_match_all('/./', $str, $results);
var_dump($results[0]);

You'll get an unusable result:

array
  0 => string 'a' (length=1)
  1 => string 'b' (length=1)
  2 => string 'c' (length=1)
  3 => string ' ' (length=1)
  4 => string '�' (length=1)
  5 => string '�' (length=1)
  6 => string '�' (length=1)
  7 => string '�' (length=1)
  8 => string '�' (length=1)
  9 => string '�' (length=1)
  10 => string '�' (length=1)
  11 => string '�' (length=1)
  12 => string '�' (length=1)
  13 => string '�' (length=1)
  14 => string '�' (length=1)
  15 => string '�' (length=1)
  16 => string ',' (length=1)
  17 => string ' ' (length=1)
  18 => string 'e' (length=1)
  19 => string 'f' (length=1)
  20 => string 'g' (length=1)

But with this code:

header('Content-type: text/html; charset=UTF-8');  // So the browser doesn't make our lives harder
$str = "abc 文字化け, efg";

$results = array();
preg_match_all('/./u', $str, $results);
var_dump($results[0]);

(Notice the 'u' at the end of the regex)

You get what you want:

array
  0 => string 'a' (length=1)
  1 => string 'b' (length=1)
  2 => string 'c' (length=1)
  3 => string ' ' (length=1)
  4 => string '文' (length=3)
  5 => string '字' (length=3)
  6 => string '化' (length=3)
  7 => string 'け' (length=3)
  8 => string ',' (length=1)
  9 => string ' ' (length=1)
  10 => string 'e' (length=1)
  11 => string 'f' (length=1)
  12 => string 'g' (length=1)
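
Since the question mentions input that is not necessarily UTF-8: with the 'u' modifier, preg_match_all() returns false when the subject is not valid UTF-8 instead of splitting it. Here is a rough sketch of guarding against that. The function name split_chars and the ISO-8859-1 fallback are assumptions for illustration; you need to know (or guess) the real source encoding.

```php
<?php
// Hypothetical helper: split into characters, converting to UTF-8 first
// when the input is not valid UTF-8. $fallback is an assumed source encoding.
function split_chars(string $str, string $fallback = 'ISO-8859-1'): array
{
    if (!mb_check_encoding($str, 'UTF-8')) {
        $str = mb_convert_encoding($str, 'UTF-8', $fallback);
    }
    $results = array();
    // The extra 's' modifier lets '.' match newlines too, so none are dropped.
    if (preg_match_all('/./su', $str, $results) === false) {
        // Should not happen after the conversion above; fail loudly if it does.
        throw new RuntimeException('preg error code ' . preg_last_error());
    }
    return $results[0];
}

$latin1 = "caf\xE9";             // "café" encoded as ISO-8859-1
var_dump(split_chars($latin1));  // four elements, the last one is "é" in UTF-8
```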

Hope this helps :-)

Users of PHP 7.4 or later - scroll down to [this](https://stackoverflow.com/a/65281708/763419) answer, using ``mb_str_split()`` — William Turrell, Nov 13 '21 at 15:48

mpen · Answer 2 · 2015-05-27T15:17:20.303

14

Slightly simpler than preg_match_all:

preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY)

This gives you back a 1-dimensional array of characters. No need for a matches object.
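
For the subset check the question asks about, the split result can be fed straight into array_diff(). This is only a sketch: the function name chars_are_subset is hypothetical, and both strings are assumed to be valid UTF-8 (and already normalized, if that matters for your data).

```php
<?php
// Hypothetical helper: true when every character of $input also occurs in $allowed.
// Assumes both strings are valid UTF-8.
function chars_are_subset(string $input, string $allowed): bool
{
    $inputChars   = preg_split('//u', $input, -1, PREG_SPLIT_NO_EMPTY);
    $allowedChars = preg_split('//u', $allowed, -1, PREG_SPLIT_NO_EMPTY);
    if ($inputChars === false || $allowedChars === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    // array_diff returns the input characters that are missing from $allowed.
    return array_diff($inputChars, $allowedChars) === array();
}

var_dump(chars_are_subset("化け", "文字化け"));   // bool(true)
var_dump(chars_are_subset("化けx", "文字化け"));  // bool(false)
```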

edited May 27 '15 at 15:17

answered May 26 '15 at 20:33

This answer is the one that makes most sense, i.e. logically the goal is to split, we don't care about matching every single character (even though the same may be done in the background). I was about to answer the question with your solution, but with a little difference: the limit (3rd parameter) may have `NULL` instead of `-1` because «`-1`, `0` or `NULL` means "no limit" and, as is standard across PHP, you can use `NULL` to [skip to the flags parameter](http://php.net/manual/en/function.preg-split.php)». – Armfoot Nov 20 '15 at 13:26

Ruby Tunaley · Answer 3 · 2021-01-28T04:45:28.523

8

It's worth mentioning that since PHP 7.4 there's a built-in function, mb_str_split, that does this.

$chars = mb_str_split($str);

Unlike preg_split('//u', $str) this supports encodings other than UTF-8.
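
A short sketch of what that looks like with a non-UTF-8 string, assuming PHP 7.4+ with the mbstring extension. Shift_JIS is picked purely as an example encoding; the third argument of mb_str_split() names the encoding of the input bytes.

```php
<?php
// Requires PHP 7.4+ and the mbstring extension.
$utf8 = "文字化け";
// Build a Shift_JIS copy of the string to stand in for non-UTF-8 input.
$sjis = mb_convert_encoding($utf8, 'SJIS', 'UTF-8');

// Tell mb_str_split which encoding the bytes are in (the third argument).
$chars = mb_str_split($sjis, 1, 'SJIS');

// Convert each character back to UTF-8 so the two results can be compared.
$charsUtf8 = array_map(function ($c) {
    return mb_convert_encoding($c, 'UTF-8', 'SJIS');
}, $chars);

var_dump($charsUtf8 === mb_str_split($utf8, 1, 'UTF-8'));  // bool(true)
```

Passing the encoding explicitly avoids depending on whatever mb_internal_encoding() happens to be set to.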

edited Jan 28 '21 at 04:45

answered Dec 13 '20 at 23:16

So, let's upvote this answer. :) @joeforker, maybe check this answer as correct one? – Patryk Godowski Jan 21 '21 at 14:17
This should be the "modern day" answer. It works where "preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY)" doesn't. – scott8035 Jul 07 '23 at 10:21

score 6 · Answer 4 · answered Sep 08 '09 at 21:35

6

Try this:

preg_match_all('/./u', $text, $array);

answered Sep 08 '09 at 21:35

JasonWoof

André Hoffmann · Answer 5 · 2009-09-08T21:58:19.773

If for some reason the regex approach isn't enough for you: I once wrote Zend_Locale_UTF8, which is abandoned but might help if you decide to do it on your own.

In particular, have a look at the class Zend_Locale_UTF8_PHP5_String, which reads in Unicode strings and splits them into single characters (which may, of course, consist of multiple bytes) in order to work with them.

EDIT: I just realized that ZF's svn-browser is down, so I copied the important methods for convenience:

/**
 * Returns the UTF-8 code sequence as an array for any given $string.
 *
 * @access protected
 * @param string|integer $string
 * @return array
 */
protected function _decode( $string ) {

    $string     = (string) $string;
    $length     = strlen($string);
    $sequence   = array();

    for ( $i=0; $i<$length; ) {
        $bytes      = $this->_characterBytes($string, $i);
        $ord        = $this->_ord($string, $bytes, $i);

        if ( $ord !== false )
            $sequence[] = $ord;

        if ( $bytes === false )
            $i++;
        else
            $i  += $bytes;
    }

    return $sequence;

}

/**
 * Returns the UTF-8 code of a character.
 *
 * @see http://en.wikipedia.org/wiki/UTF-8#Description
 * @access protected
 * @param string $string
 * @param integer $bytes
 * @param integer $position
 * @return integer
 */
protected function _ord( &$string, $bytes = null, $pos=0 )
{
    if ( is_null($bytes) )
        $bytes = $this->_characterBytes($string, $pos);

    // Make sure the whole sequence starting at $pos fits inside the string.
    if ( strlen($string) >= $pos + $bytes ) {

        switch ( $bytes ) {
            case 1:
                return ord($string[$pos]);
                break;

            case 2:
                return  ( (ord($string[$pos])   & 0x1f) << 6 ) +
                        ( (ord($string[$pos+1]) & 0x3f) );
                break;

            case 3:
                return  ( (ord($string[$pos])   & 0xf)  << 12 ) + 
                        ( (ord($string[$pos+1]) & 0x3f) << 6 ) +
                        ( (ord($string[$pos+2]) & 0x3f) );
                break;

            case 4:
                // Four bytes: 3 payload bits from the lead byte, 6 from each continuation byte.
                return  ( (ord($string[$pos])   & 0x7)  << 18 ) +
                        ( (ord($string[$pos+1]) & 0x3f) << 12 ) +
                        ( (ord($string[$pos+2]) & 0x3f) << 6 ) +
                        ( (ord($string[$pos+3]) & 0x3f) );
                break;

            case 0:
            default:
                return false;
        }
    }

    return false;
}
/**
 * Returns the number of bytes of the $position-th character.
 *
 * @see http://en.wikipedia.org/wiki/UTF-8#Description
 * @access protected
 * @param string $string
 * @param integer $position
 */
protected function _characterBytes( &$string, $position = 0 ) {
    $char       = $string[$position];
    $charVal    = ord($char);

    if ( ($charVal & 0x80) === 0 )
        return 1;

    elseif ( ($charVal & 0xe0) === 0xc0 )
        return 2;

    elseif ( ($charVal & 0xf0) === 0xe0 )
        return 3;

    elseif ( ($charVal & 0xf8) === 0xf0)
        return 4;
    /*
    elseif ( ($charVal & 0xfe) === 0xf8 )
        return 5;
    */

    return false;
}
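
To cross-check a hand-written decoder like the one above, the same code point sequence can be produced with built-in functions. This is a sketch assuming PHP 7.4+ with mbstring and valid UTF-8 input; the function name utf8_code_points is made up for illustration and is not part of Zend_Locale_UTF8.

```php
<?php
// Requires PHP 7.4+ (mb_str_split) and the mbstring extension.
// Assumes $string is valid UTF-8.
function utf8_code_points(string $string): array
{
    $points = array();
    foreach (mb_str_split($string, 1, 'UTF-8') as $char) {
        $points[] = mb_ord($char, 'UTF-8');
    }
    return $points;
}

// U+0061 (1 byte), U+00E9 (2 bytes), U+6587 (3 bytes), U+1F600 (4 bytes in UTF-8).
$sample = "a\u{00E9}\u{6587}\u{1F600}";
var_dump(utf8_code_points($sample) === array(0x61, 0xE9, 0x6587, 0x1F600));  // bool(true)
```

The sample deliberately covers one character of each byte length, which is where a manual bit-shifting decoder is most likely to go wrong.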

score 1 · Answer 6 · answered Feb 02 '19 at 07:05

function str_split_unicode($str, $l = 0) {
    if ($l > 0) {
        $ret = array();
        $len = mb_strlen($str, "UTF-8");
        for ($i = 0; $i < $len; $i += $l) {
            $ret[] = mb_substr($str, $i, $l, "UTF-8");
        }
        return $ret;
    }
    return preg_split("//u", $str, -1, PREG_SPLIT_NO_EMPTY);
}
var_dump(str_split_unicode("لأآأئؤة"));

output:

array (size=7)
  0 => string 'ل' (length=2)
  1 => string 'أ' (length=2)
  2 => string 'آ' (length=2)
  3 => string 'أ' (length=2)
  4 => string 'ئ' (length=2)
  5 => string 'ؤ' (length=2)
  6 => string 'ة' (length=2)

For more information: http://php.net/manual/en/function.str-split.php

score 0 · Answer 7 · answered Sep 09 '09 at 01:23

I was able to write a solution using mb_*, including a trip to UTF-16 and back in a probably silly attempt to speed up string indexing:

$japanese2 = mb_convert_encoding($japanese, "UTF-16", "UTF-8");
$length = mb_strlen($japanese2, "UTF-16");
for($i=0; $i<$length; $i++) {
    $char = mb_substr($japanese2, $i, 1, "UTF-16");
    $utf8 = mb_convert_encoding($char, "UTF-8", "UTF-16");
    print $utf8 . "\n";
}

I had better luck avoiding mb_internal_encoding and just specifying everything at each mb_* call. I'm sure I'll wind up using the preg solution.
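
If a fixed-width intermediate encoding is the goal, UTF-32 may be a better fit than UTF-16, because UTF-16 still needs two code units for characters outside the Basic Multilingual Plane. A rough sketch, assuming valid UTF-8 input and the mbstring extension; the function name split_via_utf32 is hypothetical, and I have not measured whether this is any faster than the preg solution.

```php
<?php
// Sketch: split via a fixed-width encoding. Assumes $str is valid UTF-8.
function split_via_utf32(string $str): array
{
    if ($str === '') {
        return array();
    }
    // Every code point is exactly 4 bytes in UTF-32BE.
    $utf32 = mb_convert_encoding($str, 'UTF-32BE', 'UTF-8');
    $chars = array();
    // Plain byte-based str_split is safe here because the width is fixed.
    foreach (str_split($utf32, 4) as $unit) {
        $chars[] = mb_convert_encoding($unit, 'UTF-8', 'UTF-32BE');
    }
    return $chars;
}

var_dump(split_via_utf32("abc 文字化け"));
```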

score 0 · Answer 8 · answered May 27 '18 at 10:03

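The best way to split by length: I adapted Laravel's str_limit() function:

public static function split_text($text, $limit = 100, $end = '')
{
    if (mb_strwidth($text, 'UTF-8') <= $limit) {
        return $text;
    }
    $res = [];
    // Peel off chunks of at most $limit display columns until nothing is left.
    while ($text !== '') {
        $chunk = mb_strimwidth($text, 0, $limit, '', 'UTF-8');
        if ($chunk === '') {
            // The next character alone is wider than $limit; take it anyway.
            $chunk = mb_substr($text, 0, 1, 'UTF-8');
        }
        $res[] = rtrim($chunk) . $end;
        $text = mb_substr($text, mb_strlen($chunk, 'UTF-8'), null, 'UTF-8');
    }
    return $res;
}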
