20

In PHP, what is the best way to split a string into an array of Unicode characters? What if the input is not necessarily UTF-8?

I want to know whether the set of Unicode characters in an input string is a subset of another set of Unicode characters.

Why not run straight for the mb_ family of functions, as the first couple of answers didn't?

IMSoP
joeforker
  • Do you realize that comparing Unicode characters is non-trivial, depending on the type of comparison you want? E.g., you can write ü as either U+00FC or as U+0075 U+0308. – derobert Sep 08 '09 at 21:34
  • Yes, I do realize that. If it became a problem then I would need to transform the input to one of the Unicode normal forms before the split. – joeforker Sep 08 '09 at 22:10
  • There is an mb_ function for this since PHP 7.4. – Patryk Godowski Jan 21 '21 at 14:19

8 Answers

23

You could use the 'u' modifier with a PCRE regex; see Pattern Modifiers (quoting):

u (PCRE_UTF8)

This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.

For instance, consider this code:

header('Content-type: text/html; charset=UTF-8');  // So the browser doesn't make our lives harder
$str = "abc 文字化け, efg";

$results = array();
preg_match_all('/./', $str, $results);
var_dump($results[0]);

You'll get an unusable result:

array
  0 => string 'a' (length=1)
  1 => string 'b' (length=1)
  2 => string 'c' (length=1)
  3 => string ' ' (length=1)
  4 => string '�' (length=1)
  5 => string '�' (length=1)
  6 => string '�' (length=1)
  7 => string '�' (length=1)
  8 => string '�' (length=1)
  9 => string '�' (length=1)
  10 => string '�' (length=1)
  11 => string '�' (length=1)
  12 => string '�' (length=1)
  13 => string '�' (length=1)
  14 => string '�' (length=1)
  15 => string '�' (length=1)
  16 => string ',' (length=1)
  17 => string ' ' (length=1)
  18 => string 'e' (length=1)
  19 => string 'f' (length=1)
  20 => string 'g' (length=1)

But with this code:

header('Content-type: text/html; charset=UTF-8');  // So the browser doesn't make our lives harder
$str = "abc 文字化け, efg";

$results = array();
preg_match_all('/./u', $str, $results);
var_dump($results[0]);

(Notice the 'u' at the end of the regex.)

You get what you want:

array
  0 => string 'a' (length=1)
  1 => string 'b' (length=1)
  2 => string 'c' (length=1)
  3 => string ' ' (length=1)
  4 => string '文' (length=3)
  5 => string '字' (length=3)
  6 => string '化' (length=3)
  7 => string 'け' (length=3)
  8 => string ',' (length=1)
  9 => string ' ' (length=1)
  10 => string 'e' (length=1)
  11 => string 'f' (length=1)
  12 => string 'g' (length=1)
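
One caveat, raised in the comments on the question: `/./u` splits by code point, so a letter written with a combining mark comes back as two elements. As a rough sketch (the sample string is only an illustration, and the `\u{...}` escape assumes PHP 7.0 or later), PCRE's `\X` escape matches a whole extended grapheme cluster instead:

```php
<?php
// "u" followed by U+0308 (combining diaeresis) renders as a single "ü".
$str = "u\u{0308}";

// '.' with the 'u' modifier matches one code point at a time.
preg_match_all('/./u', $str, $byCodePoint);

// '\X' matches an extended grapheme cluster (base character plus combining marks).
preg_match_all('/\X/u', $str, $byCluster);

var_dump(count($byCodePoint[0])); // int(2)
var_dump(count($byCluster[0]));   // int(1)
```

Whether code points or clusters are the right unit depends on what "character" needs to mean for the comparison.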

Hope this helps :-)

Pascal MARTIN
  • Users of PHP 7.4 or later: scroll down to [this](https://stackoverflow.com/a/65281708/763419) answer, using ``mb_str_split()`` – William Turrell Nov 13 '21 at 15:48
14

Slightly simpler than preg_match_all:

preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY)

This gives you back a 1-dimensional array of characters. No need for a matches object.
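
For the subset check the question asks about, the split can be combined with `array_diff()`. This is only a sketch: `chars_are_subset()` is a hypothetical helper name, and it assumes both strings are supposed to be valid UTF-8.

```php
<?php
/**
 * Hypothetical helper: true if every character of $input occurs in $allowed.
 * Both arguments are assumed to be valid UTF-8.
 */
function chars_are_subset(string $input, string $allowed): bool
{
    $inputChars   = preg_split('//u', $input, -1, PREG_SPLIT_NO_EMPTY);
    $allowedChars = preg_split('//u', $allowed, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false when a subject is not valid UTF-8.
    if ($inputChars === false || $allowedChars === false) {
        throw new InvalidArgumentException('Both strings must be valid UTF-8.');
    }

    // array_diff() keeps the input characters that are missing from the allowed set.
    return array_diff($inputChars, $allowedChars) === [];
}

var_dump(chars_are_subset('化け', 'abc 文字化け')); // bool(true)
var_dump(chars_are_subset('xyz', 'abc 文字化け'));  // bool(false)
```

Note that this compares code points byte for byte, so differently normalized input would need to be normalized first.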

mpen
  • This answer is the one that makes most sense, i.e. logically the goal is to split, we don't care about matching every single character (even though the same may be done in the background). I was about to answer the question with your solution, but with a little difference: the limit (3rd parameter) may have `NULL` instead of `-1` because «`-1`, `0` or `NULL` means "no limit" and, as is standard across PHP, you can use `NULL` to [skip to the flags parameter](http://php.net/manual/en/function.preg-split.php)». – Armfoot Nov 20 '15 at 13:26
8

It's worth mentioning that since PHP 7.4 there's a built-in function, mb_str_split, that does this.

$chars = mb_str_split($str);

Unlike preg_split('//u', $str) this supports encodings other than UTF-8.
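
As an illustration of the encoding argument, here is a minimal sketch that assumes the input is known to be ISO-8859-1 (the sample string and variable names are made up for the example; it needs PHP 7.4+ with the mbstring extension):

```php
<?php
// "café" encoded as ISO-8859-1: the "é" is the single byte 0xE9.
$latin1 = "caf\xE9";

// Split using the input's own encoding (the second argument is the chunk length).
$chars = mb_str_split($latin1, 1, 'ISO-8859-1');

// Convert each character to UTF-8 so it can be compared with other UTF-8 data.
$utf8Chars = array_map(
    fn (string $c): string => mb_convert_encoding($c, 'UTF-8', 'ISO-8859-1'),
    $chars
);

var_dump($utf8Chars === ['c', 'a', 'f', "\u{00E9}"]); // bool(true)
```

If the encoding is not known in advance it has to be detected or agreed on first; `mb_str_split()` will not guess it reliably for you.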

Ruby Tunaley
6

Try this:

// Note: '.' does not match newlines unless the 's' modifier is added ('/./us').
preg_match_all('/./u', $text, $array);
JasonWoof
1

If for some reason the regex way isn't enough for you: I once wrote Zend_Locale_UTF8, which is abandoned but might help you if you decide to do it on your own.

In particular, have a look at the class Zend_Locale_UTF8_PHP5_String, which reads in Unicode strings and, in order to work with them, splits them up into single characters (which may obviously consist of multiple bytes).

EDIT: I just realized that ZF's svn-browser is down, so I copied the important methods for convenience:

/**
 * Returns the UTF-8 code sequence as an array for any given $string.
 *
 * @access protected
 * @param string|integer $string
 * @return array
 */
protected function _decode( $string ) {

    $string     = (string) $string;
    $length     = strlen($string);
    $sequence   = array();

    for ( $i=0; $i<$length; ) {
        $bytes      = $this->_characterBytes($string, $i);
        $ord        = $this->_ord($string, $bytes, $i);

        if ( $ord !== false )
            $sequence[] = $ord;

        if ( $bytes === false )
            $i++;
        else
            $i  += $bytes;
    }

    return $sequence;

}

/**
 * Returns the UTF-8 code of a character.
 *
 * @see http://en.wikipedia.org/wiki/UTF-8#Description
 * @access protected
 * @param string $string
 * @param integer|null $bytes
 * @param integer $pos
 * @return integer|false
 */
protected function _ord( &$string, $bytes = null, $pos=0 )
{
    if ( is_null($bytes) )
        $bytes = $this->_characterBytes($string, $pos);

    // Make sure the whole sequence starting at $pos lies inside the string.
    if ( $bytes !== false && strlen($string) >= $pos + $bytes ) {

        switch ( $bytes ) {
            case 1:
                return ord($string[$pos]);
                break;

            case 2:
                return  ( (ord($string[$pos])   & 0x1f) << 6 ) +
                        ( (ord($string[$pos+1]) & 0x3f) );
                break;

            case 3:
                return  ( (ord($string[$pos])   & 0xf)  << 12 ) + 
                        ( (ord($string[$pos+1]) & 0x3f) << 6 ) +
                        ( (ord($string[$pos+2]) & 0x3f) );
                break;

            case 4:
                return  ( (ord($string[$pos])   & 0x7)  << 18 ) +
                        ( (ord($string[$pos+1]) & 0x3f) << 12 ) +
                        ( (ord($string[$pos+2]) & 0x3f) << 6 ) +
                        ( (ord($string[$pos+3]) & 0x3f) );
                break;

            case 0:
            default:
                return false;
        }
    }

    return false;
}
/**
 * Returns the number of bytes of the $position-th character.
 *
 * @see http://en.wikipedia.org/wiki/UTF-8#Description
 * @access protected
 * @param string $string
 * @param integer $position
 */
protected function _characterBytes( &$string, $position = 0 ) {
    $char       = $string[$position];
    $charVal    = ord($char);

    if ( ($charVal & 0x80) === 0 )
        return 1;

    elseif ( ($charVal & 0xe0) === 0xc0 )
        return 2;

    elseif ( ($charVal & 0xf0) === 0xe0 )
        return 3;

    elseif ( ($charVal & 0xf8) === 0xf0)
        return 4;
    /*
    elseif ( ($charVal & 0xfe) === 0xf8 )
        return 5;
    */

    return false;
}
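
To make the bit arithmetic concrete, here is a standalone sketch of the 4-byte case: the lead byte contributes 3 payload bits and each of the three continuation bytes contributes 6. The function name `decode_four_byte_sequence()` is hypothetical and not part of Zend_Locale_UTF8, and the input is assumed to be a well-formed 4-byte sequence.

```php
<?php
/**
 * Hypothetical standalone helper: decode one 4-byte UTF-8 sequence
 * starting at byte offset $pos into its code point.
 */
function decode_four_byte_sequence(string $bytes, int $pos = 0): int
{
    return ((ord($bytes[$pos])     & 0x07) << 18)  // 3 payload bits from the lead byte
         | ((ord($bytes[$pos + 1]) & 0x3f) << 12)  // 6 bits from each continuation byte
         | ((ord($bytes[$pos + 2]) & 0x3f) << 6)
         |  (ord($bytes[$pos + 3]) & 0x3f);
}

// U+1F600 is encoded in UTF-8 as F0 9F 98 80.
printf("U+%X\n", decode_four_byte_sequence("\xF0\x9F\x98\x80")); // U+1F600
```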
André Hoffmann
1
function str_split_unicode($str, $l = 0) {
    if ($l > 0) {
        $ret = array();
        $len = mb_strlen($str, "UTF-8");
        for ($i = 0; $i < $len; $i += $l) {
            $ret[] = mb_substr($str, $i, $l, "UTF-8");
        }
        return $ret;
    }
    return preg_split("//u", $str, -1, PREG_SPLIT_NO_EMPTY);
}
var_dump(str_split_unicode("لأآأئؤة"));

Output:

array (size=7)
  0 => string 'ل' (length=2)
  1 => string 'أ' (length=2)
  2 => string 'آ' (length=2)
  3 => string 'أ' (length=2)
  4 => string 'ئ' (length=2)
  5 => string 'ؤ' (length=2)
  6 => string 'ة' (length=2)

For more information: http://php.net/manual/en/function.str-split.php

0

I was able to write a solution using mb_*, including a trip to UTF-16 and back in a probably silly attempt to speed up string indexing:

$japanese2 = mb_convert_encoding($japanese, "UTF-16", "UTF-8");
$length = mb_strlen($japanese2, "UTF-16");
for($i=0; $i<$length; $i++) {
    $char = mb_substr($japanese2, $i, 1, "UTF-16");
    $utf8 = mb_convert_encoding($char, "UTF-8", "UTF-16");
    print $utf8 . "\n";
}

I had better luck avoiding mb_internal_encoding and just specifying everything at each mb_* call. I'm sure I'll wind up using the preg solution.

joeforker
0

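The best way to split by length: I just adapted Laravel's str_limit() function:

public static function split_text($text, $limit = 100, $end = '')
{
    // Short input is returned unchanged (as a string, not an array).
    if (mb_strwidth($text, 'UTF-8') <= $limit) {
        return $text;
    }
    $res = [];
    $length = mb_strlen($text, 'UTF-8');
    // mb_strimwidth() takes its start offset in characters, not in width units,
    // so advance by the number of characters each chunk actually consumed.
    for ($i = 0; $i < $length; $i += mb_strlen($chunk, 'UTF-8')) {
        $chunk = mb_strimwidth($text, $i, $limit, '', 'UTF-8');
        if ($chunk === '') {
            break; // $limit is too small to fit even one character
        }
        $res[] = rtrim($chunk) . $end;
    }
    return $res;
}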
Solivan