How to split English letters, numbers and Chinese characters?

Question

To better illustrate the question, I will state a couple of inputs together with the desired outputs:

INPUT 1: This中文5142
OUTPUT 1: array('This', '中文', '5142')
INPUT 2: This 中文,5142
OUTPUT 2: array('This', '中文', '5142')

So basically, the input string may or may not contain whitespace, and the order of the English letters, numbers and Chinese characters is unknown; each kind can occur more than once.

I found this one, which does the job when there are no Chinese characters (reference: Splitting string containing letters and numbers):

$array = preg_split("/(,?\s+)|((?<=[a-z])(?=\d))|((?<=\d)(?=[a-z]))/i", $str);

I can roughly understand the above regular expression:

(,?\s+) - split on whitespace, optionally preceded by a comma
(?<=[a-z])(?=\d) - if a number is right after a letter, split them
(?<=\d)(?=[a-z]) - if a letter is right after a number, split them

So, naively, I thought I would need to do a total of 3 things:

1. if a number is right after a letter or Chinese character, split them
2. if a letter is right after a number or Chinese character, split them
3. if a Chinese character is right after a letter or number, split them

To achieve 1, I was thinking like this:

(?<=[a-z\x4E00-\x9FA5])(?=\d)

where \x4E00-\x9FA5 is meant to match Chinese characters. But this doesn't work!

Grab them one by one using `[a-zA-Z]*`, `[0-9]*` and `[^a-zA-Z0-9]*`. — ShellFish, Jul 04 '15 at 22:46
Thanks for the quick response. But do you mind writing down a complete regular expression? Thanks. — Dainy, Jul 04 '15 at 22:48

Casimir et Hippolyte · Accepted Answer · 2015-07-04T23:05:45.773

To do that in an explicit way, you can use:

$result = preg_split('~(?<!\p{Latin})(?=\p{Latin})|(?<!\p{Han})(?=\p{Han})|(?<![0-9])(?=[0-9])~u', $str, -1, PREG_SPLIT_NO_EMPTY);

(that splits the string on each boundary). Note that if you have only three kinds of characters, you can remove one of the boundaries (whichever one you want).

If you want to remove whitespace from the result, you can wrap all the alternatives in a non-capturing group and add \s* at the beginning of the pattern.
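The whitespace-stripping variant described above could be sketched as follows. The function name split_on_boundaries is hypothetical, and the prefix is [\s,]* rather than plain \s*, which is an assumption made here so that the comma in INPUT 2 is dropped as well.

```php
<?php
// Hypothetical wrapper; not part of the original answer.
function split_on_boundaries(string $str): array
{
    // [\s,]* consumes separators in front of a boundary (assumption: commas
    // are treated like whitespace); the non-capturing group holds the three
    // boundaries from the pattern above.
    $pattern = '~[\s,]*(?:(?<!\p{Latin})(?=\p{Latin})'
             . '|(?<!\p{Han})(?=\p{Han})'
             . '|(?<![0-9])(?=[0-9]))~u';

    $parts = preg_split($pattern, $str, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split returns false on failure, e.g. for invalid UTF-8 input.
    return $parts === false ? [] : $parts;
}

print_r(split_on_boundaries('This中文5142'));   // This, 中文, 5142
print_r(split_on_boundaries('This 中文,5142')); // This, 中文, 5142
```

Separators that are not followed by a boundary, such as trailing whitespace at the end of the string, are not removed by this sketch.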

However using preg_match_all may give the same result with less effort:

if (preg_match_all('~\p{Latin}+|\p{Han}+|[0-9]+~u', $str, $matches))
    $result = $matches[0];
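
For reference, here is a small sketch that runs this pattern against both inputs from the question. The helper name extract_runs is hypothetical; the pattern itself is the one shown above.

```php
<?php
// Hypothetical helper; returns each run of Latin letters, Han characters
// or ASCII digits, and skips everything else (spaces, commas, ...).
function extract_runs(string $str): array
{
    // preg_match_all returns false on failure, e.g. for invalid UTF-8 input.
    if (preg_match_all('~\p{Latin}+|\p{Han}+|[0-9]+~u', $str, $matches) === false) {
        return [];
    }
    return $matches[0];
}

print_r(extract_runs('This中文5142'));   // This, 中文, 5142
print_r(extract_runs('This 中文,5142')); // This, 中文, 5142
```

Because only the wanted runs are matched, separators never need to be stripped afterwards.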

The u modifier forces the regex engine to read the string as a UTF-8 string.
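
The u modifier is also relevant to the attempt in the question. In PCRE, \x without braces takes at most two hex digits, so \x4E00 means \x4E followed by two zeros rather than the code point U+4E00. A minimal sketch of the first rule from the question, assuming the braced form \x{4E00}-\x{9FA5} together with the u modifier is what was intended:

```php
<?php
// Rule 1 from the question: split where a digit follows a letter or a
// Chinese character. Code points above 0xFF need \x{...} plus the u modifier.
$pattern = '~(?<=[a-z\x{4E00}-\x{9FA5}])(?=\d)~iu';

$parts = preg_split($pattern, 'This中文5142');
print_r($parts); // This中文, 5142
```

This covers only that one rule, so the letters and the Chinese characters stay together in the first piece.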

Both methods work like a charm. Thanks a lot. But I may need a couple of minutes to understand the expression :( — Dainy, Jul 04 '15 at 23:02
