To better illustrate the question, I will state a couple of inputs together with the desired outputs:

  • INPUT 1: This中文5142
  • OUTPUT 1: array('This', '中文', '5142')

  • INPUT 2: This 中文,5142

  • OUTPUT 2: array('This', '中文', '5142')

So basically, the input string may or may not contain whitespace, and the English letters, numbers and Chinese characters can appear in any order and more than once.

I found this one, which does the job when there are no Chinese characters (reference: Splitting string containing letters and numbers):

$array = preg_split("/(,?\s+)|((?<=[a-z])(?=\d))|((?<=\d)(?=[a-z]))/i", $str);

I can roughly understand the above regular expression:

  1. (,?\s+) - split on whitespace, optionally preceded by a comma
  2. (?<=[a-z])(?=\d) - if a number is right after a letter, split them
  3. (?<=\d)(?=[a-z]) - if a letter is right after a number, split them
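
To make the three alternatives concrete, here is a minimal sketch that runs the pattern above on an ASCII-only string. The sample input is made up for illustration and is not one of the inputs from the question.

```php
<?php
// Pattern quoted above, written in single quotes so PHP passes the
// backslashes to PCRE unchanged. The sample string is hypothetical.
$pattern = '/(,?\s+)|((?<=[a-z])(?=\d))|((?<=\d)(?=[a-z]))/i';

// "abc" | "123": letter followed by digit (alternative 2)
// "123" | "def": separated by whitespace (alternative 1)
// "def" | "456": separated by a comma plus whitespace (alternative 1)
$parts = preg_split($pattern, 'abc123 def,  456');

print_r($parts); // abc, 123, def, 456
```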

My naive idea is that I need to do three things in total:

  1. if a number is right after a letter or Chinese character, split them
  2. if a letter is right after a number or Chinese character, split them
  3. if a Chinese character is right after a letter or number, split them

To achieve 1, I was thinking like this:

(?<=[a-z\x4E00-\x9FA5])(?=\d)

where \x4E00-\x9FA5 is meant to match Chinese characters. But this doesn't work!

Community
Dainy

1 Answer

To do that in an explicit way, you can use:

$result = preg_split('~(?<!\p{Latin})(?=\p{Latin})|(?<!\p{Han})(?=\p{Han})|(?<![0-9])(?=[0-9])~u', $str, -1, PREG_SPLIT_NO_EMPTY);

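(this splits the string at the start of each run of Latin letters, Han characters or digits). Note that if you have only three kinds of characters, you can remove one of the boundaries (the one you want), provided you do not need runs of that kind to be split off from the text before them.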
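
As a rough check of the boundary-based split, the sketch below wraps the pattern in a helper and runs it on the two inputs from the question. The function name splitOnScriptBoundaries is a hypothetical name chosen for this example.

```php
<?php
// Hypothetical helper around the boundary-based pattern from the answer.
function splitOnScriptBoundaries(string $str): array
{
    $pattern = '~(?<!\p{Latin})(?=\p{Latin})|(?<!\p{Han})(?=\p{Han})|(?<![0-9])(?=[0-9])~u';

    // PREG_SPLIT_NO_EMPTY drops the empty piece produced by the
    // zero-width match at the very start of the string.
    return preg_split($pattern, $str, -1, PREG_SPLIT_NO_EMPTY);
}

$first  = splitOnScriptBoundaries('This中文5142');   // ['This', '中文', '5142']
$second = splitOnScriptBoundaries('This 中文,5142'); // ['This ', '中文,', '5142']
```

The second call shows that the zero-width boundaries consume nothing, so the space and the comma stay attached to the preceding piece.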

If you want to remove whitespace from the result, you can wrap all the alternatives in a non-capturing group and add \s* at the beginning of the pattern.
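
A minimal sketch of that variant follows. The helper name splitDroppingSeparators and its $separators parameter are hypothetical. Passing '[\s,]*' instead of the default '\s*' is an assumption added here so that the comma in INPUT 2 is consumed as well.

```php
<?php
// Hypothetical helper: a separator prefix followed by the boundary
// alternatives wrapped in a non-capturing group.
function splitDroppingSeparators(string $str, string $separators = '\s*'): array
{
    $boundary = '(?<!\p{Latin})(?=\p{Latin})|(?<!\p{Han})(?=\p{Han})|(?<![0-9])(?=[0-9])';
    $pattern  = '~' . $separators . '(?:' . $boundary . ')~u';

    // Whatever the separator prefix matches is consumed by the split,
    // so it no longer appears in the resulting pieces.
    return preg_split($pattern, $str, -1, PREG_SPLIT_NO_EMPTY);
}

$spaces = splitDroppingSeparators('This 中文 5142');           // ['This', '中文', '5142']
$commas = splitDroppingSeparators('This 中文,5142', '[\s,]*'); // ['This', '中文', '5142']
```

Separators are only removed when they sit directly before a boundary, so trailing whitespace at the end of the string would stay in the last piece.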

However, using preg_match_all may give the same result with less effort:

if (preg_match_all('~\p{Latin}+|\p{Han}+|[0-9]+~u', $str, $matches))
    $result = $matches[0];
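
For a quick check against the inputs from the question, the same call can be wrapped in a small function. The name extractRuns and the third sample string are hypothetical additions for this sketch.

```php
<?php
// Hypothetical wrapper around the preg_match_all call from the answer.
function extractRuns(string $str): array
{
    // Each match is one run of Latin letters, Han characters or ASCII digits;
    // anything else (spaces, commas) is skipped between matches.
    if (preg_match_all('~\p{Latin}+|\p{Han}+|[0-9]+~u', $str, $matches)) {
        return $matches[0];
    }
    return [];
}

$out1 = extractRuns('This中文5142');   // ['This', '中文', '5142']
$out2 = extractRuns('This 中文,5142'); // ['This', '中文', '5142']
$out3 = extractRuns('12中文abc中文');  // ['12', '中文', 'abc', '中文']
```

Because the pattern describes what to keep rather than where to cut, separators need no special handling.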

The u modifier makes the regex engine treat both the pattern and the subject string as UTF-8.
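
The u modifier is also what allows code point ranges in a character class. As a sketch, here is the lookbehind attempt from the question with the range written in PCRE's braced form, \x{4E00}-\x{9FA5}. Without braces, \x4E00 is read as the two-digit escape \x4E followed by a literal "00". The range itself is the one given in the question and is kept only for illustration.

```php
<?php
// Step 1 from the question: split where a digit follows a letter
// or a character in the range U+4E00..U+9FA5.
// Single quotes keep \x{...} intact so that PCRE interprets it.
$pattern = '/(?<=[a-z\x{4E00}-\x{9FA5}])(?=\d)/iu';

$parts = preg_split($pattern, 'This中文5142');

print_r($parts); // This中文, 5142
```

Only the boundary before the digits is split here, since this pattern covers just the first of the three steps listed in the question.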

Casimir et Hippolyte
  • Both methods work like a charm. Thanks a lot. But I may need a couple of minutes to understand the expression :( – Dainy Jul 04 '15 at 23:02