To illustrate the question, here are a couple of inputs together with the desired outputs:
- INPUT 1: This中文5142
- OUTPUT 1: array('This', '中文', '5142')
- INPUT 2: This 中文,5142
- OUTPUT 2: array('This', '中文', '5142')
So basically, the input string may or may not contain whitespace, and the runs of English letters, digits and Chinese characters can appear in any order and more than once.
I found this pattern, which does the job when there are no Chinese characters (reference: Splitting string containing letters and numbers):
$array = preg_split("/(,?\s+)|((?<=[a-z])(?=\d))|((?<=\d)(?=[a-z]))/i", $str);
I can roughly understand the above regular expression:
- (,?\s+) - split on one or more whitespace characters, optionally preceded by a comma
- (?<=[a-z])(?=\d) - if a number is right after a letter, split them
- (?<=\d)(?=[a-z]) - if a letter is right after a number, split them
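To make the three parts above concrete, here is a minimal sketch that runs the quoted pattern on an ASCII-only string. The sample input is made up for illustration and is not one of the inputs from the question.

```php
<?php
// The pattern from the referenced answer, unchanged.
$pattern = "/(,?\s+)|((?<=[a-z])(?=\d))|((?<=\d)(?=[a-z]))/i";

// Hypothetical sample input (an assumption): letters, digits, a space and a comma.
$str = 'This5142 abc, 7';

// Without extra flags, preg_split() returns only the pieces between matches;
// the capture groups in the pattern are not included in the result.
$parts = preg_split($pattern, $str);

print_r($parts);
```

Here "This" and "5142" are separated by the zero-width letter/digit boundary, while the space and the ", " are consumed as separators, so the pieces are This, 5142, abc and 7.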
So, naively, I think I need to handle three cases:
- if a number is right after a letter or Chinese character, split them
- if a letter is right after a number or Chinese character, split them
- if a Chinese character is right after a letter or number, split them
To achieve the first of these, I was thinking of something like this:
(?<=[a-z\x4E00-\x9FA5])(?=\d)
where \x4E00-\x9FA5 is meant to match Chinese characters. But this doesn't work!
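For reference, the three cases listed above can be sketched as follows. This is only one possible approach and rests on assumptions that are not part of the original attempt: that the pattern is compiled with the u modifier (UTF-8 mode), that the \p{Han} script property is an acceptable stand-in for "Chinese character", and that any run of commas and/or whitespace counts as a separator. The function name split_mixed is hypothetical. Note also that in PCRE, \x without braces reads at most two hex digits, so a code point such as U+4E00 would have to be written \x{4E00} in UTF-8 mode.

```php
<?php
// Hypothetical helper: the name and the separator handling are assumptions,
// not something taken from the original post.
function split_mixed(string $str): array
{
    $pattern = '/[,\s]+'                      // separators: runs of commas and/or whitespace
             . '|(?<=[a-z\p{Han}])(?=\d)'     // case 1: digit right after a letter or Han character
             . '|(?<=[\d\p{Han}])(?=[a-z])'   // case 2: letter right after a digit or Han character
             . '|(?<=[a-z\d])(?=\p{Han})'     // case 3: Han character right after a letter or digit
             . '/iu';                         // i: case-insensitive, u: pattern and subject are UTF-8

    // PREG_SPLIT_NO_EMPTY drops empty pieces, e.g. from leading or trailing separators.
    return preg_split($pattern, $str, -1, PREG_SPLIT_NO_EMPTY);
}

print_r(split_mixed('This中文5142'));
print_r(split_mixed('This 中文,5142'));
```

Under those assumptions, both inputs from the question should come out as 'This', '中文', '5142'. The lookarounds are zero-width, so no characters are lost at the letter/digit/Han boundaries; only the comma and whitespace runs are consumed.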