Separate consecutive Chinese characters

Question

I have a list of strings that contain only alphabet letters and Chinese characters. I am looking for an efficient way to turn a sample such as ['江河i河流VNX', 'ws', '上午好d'] into ['江', '河', 'i', '河', '流', 'VNX', 'ws', '上', '午', '好', 'd'].

The rule is to split consecutive Chinese characters into individual characters and keep consecutive alphabet letters together.

Turned the comments into a more comprehensive answer. You're very welcome. — metatoaster, Jun 02 '21 at 06:15

score 0 · Answer 1 · answered Jun 02 '21 at 06:13

You can iterate over strings containing Chinese characters just like any other string; each character can be indexed individually.

This solution works for your input case. It may not be the most efficient way to do it, but I think it shows the general idea.

import string
english_chars = set(string.ascii_letters)  # ASCII letters, both cases

a = ['江河i河流VNX', 'ws', '上午好d']
b = []  # output list; kept separate from the input for clarity

for series in a:
    english_series = ""  # current run of consecutive ASCII letters
    for char in series:
        if char in english_chars:
            english_series += char
        else:
            # A non-letter ends the current run: flush it first.
            if english_series:
                b.append(english_series)
                english_series = ""
            b.append(char)
    # Flush a run that reaches the end of the string.
    if english_series:
        b.append(english_series)

print(b)

metatoaster · Accepted Answer · 2021-06-02T06:19:14.537

If you want to group the input solely by Latin alphabet characters and split everything else into single characters, re.findall with the pattern ([a-zA-Z]+|[^a-zA-Z]) will do it. The pattern matches either a run of one or more Latin letters or a single character from the inverse set. Example:

>>> import re
>>> re.findall('([a-zA-Z]+|[^a-zA-Z])', '江河i河流VNX')
['江', '河', 'i', '河', '流', 'VNX']

Alternatively, if you are only interested in splitting the CJK Unified Ideographs into single characters while keeping everything else as a sequence, do the inverse:

>>> re.findall(r'([\u4E00-\u9FFF]|[^\u4E00-\u9FFF]+)', '江河i河流VNX')
['江', '河', 'i', '河', '流', 'VNX']

This related thread has a more extensive discussion of finding Chinese text in a string. You can add further character ranges, either to group or to split, inside the bracket expressions of the pattern passed to re.findall.
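As a rough sketch of adding a range, suppose you also want CJK Unified Ideographs Extension A (U+3400 to U+4DBF) split into single characters. Which ranges to include is an assumption about your data, and the names CJK_RANGES and SPLIT_PATTERN are only illustrative:

```python
import re

# Ranges to split into single characters (assumed for illustration):
#   U+4E00-U+9FFF  CJK Unified Ideographs
#   U+3400-U+4DBF  CJK Unified Ideographs Extension A
CJK_RANGES = r'\u4E00-\u9FFF\u3400-\u4DBF'

# One ideograph on its own, or a run of anything that is not an ideograph.
SPLIT_PATTERN = re.compile(rf'[{CJK_RANGES}]|[^{CJK_RANGES}]+')

print(SPLIT_PATTERN.findall('江河i河流VNX'))
# ['江', '河', 'i', '河', '流', 'VNX']
```

Because the same range string is used in both bracket expressions, the split set and the grouped set stay exact complements of each other.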

To handle a list of strings, you can concatenate them with ''.join([<various strings>]) and pass the result to re.findall. Note that this merges letter runs across string boundaries, so 'VNX' followed by 'ws' comes out as 'VNXws'. If the items need to stay distinct, apply re.findall to each string separately (e.g. with a list comprehension) and then chain the resulting lists together.
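The per-string approach could look roughly like this. It is a minimal sketch using the Latin-letter pattern from above; the helper name split_all is hypothetical:

```python
import re
from itertools import chain

# Run of Latin letters, or any single other character.
PATTERN = re.compile(r'[a-zA-Z]+|[^a-zA-Z]')

def split_all(strings):
    """Apply the pattern to each string and flatten the results in order."""
    return list(chain.from_iterable(PATTERN.findall(s) for s in strings))

print(split_all(['江河i河流VNX', 'ws', '上午好d']))
# ['江', '河', 'i', '河', '流', 'VNX', 'ws', '上', '午', '好', 'd']
```

Matching each string separately keeps 'VNX' and 'ws' as separate items, which joining the list first would not.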
