If you want to group the input into runs of Latin alphabet characters and split every other character out on its own, using re.findall with the pattern ([a-zA-Z]+|[^a-zA-Z]) will achieve the goal. The pattern matches either one or more (ASCII) Latin letters, or a single character from the inverse set. Example:
>>> import re
>>> re.findall('([a-zA-Z]+|[^a-zA-Z])', '江河i河流VNX')
['江', '河', 'i', '河', '流', 'VNX']
Alternatively, if you are only interested in splitting the CJK Unified Ideographs into single characters and keeping everything else as sequences, do the inverse:
>>> re.findall(r'([\u4E00-\u9FFF]|[^\u4E00-\u9FFF]+)', '江河i河流VNX')
['江', '河', 'i', '河', '流', 'VNX']
This related thread has a more extensive discussion of finding Chinese text in a string. You may add further ranges of characters that you do or do not wish to group to the character class in the pattern passed to re.findall.
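As a rough sketch of adding another range, the snippet below also treats CJK Unified Ideographs Extension A (U+3400 to U+4DBF) as characters to split out one by one. The choice of that extra block, and the names CJK, pattern and result, are assumptions for illustration; substitute whichever ranges your input actually needs.

```python
import re

# Assumed extra range for illustration: CJK Unified Ideographs
# Extension A (U+3400-U+4DBF), in addition to U+4E00-U+9FFF.
CJK = r'\u3400-\u4DBF\u4E00-\u9FFF'

# Either a single character in the CJK ranges, or a run of anything else.
pattern = re.compile('[{0}]|[^{0}]+'.format(CJK))

# '\u3400' is the first code point of Extension A.
result = pattern.findall('\u3400河abc流')
print(result)
```

Keeping the ranges in one string and reusing it for both the positive and the negated class avoids the two halves of the pattern drifting apart when a range is added.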
To deal with a list of strings, you can either join them first with ''.join([<various strings>]) and pass the resulting string to re.findall, or, if the output for each input string needs to stay distinct, map re.findall over the input list (e.g. with a list comprehension) and then chain the results together.
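The two approaches for a list of strings might look like the following minimal sketch. The input list and the variable names are hypothetical; the input was chosen so that the difference between the two results is visible.

```python
import re
from itertools import chain

pattern = re.compile(r'[a-zA-Z]+|[^a-zA-Z]')

strings = ['ab', 'cd江']  # hypothetical input list

# Option 1: join first. Letters on either side of a string boundary
# merge into one group, because the boundary no longer exists.
joined = pattern.findall(''.join(strings))

# Option 2: match each string separately, then chain the results.
# Groups never span two input strings.
per_string = [pattern.findall(s) for s in strings]
chained = list(chain.from_iterable(per_string))

print(joined)      # ['abcd', '江']
print(per_string)  # [['ab'], ['cd', '江']]
print(chained)     # ['ab', 'cd', '江']
```

If you need to know which input each group came from, keep per_string as it is and skip the chaining step.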