I'm trying to tokenize Chinese Pinyin (without tone marks) into syllables. Consider the following code:
finals = ['a',
'o',
'e',
'ai',
'ei',
'ao',
'ou',
'an',
'ang',
'en',
'eng',
'er',
'u',
'ua',
'uo',
'uai',
'ui',
'uan',
'uang',
'un',
'ueng',
'ong',
'i',
'i',
'ia',
'ie',
'iao',
'iu',
'ian',
'iang',
'in',
'ing',
'ü',
'üe',
'üan',
'ün',
'iong']
initials = ['b',
'p',
'm',
'f',
'd',
't',
'n',
'l',
'g',
'k',
'h',
'j',
'q',
'x',
'zh',
'ch',
'sh',
'r',
'z',
'c',
's']
others = ['a',
'o',
'e',
'ai',
'ei',
'ao',
'ou',
'an',
'ang',
'en',
'eng',
'er',
'wu',
'wa',
'wo',
'wai',
'wei',
'wan',
'wang',
'wen',
'weng',
'yi',
'ya',
'ye',
'yao',
'you',
'yan',
'yang',
'yin',
'ying',
'yu',
'yue',
'yuan',
'yun',
'yong']
import re

# One outer capturing group, repeated: either initial + final, or a standalone syllable.
r = '^((%s)(%s)|%s)+$' % ('|'.join(initials), '|'.join(finals), '|'.join(others))
m = re.match(r, 'yinwei')
print(m.groups())
I was hoping to get ['yin', 'wei'] (two consecutive matches of the outer group), but for some reason I only got 'wei'. Why does this code not work, and how can I fix it? I also tried the following, but it randomly gives me either ['yin', 'wei'] or ['yi', 'wei']:
import regex
r = '|'.join({i + f for i in initials for f in finals}.union(set(others)))
print(regex.findall(r, 'yinwei'))
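For reference, both behaviours described above can be reproduced with the standard `re` module alone. The sketch below uses a tiny, hypothetical three-syllable set (not the full tables) and rests on two facts: a repeated capturing group keeps only its last iteration, and alternation takes the first alternative that matches. Because the pattern is built by joining a set of strings, whose iteration order can differ between runs, the order of the alternatives is not stable.

```python
import re

# Hypothetical mini syllable set, just enough to tokenize 'yinwei'.
syllables = {'yi', 'yin', 'wei'}

# A repeated capturing group only remembers its last iteration.
m = re.match(r'^(yin|yi|wei)+$', 'yinwei')
print(m.groups())  # ('wei',)

# Alternation is ordered: with 'yi' listed before 'yin', 'yi' wins,
# the stray 'n' is skipped, and 'wei' is found afterwards.
short_first = re.findall('yi|yin|wei', 'yinwei')
print(short_first)  # ['yi', 'wei']

# Sorting longest-first (ties broken alphabetically) makes the order
# of alternatives, and therefore the result, deterministic.
pattern = '|'.join(sorted(syllables, key=lambda s: (-len(s), s)))
tokens = re.findall(pattern, 'yinwei')
print(pattern)  # wei|yin|yi
print(tokens)   # ['yin', 'wei']
```

Longest-first ordering is still greedy per token, so on its own it does not guarantee a valid split of the whole string.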
EDIT: I was about to accept this as a duplicate of 4963691 because of ekhumuro's answer, but it doesn't work with bangongshi as input: instead of ['ban', 'gong', 'shi'] we get ['bang', 'o', 'shi']. Because of that, I would like this question to be considered separate from that one.
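One possible way to get the expected split for bangongshi is to require, via a lookahead, that everything after each token can itself be split into valid syllables up to the end of the string. The following is only a sketch under stated assumptions: the initials are assumed to be the standard set including 'b', 'zh', 'ch' and 'sh', and `tokenize` and `token_re` are hypothetical names introduced here for illustration.

```python
import re

# Assumption: standard Pinyin initials, including 'b', 'zh', 'ch', 'sh'.
initials = 'b p m f d t n l g k h j q x zh ch sh r z c s'.split()
finals = ('a o e ai ei ao ou an ang en eng er u ua uo uai ui uan uang un '
          'ueng ong i ia ie iao iu ian iang in ing ü üe üan ün iong').split()
others = ('a o e ai ei ao ou an ang en eng er wu wa wo wai wei wan wang wen '
          'weng yi ya ye yao you yan yang yin ying yu yue yuan yun yong').split()

syllables = {i + f for i in initials for f in finals} | set(others)

# Longest-first, with an alphabetical tie-break so the pattern is stable.
alternation = '|'.join(sorted(syllables, key=lambda s: (-len(s), s)))

# A token only counts if the rest of the string is a run of syllables.
token_re = re.compile(r'(?:%s)(?=(?:%s)*$)' % (alternation, alternation))


def tokenize(text):
    """Return the syllables of text, or [] if no complete split exists."""
    tokens = token_re.findall(text)
    # findall may pick up a valid suffix of an invalid string, so check
    # that the tokens cover the whole input.
    return tokens if ''.join(tokens) == text else []


print(tokenize('bangongshi'))  # ['ban', 'gong', 'shi']
print(tokenize('yinwei'))      # ['yin', 'wei']
```

Here 'bang' is tried first, but the lookahead fails on 'ongshi', so the engine backtracks to 'ban'. Genuinely ambiguous input still resolves to the longest-first reading, and the nested alternation may be slow on long strings.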