How to tokenize Pinyin, preferably using nested, overlapping regex groups?

Question

I'm trying to tokenize a Chinese Pinyin notation (without tones). Consider the following code:

finals = ['a',
        'o',
        'e',
        'ai',
        'ei',
        'ao',
        'ou',
        'an',
        'ang',
        'en',
        'eng',
        'er',
        'u',
        'ua',
        'uo',
        'uai',
        'ui',
        'uan',
        'uang',
        'un',
        'ueng',
        'ong',
        'i',
        'ia',
        'ie',
        'iao',
        'iu',
        'ian',
        'iang',
        'in',
        'ing',
        'ü',
        'üe',
        'üan',
        'ün',
        'iong']
initials = ['b',
          'p',
          'm',
          'f',
          'd',
          't',
          'n',
          'l',
          'g',
          'k',
          'h',
          'j',
          'q',
          'x',
          'zh',
          'ch',
          'sh',
          'r',
          'z',
          'c',
          's']
others = ['a',
        'o',
        'e',
        'ai',
        'ei',
        'ao',
        'ou',
        'an',
        'ang',
        'en',
        'eng',
        'er',
        'wu',
        'wa',
        'wo',
        'wai',
        'wei',
        'wan',
        'wang',
        'wen',
        'weng',
        'yi',
        'ya',
        'ye',
        'yao',
        'you',
        'yan',
        'yang',
        'yin',
        'ying',
        'yu',
        'yue',
        'yuan',
        'yun',
        'yong']

import re

r = '^((%s)(%s)|%s)+$' % ('|'.join(initials), '|'.join(finals), '|'.join(others))
m = re.match(r, 'yinwei')
print(m.groups())

I was hoping to get ['yin', 'wei'] (two consecutive outer groups), but for some reason I only got 'wei'. Why does this code not work, and how can I fix it? I also tried the following, but it randomly gives me either ['yin', 'wei'] or ['yi', 'wei']:

import regex
r = '|'.join({i + f for i in initials for f in finals}.union(set(others)))
print(regex.findall(r, 'yinwei'))

EDIT: I was about to accept this as a duplicate of 4963691 because of ekhumoro's answer, but it doesn't work with bangongshi as input: instead of ['ban', 'gong', 'shi'] we get ['bang', 'o', 'shi']. Because of that, I would like this question to be considered separate from that one.

In this particular case, neither initials nor finals match your string. But others matches both yin and wei; combine that with the greedy +$ and you consume the whole string, so you only get the last match. — Pedro Rodrigues, Feb 09 '19 at 20:25
What you should do instead is not match the whole string, but rather process the string in bits. Match once, then process the remainder. If you can give a simpler example, maybe it will be easier for me. — Pedro Rodrigues, Feb 09 '19 at 20:26
Possible duplicate of [Python RegEx multiple groups](https://stackoverflow.com/questions/4963691/python-regex-multiple-groups) — Jeff Mercado, Feb 09 '19 at 20:33

ekhumoro · Answer 1 · 2019-02-11T19:01:00.703

The re module does not accumulate groups when a group is repeated with operators like +. In your example, it will first match 'yin', then match 'wei', but it keeps only the groups from the last repetition (so m.groups() returns just ('wei', None, None)). However, your regexp still matches the whole string correctly, so m.group() returns 'yinwei'.
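A minimal sketch of this behaviour, using a reduced pattern with only two alternatives (the pattern and variable names are illustrative and not taken from the question):

```python
import re

# Reduced, illustrative pattern: a single capturing group that repeats.
pattern = r'^(yin|wei)+$'
m = re.match(pattern, 'yinwei')

whole = m.group(0)     # the entire matched string
captured = m.groups()  # only the last repetition of the group is kept

# An unanchored pattern without the repeat returns every match via findall.
tokens = re.findall(r'yin|wei', 'yinwei')

print(whole)     # yinwei
print(captured)  # ('wei',)
print(tokens)    # ['yin', 'wei']
```

The repeated group is overwritten on each iteration, which is why findall (or matching piece by piece) is the usual way to collect all tokens with re.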

It appears that the elements in your lists do not produce any overlapping combinations. That is to say: there is no initials[i] + finals[j] that is duplicated in others. However, there are overlapping elements within each list (e.g. yi|yin|ying in others), but this can be overcome by sorting the lists by descending length.

This means you can quite easily split a pinyin word into its elements like this:

import re

initials.sort(key=len, reverse=True)
finals.sort(key=len, reverse=True)
others.sort(key=len, reverse=True)

r = '(?:%s)(?:%s)|(?:%s)' % ('|'.join(initials), '|'.join(finals), '|'.join(others))
print(re.findall(r, 'yinwei'))

Output:

['yin', 'wei']
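The effect of sorting by descending length can be seen in isolation. The sketch below uses only three overlapping entries from others (the variable names are illustrative); it relies on the fact that re tries alternatives from left to right and takes the first one that succeeds, not the longest one.

```python
import re

# Three overlapping entries taken from the `others` list.
overlapping = ['yi', 'yin', 'ying']

# Alternatives are tried left to right, so the short 'yi' wins first
# and the leftover 'ng' matches nothing.
unsorted_result = re.findall('|'.join(overlapping), 'ying')

# Sorting longest-first lets 'ying' be tried before its own prefixes.
by_length = sorted(overlapping, key=len, reverse=True)
sorted_result = re.findall('|'.join(by_length), 'ying')

print(unsorted_result)  # ['yi']
print(sorted_result)    # ['ying']
```

This is probably also why the set-based pattern in the question behaves randomly: the iteration order of a set of strings can differ between runs, so the order of the alternatives changes with it.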

UPDATE:

After looking at a reliable source, it seems that your method of parsing pinyin is too simplistic. The table of combinations shows that not all possibilities are valid. It also shows that some combinations are ambiguous (from a purely syntactical point of view). For example, liang can be parsed as either [l + iang], or [l + i], [ang]. Also, not all continuations are valid, so some look-behind assertions will be needed. This suggests that a much more sophisticated approach will be required than simply matching sequentially from left to right. After some searching, I found a previous question that appears to cover the same issues:

Optimizing a regular expression to parse chinese pinyin

However, it seems to be far from straightforward to solve this with a single regexp, so you may want to consider looking for a third-party library that knows how to deal with all the awkward edge cases.
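To illustrate why greedy left-to-right matching falls short, here is a rough sketch of a backtracking segmenter that lists every possible split instead of committing to the first one. It is not a complete pinyin parser: the function name segmentations and the tiny SAMPLE_SYLLABLES set are assumptions made for this example, and a real solution would need the full table of valid syllables.

```python
def segmentations(text, syllables):
    """Return every way to split `text` into items of `syllables`."""
    if not text:
        return [[]]  # exactly one way to split the empty string
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in syllables:
            # Keep this prefix, then try every split of the remainder.
            for rest in segmentations(text[end:], syllables):
                results.append([head] + rest)
    return results


# Hypothetical sample for illustration, NOT a complete syllable table.
SAMPLE_SYLLABLES = {'ban', 'bang', 'gong', 'shi', 'o', 'li', 'ang', 'liang'}

print(segmentations('bangongshi', SAMPLE_SYLLABLES))
# [['ban', 'gong', 'shi']]
print(segmentations('liang', SAMPLE_SYLLABLES))
# [['li', 'ang'], ['liang']]
```

With this sample, the 'bang' + 'o' branch is abandoned because 'ngshi' cannot be split any further, so only ['ban', 'gong', 'shi'] survives. The liang case still returns two parses, so some extra rule or a dictionary is needed to choose between them. The plain recursion can also become slow on long inputs, where memoization would help.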

Thanks! Unfortunately the example doesn't seem to work for `bangongshi`. Do you have an idea why? — d33tah, Feb 11 '19 at 11:31
@d33tah It seems the approach suggested in your original question underestimates the complexity of the problem. See my updated answer. — ekhumoro, Feb 11 '19 at 19:07
