
I have a working regular expression to parse PinYin which matches every valid PinYin and does not match invalid ones. I am wondering how I can optimize it.

^(?P<initial>ch|zh|sh|r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z|)
 (?P<final>
   (?:(?<=ch)uang|(?<=ch)ang|(?<=ch)eng|(?<=ch)ong|(?<=ch)uai|(?<=ch)uan|(?<=ch)ai|(?<=ch)an|(?<=ch)ao|(?<=ch)en|(?<=ch)ou|(?<=ch)ua|(?<=ch)ui|(?<=ch)un|(?<=ch)uo|(?<=ch)a|(?<=ch)e|(?<=ch)i|(?<=ch)u)
  |(?:(?<=zh)uang|(?<=zh)ang|(?<=zh)eng|(?<=zh)ong|(?<=zh)uai|(?<=zh)uan|(?<=zh)ai|(?<=zh)an|(?<=zh)ao|(?<=zh)ei|(?<=zh)en|(?<=zh)ou|(?<=zh)ua|(?<=zh)ui|(?<=zh)un|(?<=zh)uo|(?<=zh)a|(?<=zh)e|(?<=zh)i|(?<=zh)u)
  |(?:(?<=sh)uang|(?<=sh)ang|(?<=sh)eng|(?<=sh)uai|(?<=sh)uan|(?<=sh)ai|(?<=sh)an|(?<=sh)ao|(?<=sh)ei|(?<=sh)en|(?<=sh)ou|(?<=sh)ua|(?<=sh)ui|(?<=sh)un|(?<=sh)uo|(?<=sh)a|(?<=sh)e|(?<=sh)i|(?<=sh)u)
  |(?:(?<=c)ang|(?<=c)eng|(?<=c)ong|(?<=c)uan|(?<=c)ai|(?<=c)an|(?<=c)ao|(?<=c)en|(?<=c)ou|(?<=c)ui|(?<=c)un|(?<=c)uo|(?<=c)a|(?<=c)e|(?<=c)i|(?<=c)u)
  |(?:(?<=b)ang|(?<=b)eng|(?<=b)ian|(?<=b)iao|(?<=b)ing|(?<=b)ai|(?<=b)an|(?<=b)ao|(?<=b)ei|(?<=b)en|(?<=b)ie|(?<=b)in|(?<=b)a|(?<=b)i|(?<=b)o|(?<=b)u)
  |(?:(?<=d)ang|(?<=d)eng|(?<=d)ian|(?<=d)iao|(?<=d)ing|(?<=d)ong|(?<=d)uan|(?<=d)ai|(?<=d)an|(?<=d)ao|(?<=d)ei|(?<=d)en|(?<=d)ia|(?<=d)ie|(?<=d)iu|(?<=d)ou|(?<=d)ui|(?<=d)un|(?<=d)uo|(?<=d)a|(?<=d)e|(?<=d)i|(?<=d)u)
  |(?:(?<!r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z)a|(?<!r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z)ai
  |(?<!r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z)an|(?<!r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z)ang
  |(?<!r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z)ao|(?<!r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z)e

Above is an abbreviated version for sake of readability. The whole expression can be found at the end of this post.

I am specifically wondering whether passing two or more prefixes to a single lookbehind before each ending would improve performance:

 (?<=ch|zh|sh)uang|(?<=ch|zh|sh)ang...
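A minimal sketch in Python of what such a merged lookbehind could look like. Python's `re` accepts alternation inside a lookbehind as long as every alternative has the same width, which holds for `ch|zh|sh`. The names `separate` and `merged` are illustrative only, and the sketch says nothing about speed. Merging is only safe for finals shared by all three initials: in the lists above, `sh` has no `ong` and `ch` has no `ei`.

```python
import re

# One lookbehind per initial, in the style used in the question.
separate = re.compile(r"^(?:ch|zh|sh)(?:(?<=ch)uang|(?<=zh)uang|(?<=sh)uang)$")

# A single lookbehind listing all three initials. Every alternative is two
# characters wide, so the fixed-width lookbehind rule of Python's re holds.
merged = re.compile(r"^(?:ch|zh|sh)(?:(?<=ch|zh|sh)uang)$")

# Both forms should agree on every input tried here.
for syllable in ("chuang", "zhuang", "shuang", "cuang", "huang"):
    assert bool(separate.match(syllable)) == bool(merged.match(syllable))
```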

Thanks for your time and suggestions.

whole regex:

 ^(?P<initial>ch|zh|sh|r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z|)(?P<final>(?:(?<=ch)uang|(?<=ch)ang|(?<=ch)eng|(?<=ch)ong|(?<=ch)uai|(?<=ch)uan|(?<=ch)ai|(?<=ch)an|(?<=ch)ao|(?<=ch)en|(?<=ch)ou|(?<=ch)ua|(?<=ch)ui|(?<=ch)un|(?<=ch)uo|(?<=ch)a|(?<=ch)e|(?<=ch)i|(?<=ch)u)|(?:(?<=zh)uang|(?<=zh)ang|(?<=zh)eng|(?<=zh)ong|(?<=zh)uai|(?<=zh)uan|(?<=zh)ai|(?<=zh)an|(?<=zh)ao|(?<=zh)ei|(?<=zh)en|(?<=zh)ou|(?<=zh)ua|(?<=zh)ui|(?<=zh)un|(?<=zh)uo|(?<=zh)a|(?<=zh)e|(?<=zh)i|(?<=zh)u)|(?:(?<=sh)uang|(?<=sh)ang|(?<=sh)eng|(?<=sh)uai|(?<=sh)uan|(?<=sh)ai|(?<=sh)an|(?<=sh)ao|(?<=sh)ei|(?<=sh)en|(?<=sh)ou|(?<=sh)ua|(?<=sh)ui|(?<=sh)un|(?<=sh)uo|(?<=sh)a|(?<=sh)e|(?<=sh)i|(?<=sh)u)|(?:(?<=c)ang|(?<=c)eng|(?<=c)ong|(?<=c)uan|(?<=c)ai|(?<=c)an|(?<=c)ao|(?<=c)en|(?<=c)ou|(?<=c)ui|(?<=c)un|(?<=c)uo|(?<=c)a|(?<=c)e|(?<=c)i|(?<=c)u)|(?:(?<=b)ang|(?<=b)eng|(?<=b)ian|(?<=b)iao|(?<=b)ing|(?<=b)ai|(?<=b)an|(?<=b)ao|(?<=b)ei|(?<=b)en|(?<=b)ie|(?<=b)in|(?<=b)a|(?<=b)i|(?<=b)o|(?<=b)u)|(?:(?<=d)ang|(?<=d)eng|(?<=d)ian|(?<=d)iao|(?<=d)ing|(?<=d)ong|(?<=d)uan|(?<=d)ai|(?<=d)an|(?<=d)ao|(?<=d)ei|(?<=d)en|(?<=d)ia|(?<=d)ie|(?<=d)iu|(?<=d)ou|(?<=d)ui|(?<=d)un|(?<=d)uo|(?<=d)a|(?<=d)e|(?<=d)i|(?<=d)u)|(?:(?<=g)uang|(?<=g)ang|(?<=g)eng|(?<=g)ong|(?<=g)uai|(?<=g)uan|(?<=g)ai|(?<=g)an|(?<=g)ao|(?<=g)ei|(?<=g)en|(?<=g)ou|(?<=g)ua|(?<=g)ui|(?<=g)un|(?<=g)uo|(?<=g)a|(?<=g)e|(?<=g)u)|(?:(?<=f)ang|(?<=f)eng|(?<=f)iao|(?<=f)an|(?<=f)ei|(?<=f)en|(?<=f)ou|(?<=f)a|(?<=f)o|(?<=f)u)|(?:(?<!sh|ch|zh)(?<=h)uang|(?<!sh|ch|zh)(?<=h)ang|(?<!sh|ch|zh)(?<=h)eng|(?<!sh|ch|zh)(?<=h)ong|(?<!sh|ch|zh)(?<=h)uai|(?<!sh|ch|zh)(?<=h)uan|(?<!sh|ch|zh)(?<=h)ai|(?<!sh|ch|zh)(?<=h)an|(?<!sh|ch|zh)(?<=h)ao|(?<!sh|ch|zh)(?<=h)ei|(?<!sh|ch|zh)(?<=h)en|(?<!sh|ch|zh)(?<=h)ou|(?<!sh|ch|zh)(?<=h)ua|(?<!sh|ch|zh)(?<=h)ui|(?<!sh|ch|zh)(?<=h)un|(?<!sh|ch|zh)(?<=h)uo|(?<!sh|ch|zh)(?<=h)a|(?<!sh|ch|zh)(?<=h)e|(?<!sh|ch|zh)(?<=h)u)|(?:(?<=k)uang|(?<=k)ang|(?<=k)eng|(?<=k)ong|(?<=k)uai|(?<=k)uan|(?<=k)ai|(?<=k)an|(?<=k)ao|(?<=k)en|(?<=k)ou|(?<=k)ua|(?<=k)ui|(?<=k)un|(?<=k)uo|(?<=k)a|(?<=k)e|(?<=k)u)|(?:(?<=j)iang|(?<=j)iong|(?<=j)ian|(?<=j)iao|(?<=j)ing|(?<=j)üan|(?<=j)ia|(?<=j)ie|(?<=j)in|(?<=j)iu|(?<=j)üe|(?<=j)ün|(?<=j)i|(?<=j)ü)|(?:(?<=m)ang|(?<=m)eng|(?<=m)ian|(?<=m)iao|(?<=m)ing|(?<=m)ai|(?<=m)an|(?<=m)ao|(?<=m)ei|(?<=m)en|(?<=m)ie|(?<=m)in|(?<=m)iu|(?<=m)ou|(?<=m)a|(?<=m)e|(?<=m)i|(?<=m)o|(?<=m)u)|(?:(?<=l)iang|(?<=l)ang|(?<=l)eng|(?<=l)ian|(?<=l)iao|(?<=l)ing|(?<=l)ong|(?<=l)uan|(?<=l)ai|(?<=l)an|(?<=l)ao|(?<=l)ei|(?<=l)ia|(?<=l)ie|(?<=l)in|(?<=l)iu|(?<=l)ou|(?<=l)un|(?<=l)uo|(?<=l)üe|(?<=l)a|(?<=l)e|(?<=l)i|(?<=l)o|(?<=l)u|(?<=l)ü)|(?:(?<=n)iang|(?<=n)ang|(?<=n)eng|(?<=n)ian|(?<=n)iao|(?<=n)ing|(?<=n)ong|(?<=n)uan|(?<=n)ai|(?<=n)an|(?<=n)ao|(?<=n)ei|(?<=n)en|(?<=n)ie|(?<=n)in|(?<=n)iu|(?<=n)ou|(?<=n)un|(?<=n)uo|(?<=n)üe|(?<=n)a|(?<=n)e|(?<=n)i|(?<=n)u|(?<=n)ü)|(?:(?<=q)iang|(?<=q)iong|(?<=q)ian|(?<=q)iao|(?<=q)ing|(?<=q)üan|(?<=q)ia|(?<=q)ie|(?<=q)in|(?<=q)iu|(?<=q)üe|(?<=q)ün|(?<=q)i|(?<=q)ü)|(?:(?<=p)ang|(?<=p)eng|(?<=p)ian|(?<=p)iao|(?<=p)ing|(?<=p)ai|(?<=p)an|(?<=p)ao|(?<=p)ei|(?<=p)en|(?<=p)ie|(?<=p)in|(?<=p)ou|(?<=p)a|(?<=p)i|(?<=p)o|(?<=p)u)|(?:(?<=s)ang|(?<=s)eng|(?<=s)ong|(?<=s)uan|(?<=s)ai|(?<=s)an|(?<=s)ao|(?<=s)en|(?<=s)ou|(?<=s)ui|(?<=s)un|(?<=s)uo|(?<=s)a|(?<=s)e|(?<=s)i|(?<=s)u)|(?:(?<=r)ang|(?<=r)eng|(?<=r)ong|(?<=r)uan|(?<=r)an|(?<=r)ao|(?<=r)en|(?<=r)ou|(?<=r)ua|(?<=r)ui|(?<=r)un|(?<=r)uo|(?<=r)e|(?<=r)i|(?<=r)u)|(?:(?<=t)ang|(?<=t)eng|(?<=t)ian|(?<=t)iao|(?<=t)ing|(?<=t)ong|(?<=t)uan|(?<=t)ai|(?<=t)an|(?<=t)ao|(?<=t)ei|(?<=t)ie|(?<=t)ou|(?<=t)ui|(?<=t)un|(?<=t)uo|(?<=t)a|(?<=t)e|(?<=t)i|(?<=t)u)|(?:(?<=w)ang|(?<=w)eng|(?<=w)ai|(?<=w)an|(?<=w)ei|(?<=w)en|(?<=w)a|(?<=w)o|(?<=w)u)|(?:(?<=y)ang|(?<=y)ing|(?<=y)ong|(?<=y)uan|(?<=y)ai|(?<=y)an|(?<=y)ao|(?<=y)in|(?<=y)ou|(?<=y)ue|(?<=y)un|(?<=y)a|(?<=y)e|(?<=y)e|(?<=y)i|(?<=y)o|(?<=y)u)|(?:(?<=x)iang|(?<=x)iong|(?<=x)ian|(?<=x)iao|(?<=x)ing|(?<=x)üan|(?<=x)ia|(?<=x)ie|(?<=x)in|(?<=x)iu|(?<=x)üe|(?<=x)ün|(?<=x)i|(?<=x)ü)|(?:(?<=z)ang|(?<=z)eng|(?<=z)ong|(?<=z)uan|(?<=z)ai|(?<=z)an|(?<=z)ao|(?<=z)ei|(?<=z)en|(?<=z)ou|(?<=z)ui|(?<=z)un|(?<=z)uo|(?<=z)a|(?<=z)e|(?<=z)i|(?<=z)u)|(?:(?<!r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z)a|(?<!r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z)ai|(?<!r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z)an|(?<!r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z)ang|(?<!r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z)ao|(?<!r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z)e|(?<!r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z)ei|(?<!r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z)en|(?<!r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z)eng|(?<!r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z)er|(?<!r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z)o|(?<!r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z)ou))$
Alex Brooks
tike
  • ^ programming language that is, i stared at that question for a couple minutes before i got it haha - could just be my Monday brain tho – im so confused Jun 17 '13 at 21:22
  • what language do you use? are named groups needed? – Casimir et Hippolyte Jun 17 '13 at 21:22
  • I used PCRE-compliant syntax so that I can use this RE in any language (and thereby avoid writing this code again and again). My target languages would be Python, Go, C/C++ and JavaScript as of now. The named groups are not desperately needed, but I'd prefer keeping them. – tike Jun 17 '13 at 21:39
  • You can exclude JavaScript, which does not have a PCRE-compliant regex engine and doesn't support lookbehinds – Casimir et Hippolyte Jun 17 '13 at 22:00
  • @flexy: JavaScript is not pcre-compatible. It neither supports named groups nor lookbehind. – Bergi Jun 18 '13 at 01:43
  • thanks, I'm relatively new to javascript and wasn't aware of that – tike Jun 18 '13 at 02:16
  • Interesting, what would you be using this for? – icedwater Jun 18 '13 at 03:38
  • text parsing and input validation, mainly. I wrote programmatic parsers for this, but figured, that it might be better to do it like this. – tike Jun 18 '13 at 16:28
  • I took a different approach when creating a pinyin regex. http://stackoverflow.com/questions/20736291/regex-for-matching-pinyin/20736292#20736292 – stevendaniels Dec 23 '13 at 02:37

1 Answer


Assuming that your regex engine supports lookbehinds, atomic groups and possessive quantifiers (all of which are PCRE features), here are some examples of what can be replaced:

  • all (?: by (?>

  • the beginning (the whole first named group) by:

    ^(?P<initial>(?>[csz]h?+|[bdfghj-npqrtwxy])?)

  • this part* by:

    |(?<![csz]h)(?<=h)(?>a(?>[io]|ng?+)?|e(?>i|ng?+)?|o(?>u|ng)|u(?>[ino]|a(?>i|ng?+)?)?)

*( ie: |(?:(?<!sh|ch|zh)(?<=h)uang|(?<!sh|ch|...|(?<!sh|ch|zh)(?<=h)u) )

  • the last part* by:

    |(?<![bcdfghj-np-tw-z])(?>a(?>[io]|ng?+)?|e(?>[ir]|ng?+)?|ou?+))$

*( ie:|(?:(?<!r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z)a|(?<!r|c|b|d|...))$ )
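The factored pieces above can be checked against the original lists with a short script. This is only a sketch in Python: the atomic groups and possessive quantifiers are written as `(?:` and `?` so that it compiles on any Python 3, the lookbehinds are left out because only the factored alternatives are being compared, and the helper name `accepted` is made up for the illustration.

```python
import re
from itertools import product

# Initials from the original first named group (the empty alternative is
# covered by the trailing "?" of the factored form).
INITIALS = "ch zh sh r c b d g f h k j m l n q p s t w y x z".split()
factored_initial = re.compile(r"(?:[csz]h?|[bdfghj-npqrtwxy])?")

# Finals allowed without an initial, from the last part of the original regex.
ZERO_INITIAL_FINALS = "a ai an ang ao e ei en eng er o ou".split()
factored_zero = re.compile(r"(?:a(?:[io]|ng?)?|e(?:[ir]|ng?)?|ou?)")

def accepted(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return {c for c in candidates if pattern.fullmatch(c)}

# Every string of zero, one or two lowercase ASCII letters.
letters = "abcdefghijklmnopqrstuvwxyz"
short = [""] + list(letters) + [a + b for a in letters for b in letters]
assert accepted(factored_initial, short) == set(INITIALS) | {""}

# Every string of one to three letters drawn from the letters used by these finals.
candidates = ["".join(p) for n in range(1, 4) for p in product("aeginoru", repeat=n)]
assert accepted(factored_zero, candidates) == set(ZERO_INITIAL_FINALS)
```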

How to deal with the other parts:

example:

(?:(?<=ch)uang|(?<=ch)ang|(?<=ch)eng|(?<=ch)ong|(?<=ch)uai|(?<=ch)uan|(?<=ch)ai|(?<=ch)an|(?<=ch)ao|(?<=ch)en|(?<=ch)ou|(?<=ch)ua|(?<=ch)ui|(?<=ch)un|(?<=ch)uo|(?<=ch)a|(?<=ch)e|(?<=ch)i|(?<=ch)u)

All parts of this kind share the same lookbehind; apply the following steps to each of them.

# step 1: lookarounds factorization

(?<=ch)(?>uang|ang|eng|ong|uai|uan|ai|an|ao|en|ou|ua|ui|un|uo|a|e|i|u)

# step 2: sort all the content by alphabetic order (intermediate form only: in an atomic group the first alternative that matches wins, so a would block ai until step 4)

(?<=ch)(?>a|ai|an|ang|ao|e|en|eng|i|ong|ou|u|ua|uai|uan|uang|ui|un|uo)

# step 3: group by first letter: don't forget the ? if the letter can be alone

(?<=ch)(?>a(?>i|n|ng|o)?|e(?>n|ng)?|i|o(?>ng|u)|u(?>a|ai|an|ang|i|n|o)?)

# step 4: reduce the terminations (ie: n & ng => ng?+) 

(?<=ch)(?>a(?>i|ng?+|o)?|e(?>ng?+)?|i|o(?>ng|u)|u(?>a(?>i|ng?+)?|i|n|o)?)

# step 5: put single letters in a character class

(?<=ch)(?>a(?>[io]|ng?+)?|e(?>ng?+)?|i|o(?>ng|u)|u(?>a(?>i|ng?+)?|[ino])?)
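A quick way to check a factorization of this kind is to compare it with the original list of alternatives by brute force. The sketch below does that in Python for the `ch` finals. It assumes a plain `ch` prefix instead of the lookbehind, writes the groups as `(?:` and `?` so that it runs on any Python 3, and keeps `uang` from the original alternation. The function name `accepted_finals` is made up for the illustration.

```python
import re
from itertools import product

# Finals listed after (?<=ch) in the original alternation.
CH_FINALS = "uang ang eng ong uai uan ai an ao en ou ua ui un uo a e i u".split()

# Factored form of the same list, without atomic or possessive syntax.
factored = re.compile(
    r"ch(?:a(?:[io]|ng?)?|e(?:ng?)?|i|o(?:ng|u)|u(?:a(?:i|ng?)?|[ino])?)"
)

def accepted_finals(pattern, alphabet="aeginou", max_len=4):
    """Collect every string over `alphabet` (length 1..max_len) accepted after 'ch'."""
    found = set()
    for n in range(1, max_len + 1):
        for combo in product(alphabet, repeat=n):
            final = "".join(combo)
            if pattern.fullmatch("ch" + final):
                found.add(final)
    return found

# The factored pattern should accept exactly the original finals.
assert accepted_finals(factored) == set(CH_FINALS)
```

The same check can be repeated for each initial by swapping the prefix and the list of finals.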

conclusion

The result is shorter, but brevity is not the goal here; optimization is. The factorization reduces the number of tests, and the atomic groups and possessive quantifiers reduce the number of backtracks.

some limitations

Note that regex features like atomic groups and possessive quantifiers are not supported by all regex flavors, but it is possible to remedy the problem:

  • for flavors that don't support possessive quantifiers: change ?+ to ?
  • for flavors that don't support atomic groups: change (?> to (?:
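As a sketch of these two substitutions, the snippet below downgrades the `h` part shown earlier so that it compiles with Python's `re` on versions that lack atomic groups and possessive quantifiers. The helper `to_portable` is hypothetical and deliberately naive: a plain text replacement is enough for the patterns in this post, but it would mangle an escaped `\?` followed by `+`.

```python
import re

def to_portable(pattern: str) -> str:
    """Hypothetical helper: replace (?> with (?: and ?+ with ? by plain text substitution."""
    return pattern.replace("(?>", "(?:").replace("?+", "?")

# The replacement for the h part, as written with PCRE features.
pcre_style = (
    r"(?<![csz]h)(?<=h)"
    r"(?>a(?>[io]|ng?+)?|e(?>i|ng?+)?|o(?>u|ng)|u(?>[ino]|a(?>i|ng?+)?)?)"
)
portable = to_portable(pcre_style)

# Anchor it behind a literal h to try it out on a few syllables.
h_syllable = re.compile("^h" + portable + "$")
assert h_syllable.match("huang")
assert h_syllable.match("hei")
assert not h_syllable.match("hi")
```

For this pattern the downgraded form should accept the same syllables, because no alternative inside a group is a prefix of another; only the amount of backtracking on failed matches differs.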

(Note that there is a trick to emulate atomic groups in Python; you could wrap the whole pattern in it and test the result with a timer. See this incredible post: Do Python regular expressions have an equivalent to Ruby's atomic grouping?)
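The trick usually meant here is a capturing lookahead followed by a backreference, `(?=(...))\1`. A minimal sketch on a toy pattern rather than the pinyin regex:

```python
import re

# Ordinary group: after "a" fails to be followed by "c", the engine
# backtracks into the group and tries "ab".
plain = re.compile(r"(?:a|ab)c")

# Emulated atomic group: the lookahead captures the first alternative that
# matches, the backreference consumes exactly that text, and the engine does
# not re-enter the lookahead to try another alternative.
emulated = re.compile(r"(?=(a|ab))\1c")

assert plain.fullmatch("abc")             # matches through the "ab" branch
assert emulated.fullmatch("abc") is None  # committed to "a", so "c" cannot follow
assert emulated.fullmatch("ac")           # the first branch alone still works
```

The backreference is numbered, so in a pattern that already contains groups (named groups count too) the number has to be adjusted accordingly.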

Some regex engines, such as JavaScript's, do not support lookbehinds. In this case, you must rewrite the whole pattern using only alternations (i.e. |), which isn't a bad thing, since lookbehinds make your pattern slower, and give up the named captures, which are not supported either. (In this context, note that to remove the negative lookbehinds you need to put the syllables described in those parts before all the others, so that they are matched first.)

other ways of optimization

  • rewrite your pattern without lookbehinds and with | instead
  • sort the different lines by the most used syllables
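One possible shape for the lookbehind-free rewrite is one branch per initial, each followed by its own finals. The sketch below builds such a pattern in Python from a small table. Only three initials are included, with finals copied from the original regex; the names `FINALS_BY_INITIAL` and `build_pattern` are made up for the illustration, and the named groups are dropped.

```python
import re

# Finals per initial, copied from the matching parts of the original regex.
# The empty key stands for syllables without an initial.
FINALS_BY_INITIAL = {
    "ch": "uang ang eng ong uai uan ai an ao en ou ua ui un uo a e i u",
    "f": "ang eng iao an ei en ou a o u",
    "": "a ai an ang ao e ei en eng er o ou",
}

def build_pattern(table):
    """Build ^(?:initial(?:final|final|...)|...)$ using alternation only."""
    branches = []
    for initial, finals in table.items():
        branches.append(re.escape(initial) + "(?:" + "|".join(finals.split()) + ")")
    return re.compile("^(?:" + "|".join(branches) + ")$")

pinyin = build_pattern(FINALS_BY_INITIAL)

assert pinyin.match("chuang")
assert pinyin.match("fo")
assert pinyin.match("er")
assert not pinyin.match("chei")  # ch has no ei in the original list
assert not pinyin.match("fi")    # f has no i in the original list
```

Each branch could then be factored by hand as in the steps above, and the branches reordered by syllable frequency if such data is available.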
Casimir et Hippolyte