Pyparsing for unicode letters

Question

I need to use pyparsing with Unicode characters. I tried the simple example from their GitHub repository, using the French word cédille, and it raises an error.

My code

from pyparsing import Word, alphas
greet = Word(alphas) + "," + Word(alphas) + "!"
hello = "Hello, cédille!"
greet.parseString(hello)

and it gives this error:

pyparsing.ParseException: Expected "!" (at char 8), (line:1, col:9)

Is there a way to solve this problem?

`alphas` appears to be pure ASCII only. There's a definition `alphas8bit` which is either misnamed or also not helpful. — tripleee, Nov 29 '19 at 14:37
`alphas8bit` dates back to early Python2 time, when the 128-255 alpha characters (with bit8 set, hence the name) were added. — PaulMcG, Nov 29 '19 at 16:18

score 2 · Answer 1 · answered Nov 29 '19 at 16:24

Pyparsing has the pyparsing_unicode module that defines a number of unicode character ranges with definitions for alphas, nums, and so on within each range. Ranges include CJK, Cyrillic, Devanagari, Hebrew, Arabic, and others. The greetingInGreek.py and greetingInKorean.py examples in the examples directory show a couple of these in action.

Your example, using the Latin1 set, will look like:

from pyparsing import Word, pyparsing_unicode as ppu
intl_alphas = ppu.Latin1.alphas
greet = Word(intl_alphas) + "," + Word(intl_alphas) + "!"
hello = "Hello, cédille!"
print(greet.parseString(hello))

Prints:

['Hello', ',', 'cédille', '!']

alphas8bit will probably be kept for legacy support, but new applications should use pyparsing_unicode.Latin1.alphas.

In Python 2.7 this still gives the error `pyparsing.ParseException: Expected "!", found '\xa9' (at char 9), (line:1, col:10)` — asdfkjasdfjk, Dec 02 '19 at 08:23
That looks like you forgot to declare the source file's encoding, or failed to mark the string as a Unicode string. — tripleee, Dec 02 '19 at 08:57
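
To illustrate the comment above, here is a minimal sketch (runnable on Python 3) of what a Python 2 byte string would hold if the source is UTF-8 and the literal is not marked as Unicode. The UTF-8 assumption and the variable names are mine, not from the thread. The byte at index 9 is `\xa9`, which matches the character named in the error message.

```python
# -*- coding: utf-8 -*-
# Assumption: the source file is saved as UTF-8.
hello = u"Hello, c\u00e9dille!"   # "Hello, cédille!" as a Unicode string

# An unmarked Python 2 str literal would hold these raw bytes instead.
raw = hello.encode("utf-8")

print(len(hello))   # 15 characters
print(len(raw))     # 16 bytes: the accented letter takes two bytes
print(raw[8:10])    # b'\xc3\xa9', the two bytes that encode that letter
```

If that is the cause, marking the literal as `u"..."` (together with an encoding declaration at the top of the file) should let Python 2 hand pyparsing one character per letter.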

score 1 · Answer 2 · answered Nov 29 '19 at 14:53

alphas is apparently English / pure ASCII only. The following appears to work:

from pyparsing import Word, alphas, alphas8bit
greet = Word(alphas+alphas8bit) + "," + Word(alphas+alphas8bit) + "!"
hello = "Hello, cédille!"
greet.parseString(hello)

This is Unicode, so there is nothing particularly "8-bit" about the character é; but if the documentation is at least approximately correct, I guess it will still break with slightly more exotic accented characters (anything not available in Latin-1, like Czech or Polish accented characters, or go extreme and try Vietnamese).

Maybe explore the unicodedata module to get a proper enumeration of "alphabetic" characters, or find a third-party module which exposes this Unicode feature properly.
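
A rough sketch of the `unicodedata` idea, using only the standard library. It treats every code point whose general category starts with "L" as a letter; that definition of "alphabetic" and the names `unicode_letters` and `all_alphas` are my own choices for illustration, not anything pyparsing provides.

```python
import sys
import unicodedata


def unicode_letters():
    """Return a string of all code points in a Unicode letter category (L*)."""
    return "".join(
        ch
        for ch in map(chr, range(sys.maxunicode + 1))
        if unicodedata.category(ch).startswith("L")
    )


all_alphas = unicode_letters()

# Spot checks: ASCII e, e-acute (Latin-1), u-caron, Polish l-stroke, a digit.
for ch in ("e", "\u00e9", "\u01d4", "\u0142", "1"):
    print(repr(ch), ch in all_alphas)
```

The resulting string could then be passed to `Word(all_alphas)` in place of `alphas`. It is a very large character set, so this may be slow to construct; I have not measured it.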

answered Nov 29 '19 at 14:53 by tripleee

Indeed, `"Hello, cǔdille!"` still gives me a traceback. – tripleee Nov 29 '19 at 14:57
See https://stackoverflow.com/questions/3094498/how-can-i-check-if-a-python-unicode-string-contains-non-western-letters for a couple of approaches, or https://stackoverflow.com/questions/36187349/python-regex-for-unicode-capitalized-words for some other ideas. I *think* the accepted answer at https://stackoverflow.com/questions/92438/stripping-non-printable-characters-from-a-string-in-python can be adapted to do what I propose, but I haven't investigated properly. – tripleee Nov 29 '19 at 15:06
In Python 2.7 I still get the error `pyparsing.ParseException: Expected "!", found '\xa9' (at char 9), (line:1, col:10)` – asdfkjasdfjk Dec 02 '19 at 08:33
