Match unicode in ply's regexes

Question

I'm matching identifiers, but now I have a problem: my identifiers are allowed to contain unicode characters. Therefore the old way to do things is not enough:

t_IDENTIFIER = r"[A-Za-z](\\.|[A-Za-z_0-9])*"

In my markup language parser I match Unicode characters by allowing every character except the ones I explicitly use, because my markup language has only two or three characters that I need to escape that way.

How do I match all Unicode characters with Python regexes and ply? Also, is this a good idea at all?

I'd like to let people use things like Ω » « ° foo² väli π as identifiers (variable names and such) in their programs. Heck! I want people to be able to write programs in their own language if that's practical! Anyway, Unicode is supported in a wide variety of places nowadays, and it should spread.

Edit: POSIX character classes don't seem to be recognised by Python regexes.

>>> import re
>>> item = re.compile(r'[[:word:]]')
>>> print item.match('e')
None

Edit: To explain better what I need: I need a regex that matches all printable Unicode characters but no ASCII characters at all.

Edit: r"\w" does part of what I want, but it does not match « », and I also need a regex that does not match numbers.

It also does not appear that Python PCRE understands predicate classes either: \p{IsAlpha} — Axeman, Oct 27 '08 at 03:37

score 5 · Accepted Answer · answered Oct 26 '08 at 21:18

The re module supports the \w syntax, which the documentation describes as follows:

If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.

Therefore the following examples show how to match Unicode identifiers:

>>> import re
>>> m = re.compile(r'(?u)[^\W0-9]\w*')
>>> m.match('a')
<_sre.SRE_Match object at 0xb7d75410>
>>> m.match('9')
>>> m.match('ab')
<_sre.SRE_Match object at 0xb7c258e0>
>>> m.match('a9')
<_sre.SRE_Match object at 0xb7d75410>
>>> m.match('unicöde')
<_sre.SRE_Match object at 0xb7c258e0>
>>> m.match('ödipus')
<_sre.SRE_Match object at 0xb7d75410>

So the expression you are looking for is: (?u)[^\W0-9]\w*
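
For readers on Python 3, where str patterns are Unicode-aware by default, the same pattern can be checked with a short sketch. The names IDENTIFIER and is_identifier are illustrative only and are not part of ply or of the original answer.

```python
import re

# Same pattern as in the answer. In Python 3 the (?u) behaviour is already
# the default for str patterns, so the flag is kept only for clarity.
IDENTIFIER = re.compile(r'(?u)[^\W0-9]\w*')


def is_identifier(text):
    """Return True if the whole string is a single identifier."""
    return IDENTIFIER.fullmatch(text) is not None


if __name__ == '__main__':
    for candidate in ['a', 'a9', 'unicöde', 'π', '9a', '«']:
        print(candidate, is_identifier(candidate))
```

Note that \w covers letters, digits and the underscore only, so characters such as « or ° are still rejected by this pattern.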

Florian Bösch

Now. This is a satisfying solution! – Cheery Oct 26 '08 at 21:22
The quote from the Python documentation is correct, but the examples are misleading. You can simply use the UNICODE flag with \w instead of the unnecessarily long expression given: `re.match(r'\w+', "ünıcodê", re.UNICODE)` – Walter Oct 26 '08 at 21:53
Walter, you have not properly read the question: 1) the goal is an identifier in a programming language, which does not start with 0-9 usually. 2) the parser (ply) takes care of parsing, and it can't be controlled how it will invoke match, therefore (?u) is required. – Florian Bösch Oct 27 '08 at 07:35
Re: controlling how ply invokes match, see Stanislav's answer below – Paul Du Bois Dec 20 '11 at 03:34

score 4 · Answer 2 · answered Dec 14 '11 at 10:26

You need to pass the reflags parameter to lex.lex:

lex.lex(reflags=re.UNICODE)
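
As a rough illustration of why this helps: flags handed to the regex compiler from outside behave like an inline flag in the pattern, so the token regex no longer needs the (?u) prefix. The sketch below uses only the standard re module, not ply, and assumes Python 3, where Unicode matching is already the default for str patterns; re.ASCII is used only to make the effect of a compile-time flag visible. The helper compile_token is hypothetical and merely stands in for a lexer that compiles token regexes with caller-supplied flags.

```python
import re

PATTERN = r'[^\W0-9]\w*'  # identifier regex without an inline flag


def compile_token(pattern, reflags=0):
    """Hypothetical stand-in for a lexer that compiles token regexes
    with flags supplied by the caller."""
    return re.compile(pattern, reflags)


unicode_token = compile_token(PATTERN, re.UNICODE)  # flag passed from outside
inline_token = compile_token('(?u)' + PATTERN)      # flag embedded in the pattern
ascii_token = compile_token(PATTERN, re.ASCII)      # contrast: ASCII-only \w

if __name__ == '__main__':
    word = 'ödipus'
    print(unicode_token.match(word) is not None)
    print(inline_token.match(word) is not None)
    print(ascii_token.match(word) is not None)
```

How ply combines reflags with its own default flags may differ between versions, so it is worth checking the documentation of the version in use.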

Stan

score 1 · Answer 3 · edited May 23 '17 at 10:27

Check the answers to this question:

Stripping non printable characters from a string in python

You'd just need to use the other Unicode character categories instead.
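
One way to read this suggestion, sketched under the assumption of Python 3 and the standard unicodedata module: classify each character by its Unicode general category and accept the categories an identifier may contain. The category choices and the function names below are illustrative assumptions, not something the linked question prescribes.

```python
import unicodedata


def is_identifier_start(ch):
    """Accept letters (L*), plus non-ASCII symbols (S*) and punctuation (P*)."""
    category = unicodedata.category(ch)
    if category.startswith('L'):
        return True
    return ord(ch) > 127 and category[0] in ('S', 'P')


def is_identifier_part(ch):
    """After the first character, also accept numbers (N*), marks (M*) and '_'."""
    category = unicodedata.category(ch)
    return is_identifier_start(ch) or category[0] in ('N', 'M') or ch == '_'


def is_identifier(text):
    """Return True if every character of a non-empty string is acceptable."""
    return (bool(text)
            and is_identifier_start(text[0])
            and all(is_identifier_part(ch) for ch in text[1:]))


if __name__ == '__main__':
    for candidate in ['Ω', '«', '°', 'foo²', 'väli', '9a', 'a+b']:
        print(candidate, is_identifier(candidate))
```

A per-character predicate like this cannot be dropped into a ply token regex directly; it would have to be turned into a character class or applied inside a token function.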

answered Oct 26 '08 at 16:58

Vinko Vrsalovic

Cheery · Answer 4 · 2008-10-26T17:26:13.443

Solved it with the help of Vinko.

I realised that getting unicode range is plain dumb. So I'll do this:

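import re

# Printable ASCII punctuation; these may not appear anywhere in an identifier.
symbols = re.escape(''.join([chr(i) for i in xrange(33, 127) if not chr(i).isalnum()]))
# Punctuation plus digits; these may not start an identifier.
symnums = re.escape(''.join([chr(i) for i in xrange(33, 127) if not chr(i).isalpha()]))

t_IDENTIFIER = "[^%s](\\.|[^%s])*" % (symnums, symbols)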

I don't know about Unicode character classes. If this Unicode stuff starts getting too complicated, I can just put the original regex back in place. UTF-8 is still supported in STRING tokens, which is more important.

Edit: On the other hand, I'm starting to understand why there's not much Unicode support in programming languages. This is an ugly hack, not a satisfying solution.
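
The exclusion idea can be sketched for Python 3 as follows. This is a simplified variant, not the exact expression above: it leaves out the backslash-escape alternative and additionally excludes whitespace, which is an assumption added here. The name IDENTIFIER is illustrative.

```python
import re
import string

# ASCII punctuation that may never appear in an identifier.
symbols = re.escape(string.punctuation)

# First character: no ASCII punctuation, ASCII digits, or whitespace.
# Later characters: no ASCII punctuation or whitespace.
IDENTIFIER = re.compile(r'[^%s0-9\s][^%s\s]*' % (symbols, symbols))

if __name__ == '__main__':
    for candidate in ['väli', 'Ω', '«', 'foo²', '9foo', 'foo+bar']:
        match = IDENTIFIER.match(candidate)
        print(candidate, match.group() if match else None)
```

Because the underscore counts as ASCII punctuation here, a name such as foo_bar is not accepted as a single identifier, which mirrors the character ranges used above.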

score 0 · Answer 5 · answered Oct 26 '08 at 16:37

Probably POSIX character classes are right for you?

Tomalak

They don't exist in Python's regex engine – Vinko Vrsalovic Oct 26 '08 at 16:49
