How does regex in Python 3 deal with diacritics?

Question

I am trying to parse strings with Python 3 and the re module, using the pattern "(c,c,c)" where c is one character chosen from (a,b,ë,ɪ̈). I wrote something like this:

import re

src = "(a,b,ɪ̈)"
pattern = "[abëɪ̈]"
for r in re.finditer(r'\({0},{0},{0}\)'.format(pattern), src):
    print(r.group())

But the regex doesn't work with ɪ̈: Python treats ɪ̈ as two characters (ɪ plus a combining diaeresis), so the regex cannot match "(a,b,ɪ̈)". I don't have the same problem with ë: Python treats ë as one character, and my regex matches "(a,b,ë)", giving the expected result. I tried normalizing both src and pattern with unicodedata.normalize('NFD', ...), without success.

How can I solve this problem? Any help would be appreciated.

PS: I fixed some typos thanks to pythonm.

Take a look at this [solution](http://stackoverflow.com/questions/2758921/regular-expression-that-finds-and-replaces-non-ascii-characters-with-python). — David, Sep 04 '12 at 16:26
this works: `re.findall( r'\({0},{0},ɪ̈\)'.format("[abëɪ̈]"), "(a,b,ɪ̈)")` -> `['(a,b,ɪ̈)']`. Note: `ɪ̈` is matched literally not via `[]`. — jfs, Sep 04 '12 at 16:38

jfs · Accepted Answer · 2012-09-04T16:56:22.127

You could use | to work around it:

#!/usr/bin/env python3
import re

print(re.findall(r'\({0},{0},{0}\)'.format("(?:[abë]|ɪ̈)"), "(a,b,ɪ̈)"))
# -> ['(a,b,ɪ̈)']
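If more than one multi-code-point symbol is involved, the same idea can be generalized. The following is only a sketch: `one_of` is a hypothetical helper name, not part of the re module, and the symbols are written with escapes so that their code points are unambiguous. It builds a non-capturing alternation, listing longer symbols first so that a base letter alone cannot shadow the base letter plus its combining mark.

```python
import re

def one_of(symbols):
    """Hypothetical helper: build a non-capturing alternation of symbols.

    Longer symbols come first, so a multi-code-point symbol is tried
    before any shorter symbol that happens to be its prefix.
    """
    ordered = sorted(symbols, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(s) for s in ordered) + ")"

# a, b, e-with-diaeresis (one code point), small capital i + combining diaeresis
symbols = ["a", "b", "\u00eb", "\u026a\u0308"]
c = one_of(symbols)
pattern = re.compile(r"\({0},{0},{0}\)".format(c))

text = "(a,b,\u026a\u0308) (\u00eb,a,b) (a,b,\u026a)"
matches = pattern.findall(text)
print(matches)
```

With this input the first two groups should match, while the last one should not, because a bare ɪ without the diaeresis is not in the symbol list.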

The alternation treats ɪ̈ as a sequence of two code points (ɪ followed by a combining diaeresis), as the re.DEBUG output shows:

re.compile(r'[abë]|ɪ̈', re.DEBUG)

output:

branch 
  in 
    literal 97
    literal 98
    literal 235
or
  literal 618 
  literal 776
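The numbers in the debug output are code points: 235 is U+00EB (ë), 618 is U+026A (ɪ) and 776 is U+0308 (combining diaeresis). A small sketch with unicodedata suggests why normalization does not help here; the constant names are illustrative, and the strings are written with escapes to make the code points explicit. Unicode has a precomposed ë, so NFC can keep it as one code point, but it has no precomposed ɪ̈, so that symbol stays two code points in both forms.

```python
import unicodedata

E_DIAERESIS = "\u00eb"                  # precomposed e with diaeresis
SMALL_CAP_I_DIAERESIS = "\u026a\u0308"  # small capital i + combining diaeresis

def code_points(text, form):
    """Return the code points of text after normalizing it to the given form."""
    return [ord(ch) for ch in unicodedata.normalize(form, text)]

for label, text in (("e-diaeresis", E_DIAERESIS),
                    ("i-diaeresis", SMALL_CAP_I_DIAERESIS)):
    for form in ("NFC", "NFD"):
        print(label, form, code_points(text, form))
```

Since a character class such as `[...]` matches exactly one code point, a symbol that remains two code points after normalization has to be matched as a literal sequence, which is what the alternation does.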

edited Sep 04 '12 at 16:56

answered Sep 04 '12 at 16:44

jfs


An interesting answer to my question. Thank you for your help. – suizokukan Sep 04 '12 at 17:00
I obviously misunderstood what the regular expression was intended to be. Good answer. – Stumpy Joe Pete Sep 04 '12 at 19:12
