
I am trying to use Python 3 and the re module to parse strings of the form "(c,c,c)", where c is one character chosen from (a,b,ë,ɪ̈ ). I wrote something like this:

import re

src = "(a,b,ɪ̈)"
pattern = "[abëɪ̈]"  # character class: ɪ̈ contributes two separate code points
for r in re.finditer(r'\({0},{0},{0}\)'.format(pattern), src):
    print(r.group())

But the regex fails with ɪ̈: Python treats ɪ̈ as two characters (ɪ plus a combining diaeresis), so the character class matches only one of them at a time and the regex cannot match "(a,b,ɪ̈)". I don't have this problem with ë: Python treats ë as a single character, and my regex matches "(a,b,ë)" as expected. I tried normalizing both src and pattern with unicodedata.normalize('NFD', ...), without success.
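To make the difference between the two letters concrete, here is a small sketch that lists the code points behind each one. The helper name `describe` is made up for this illustration, and the strings are written with escapes so the code points are unambiguous. It also suggests why normalization does not help: Unicode has no precomposed form of ɪ̈, so NFC leaves it as two code points, while NFD splits ë into two as well.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs for every code point in text."""
    return [("U+{:04X}".format(ord(ch)), unicodedata.name(ch)) for ch in text]

e_diaeresis = "\u00eb"        # ë as a single precomposed code point
i_diaeresis = "\u026a\u0308"  # ɪ followed by a combining diaeresis

print(describe(e_diaeresis))
print(describe(i_diaeresis))

# NFC cannot compose ɪ̈ into one code point; NFD decomposes ë into two.
print(len(unicodedata.normalize("NFC", i_diaeresis)))
print(len(unicodedata.normalize("NFD", e_diaeresis)))
```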

How can I solve this problem? Any help would be appreciated.

PS: I fixed some typos thanks to pythonm.

suizokukan
  • You forgot the : on the third line and swapped characters on the second. – unddoch Sep 04 '12 at 16:24
  • Take a look at this [solution](http://stackoverflow.com/questions/2758921/regular-expression-that-finds-and-replaces-non-ascii-characters-with-python). – David Sep 04 '12 at 16:26
  • this works: `re.findall( r'\({0},{0},ɪ̈\)'.format("[abëɪ̈]"), "(a,b,ɪ̈)")` -> `['(a,b,ɪ̈)']`. Note: `ɪ̈` is matched literally not via `[]`. – jfs Sep 04 '12 at 16:38

1 Answer


You could use | to work around it:

#!/usr/bin/env python3
import re

print(re.findall(r'\({0},{0},{0}\)'.format("(?:[abë]|ɪ̈)"), "(a,b,ɪ̈)"))
# -> ['(a,b,ɪ̈)']

The above treats ɪ̈ as two characters:

re.compile(r'[abë]|ɪ̈', re.DEBUG)

output:

branch 
  in 
    literal 97
    literal 98
    literal 235
or
  literal 618 
  literal 776 
jfs