I have this script to test a regex and how unicode behaves:
# -*- coding: utf-8 -*-
import re
p = "Solo voy si se sucedierón o se suceden mañana los siguienñes eventos:"
w = re.findall('[a-zA-ZÑñ]+',p.decode('utf-8'), re.UNICODE)
print(w)
And the print statement shows this:
[u'Solo', u'voy', u'si', u'se', u'sucedier', u'n', u'o', u'se', u'suceden', u'ma', u'ana', u'los', u'siguien', u'es', u'eventos']
"sucedierón" is being split into "u'sucedier', u'n'", and similarly "mañana" becomes "u'ma', u'ana'".
I have tried decoding, and adding '\xc3\xb1a' to the regex for 'Ñ'.
Later, after reading some docs, I realized that [a-zA-Z] matches only ASCII characters. That is why I changed the pattern to r'\b\w+\b', so that I can add flags to the regex:
w = re.findall(r'\b\w+\b', p, re.UNICODE)
But this didn't work.
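For comparison, here is a minimal sketch of how the two patterns behave on text that is already a unicode string rather than a byte string. The variable names and the shortened sample are only illustrative, and unicode_literals is assumed so that the literal is unicode on Python 2.7 as well:

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import re

# Hypothetical shortened sample; a unicode string thanks to unicode_literals.
sample = "sucedierón mañana"

# An explicit ASCII range stops at every accented letter.
ascii_only = re.findall(r'[a-zA-Z]+', sample)

# \w combined with re.UNICODE treats accented letters as word characters.
unicode_words = re.findall(r'\w+', sample, re.UNICODE)
```

Here ascii_only holds the four fragments 'sucedier', 'n', 'ma', 'ana', while unicode_words holds the two complete words. The flag only has this effect when the input is a unicode string; it does not decode bytes.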
I also tried to decode() first and findall() later:
p = "Solo voy si se sucedierón o se suceden mañana los siguienñes eventos:"
U = p.decode('utf8')
If I print the variable U:
"Solo voy si se sucedierón o se suceden mañana los siguienñes eventos:"
I see that the output is as expected, but when I use findall() again I get:
[u'Solo', u'voy', u'si', u'se', u'sucedier\xf3n', u'o', u'se', u'suceden', u'ma\xf1ana', u'los', u'siguien\xf1es', u'eventos']
Now the words are complete, but ó is replaced with \xf3 and ñ is replaced with \xf1, their unicode values.
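A small check, sketched under the assumption that the terminal can display UTF-8, suggests that \xf3 and \xf1 may only be the escaped form Python 2.7 uses when it prints a list, not a change to the strings themselves. The names below are illustrative:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import re

text = "sucedierón mañana"  # hypothetical shortened sample
words = re.findall(r'\w+', text, re.UNICODE)

# The escape \xf3 and the character ó are the same code point.
same = (words[0] == 'sucedier\xf3n')

# Printing the list shows the repr of each element (escaped on Python 2.7).
print(words)

# Printing the strings themselves shows the real characters
# (assumes the terminal encoding can represent them).
joined = ' '.join(words)
print(joined)
```

If same is True, the list already contains the full words, and only the way the list is displayed differs.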
How can I use findall() and get the non-ASCII characters "ñ", "á", "é", "í", "ó", "ú"?
I know there are a lot of questions of this kind on SO, and believe me, I have read a lot of them, but I just cannot find the missing part.
EDIT
I am using Python 2.7.
EDIT 2 Can someone else try what @LetzerWille suggests? It is not working for me.