
I have this script to test a regex and how unicode behaves:

# -*- coding: utf-8 -*-
import re

p = "Solo voy si se sucedierón o se suceden mañana los siguienñes eventos:"

w = re.findall('[a-zA-ZÑñ]+',p.decode('utf-8'), re.UNICODE)

print(w)

And the print statement is showing this:

[u'Solo', u'voy', u'si', u'se', u'sucedier', u'n', u'o', u'se', u'suceden', u'ma', u'ana', u'los', u'siguien', u'es', u'eventos']

"sucedierón" is being transformed to "u'sucedier', u'n'", and similarly "mañana" becomes "u'ma', u'ana'".

I have tried decoding, and adding '\xc3\xb1a' to the regex for 'Ñ'.

Later, after reading some docs, I realized that [a-zA-Z] only matches ASCII characters. That is why I switched to r'\b\w+\b', so that I can add flags to the regex:

w = re.findall(r'\b\w+\b', p, re.UNICODE) 

But this didn't work.

I also tried to decode() first and findall() later:

p = "Solo voy si se sucedierón o se suceden mañana los siguienñes eventos:"
U = p.decode('utf8')

If I print variable U

"Solo voy si se sucedierón o se suceden mañana los siguienñes eventos:"

I see that the output is as expected, but when I use the findall() again:

[u'Solo', u'voy', u'si', u'se', u'sucedier\xf3n', u'o', u'se', u'suceden', u'ma\xf1ana', u'los', u'siguien\xf1es', u'eventos']

Now the words are complete, but ó is shown as \xf3 and ñ is shown as \xf1, their Unicode escape values.

How can I use findall() and get the non-ASCII characters "ñ", "á", "é", "í", "ó", "ú"?

I know there are a lot of questions like this on SO, and believe me, I have read a lot of them, but I just cannot find the missing part.

EDIT

I am using python 2.7

EDIT 2 Can someone else try what @LetzerWille suggests? It is not working for me.

arturomp
NachoMiguel

3 Answers


Regex with accented characters (diacritics) in Python

The re.UNICODE flag allows you to use word characters \w and word boundaries \b with diacritics (accents and tildes). This is extremely useful to match words in different languages.

  1. Decode your text from UTF-8 to unicode.
  2. Make sure the pattern and the subject text are both passed as unicode to the regex functions.
  3. The result is a list of unicode strings that can be looped over or mapped to encode them back to UTF-8.
  4. Printing the list shows non-ASCII characters escaped (the list displays the repr of each item), but it's safe to print each string independently.

Code:

# -*- coding: utf-8 -*-
# http://stackoverflow.com/q/32872917/5290909
#python 2.7.9

import re

text = "Solo voy si se sucedierón o se suceden mañana los siguienñes eventos:"
# Decode to unicode
unicode_text = text.decode('utf8')

matches = re.findall(ur'\b\w+\b', unicode_text, re.UNICODE)

# Encode back again to UTF-8
utf8_matches = [ match.encode('utf-8') for match in matches ]

# Print every word
for utf8_word in utf8_matches:
    print utf8_word

ideone Demo
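To illustrate point 4, here is a minimal sketch of the difference between printing the list and printing its contents. It is written to run on both Python 2.7 and Python 3.3+, so it builds the UTF-8 bytes from a unicode literal and uses r'\w+' instead of the Python 2-only ur'' prefix; the sample text and variable names are my own.

```python
# -*- coding: utf-8 -*-
import re

# Build UTF-8 bytes from a unicode literal so this runs on Python 2.7 and 3.3+
raw = u"sucedierón mañana".encode('utf-8')
text = raw.decode('utf-8')

matches = re.findall(r'\w+', text, re.UNICODE)

# Printing a list shows the repr() of each item;
# on Python 2 that escapes non-ASCII characters (e.g. u'ma\xf1ana')
print(matches)

# Joining gives a single unicode string, which prints the real characters
joined = u' '.join(matches)
print(joined)
```

The escapes in the list output are only a display form; the strings themselves contain the real characters.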

Mariano
  • Ah, I missed OP's second solution. You might want to mention that the solution the OP has with `r'\b\w+\b'` is already correct. – nhahtdh Oct 01 '15 at 08:38

Your code should be written as:

w = re.findall(u'[a-zA-ZÑñ]+', p.decode('utf-8'))

Please add other characters into the character class on your own, since I don't know the full set of characters you want to match.
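As a sketch of what an extended character class could look like, the following assumes you want the Spanish accented vowels, ü and ñ in both cases. That character set and the name SPANISH_WORD are my own assumptions; adjust them to the letters you actually need. It runs on Python 2.7 and 3.3+.

```python
# -*- coding: utf-8 -*-
import re

# Assumed character set: ASCII letters plus Spanish accented vowels, ü and ñ
SPANISH_WORD = re.compile(u'[a-zA-ZÁÉÍÓÚÜÑáéíóúüñ]+')

# UTF-8 bytes, standing in for the str object from the question
p = u"Solo voy si se sucedierón o se suceden mañana los siguienñes eventos:".encode('utf-8')

# Decode first, so that pattern and subject are both unicode
words = SPANISH_WORD.findall(p.decode('utf-8'))

for word in words:
    print(word)
```

Note that an explicit class like this only matches precomposed characters. If the input may contain decomposed forms (a base letter followed by a combining accent), normalizing with unicodedata.normalize('NFC', ...) first would be needed.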

When you are processing Unicode text, make sure that both the input string and the pattern are of unicode type [1].

[1] unicode is logically an array of UTF-16 code units (in narrow builds) or UTF-32 code units/code points (in wide builds). If you intend to process Unicode text with Python, to avoid the issue with astral plane characters in narrow builds, I recommend using Python 3.3 or above, or always using a wide build for other versions.
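If you are unsure which kind of build you have, a quick check along these lines should tell you. This is a small sketch; the variable names are my own.

```python
import sys

# sys.maxunicode is 0xFFFF on a narrow Python 2 build,
# and 0x10FFFF on wide builds and on Python 3.3+
is_wide = sys.maxunicode > 0xFFFF

# A code point outside the Basic Multilingual Plane (astral plane)
astral = u'\U0001F600'

# One code point on wide builds, two UTF-16 code units on narrow builds
units = len(astral)

print(is_wide)
print(units)
```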

In Python 2, str is simply an array of bytes, so characters outside ASCII range in the pattern will simply be interpreted as the sequence of bytes making up that character in the source encoding:

>>> [i for i in '[a-zA-ZÑñ]+']
['[', 'a', '-', 'z', 'A', '-', 'Z', '\xc3', '\x91', '\xc3', '\xb1', ']', '+']  

Compare output of re.DEBUG when compiling the str and unicode object:

>>> re.compile('[a-zA-ZÑñ]+', re.DEBUG)
max_repeat 1 4294967295
  in
    range (97, 122)
    range (65, 90)
    literal 195      # \xc3
    literal 145      # \x91
    literal 195
    literal 177
<_sre.SRE_Pattern object at 0x6fffffd0dd8>

>>> re.compile(u'[a-zA-ZÑñ]+', re.DEBUG)
max_repeat 1 4294967295
  in
    range (97, 122)
    range (65, 90)
    literal 209      # Ñ
    literal 241      # ñ
<_sre.SRE_Pattern object at 0x6ffffded030>

Since your pattern does not use shorthand classes such as \s, \w, or \d, the re.UNICODE flag has no effect and can be removed.
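A short sketch to confirm this: with an explicit character class the results are identical with and without the flag, while the flag matters for \w on Python 2. The sample text and variable names are my own; the code runs on Python 2.7 and 3.3+.

```python
# -*- coding: utf-8 -*-
import re

text = u"mañana sucedierón"

# Explicit character class: re.UNICODE changes nothing
explicit = u'[a-zA-ZÑñó]+'
same = re.findall(explicit, text) == re.findall(explicit, text, re.UNICODE)

# Shorthand class: on Python 2 the flag decides whether ñ and ó
# count as word characters (Python 3 str patterns are Unicode by default)
with_flag = re.findall(r'\w+', text, re.UNICODE)

print(same)
for word in with_flag:
    print(word)
```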

Community
nhahtdh

It works for me. I use PyCharm and I have set the console to UTF-8.

You need to configure your output console to UTF-8.

p = "Solo voy si se sucedierón o se suceden mañana los siguienñes eventos:"

w = re.findall('ñ',p, re.UNICODE)

print(w)

['ñ', 'ñ']

w = re.findall('[a-zA-ZÑñó:]+',p, re.UNICODE)

print(w)

['Solo', 'voy', 'si', 'se', 'sucedierón', 'o', 'se', 'suceden', 'mañana', 'los', 'siguienñes', 'eventos:']
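To see which encoding your console is actually using, something like the following may help. It is only a sketch; note that sys.getdefaultencoding() is a separate setting from the console encoding, and it is normally 'ascii' on Python 2 regardless of how the console is configured.

```python
import sys

# Encoding Python uses when writing unicode text to the console.
# It can be None, for example on Python 2 when the output is piped.
console_encoding = getattr(sys.stdout, 'encoding', None)

# Encoding used for implicit str <-> unicode conversions on Python 2
# ('ascii' by default); this is not what the console uses
default_encoding = sys.getdefaultencoding()

print(console_encoding)
print(default_encoding)
```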
LetzerWille
  • I am sorry, how can I set the console to utf-8? – NachoMiguel Sep 30 '15 at 18:25
  • @NachoMiguel It depends on your environment. Every editor uses its own set of system variables. For PyCharm it is -Dconsole.encoding=UTF-8 in the vmoptions file, as the last line. – LetzerWille Sep 30 '15 at 18:29
  • Do you know if the letter ñ and the accents will be displayed correctly if I use this on a website? – NachoMiguel Sep 30 '15 at 18:56
  • Just did this. It didn't work. Thanks for the advice. – NachoMiguel Sep 30 '15 at 19:20
  • @NachoMiguel I see that you use p.decode ... It is used in Python 3. In Python 2.x you should use unicode(p) instead. Also run sys.getdefaultencoding() to see the default config for your system. It may not be utf-8. – LetzerWille Sep 30 '15 at 19:25
  • Yes sir, you are right, I get "ascii". I have changed the file: I added `-Dconsole.encoding=utf-8` to pycharm.exe.vmoptions, and I have restarted the IDE many times now. I do not know why I cannot change it. – NachoMiguel Sep 30 '15 at 19:29
  • Go to Settings --> File Encodings. Leave IDE encoding as it is. Change Project encoding to utf-8, and change the default encoding for property files to utf-8. – LetzerWille Sep 30 '15 at 19:35
  • Same. If I cannot find the solution I will send feedback to the PyCharm guys. – NachoMiguel Sep 30 '15 at 19:40