python regex to find accented words

Question

I'm having trouble finding accented words in a Spanish text. I need to search a large text for the first paragraph that starts with the words 'Nombre vernáculo'.
For example, the text looks like: "Nombre vernáculo registrado en la zona de ..."
But accented words are not recognized by my Python script.

I've tried:

re.compile('/(?<!\p{L})(vern[áa]culo*)(?!\p{L})/')
re.compile(r'Nombre vern[a\xc3\xa1]culo\.', re.UNICODE)
re.compile ('[A-Z][a-záéíóúñ]+')
\p{Lu}] [\p{Ll}]+ \b

I've read the following threads:

grep/regex can't find accented word
Python Regex strange behavior with accented characters
Python regex and accented Expression
Python: using regex and tokens with accented chars (negative lookbehind)

I also found something that almost works:

In [95]: dd=re.search(r'^\w.*', 'Nombre vernáculo' )
In [96]: dd.group(0)
Out[96]: 'Nombre vern\xc3\xa1culo'

But it also returns all accented words in the text.

Any help with this will be appreciated. Thanks.

Also, what are these regexes supposed to do? The first one has a `\p` in it, which doesn't mean anything in Python strings or in Python regexes. The second one either has UTF-8 bytes crammed into the string as if they were characters (if Python 3), or searches for any one of the bytes `a`, `\xc3`, or `\xa1` (if Python 2), neither of which is very useful. The third one doesn't seem to be even remotely related to the problem you're trying to solve. The fourth one isn't even Python. — abarnert, Jun 14 '18 at 00:24
Is there a reason you have to use Python 2.7? Because making stuff like this easier is a big part of the reason Python 3 exists. — abarnert, Jun 14 '18 at 00:29
Thanks for your reply. I don't use Python 3 because all platforms work with Python 2.7, and I don't know how to do it in Python 3 either. I'm just a beginner. — Angela Chek, Jun 14 '18 at 00:34
I don't know of any platforms that work with Python 2.7 but don't work with Python 3. Meanwhile, if you're a beginner, it's much better to learn the more popular, easier-to-learn language with a future than the old one that's less than a year and a half from final end-of-life. — abarnert, Jun 14 '18 at 00:38
Are you reading the text from a Unicode file? Please show the whole relevant code. Also, are you using it in a Jupyter Notebook in Windows? Try `re.compile(ur'\bvern[áa]culo\b', re.UNICODE)` to find a whole word `vernáculo` or `vernaculo`. `for x in rx.findall(s): print(x)` on my Linux machine shows a valid `vernáculo` result. — Wiktor Stribiżew, Jun 14 '18 at 09:34
@AngelaChek Please use `@` + username in the comment to let the user know of the feedback. — Wiktor Stribiżew, Jun 14 '18 at 09:42

abarnert · Accepted Answer · 2018-06-14T00:41:47.943

The simplest way to do this is the same way you'd do it in Python 3. This means you have to explicitly use unicode objects instead of str objects, including u-prefixed string literals. Ideally, you'd also put an explicit coding declaration at the top of your file so you can write the literals in Unicode as well.

# -*- coding: utf-8 -*-

import re

pattern = re.compile(ur'Nombre vern[aá]culo')
text = u'Nombre vernáculo'
match = pattern.search(text)
print match

Notice that I left off the \. on the end of the pattern. Your text doesn't end in a ., so you shouldn't be looking for one, or it's going to fail.

Of course, if you want to search text that comes from somewhere besides your source code, you'll need to call decode('utf-8') on it, or open the file with io.open or codecs.open instead of plain open, and so on.
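As a rough sketch of that point, text that arrives as UTF-8 bytes can be decoded before searching, and a file can be read through io.open so that it comes back as Unicode. The sample sentence and the temporary file are made up for illustration (they stand in for whatever input you actually have), and the literals use only ASCII escapes so the snippet should run on both Python 2.7 and Python 3:

```python
import io
import os
import re
import tempfile

# \xe1 is the code point of a-acute; the u prefix gives a Unicode pattern on 2.7 too.
pattern = re.compile(u'Nombre vern[a\xe1]culo')

# Case 1: raw UTF-8 bytes, e.g. from a file opened in binary mode.
raw = b'Nombre vern\xc3\xa1culo registrado en la zona de ...'
decoded = raw.decode('utf-8')
match_from_bytes = pattern.search(decoded)

# Case 2: a file on disk. A temporary file stands in for the real (hypothetical) input.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
try:
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(u'Texto previo.\nNombre vern\xe1culo registrado en la zona de ...\n')
    # io.open with an encoding returns Unicode text on both Python versions.
    with io.open(path, encoding='utf-8') as f:
        text = f.read()
finally:
    os.remove(path)

match_from_file = pattern.search(text)
print(match_from_bytes is not None, match_from_file is not None)
```

Decoding once at the boundary keeps every later regex working on characters rather than bytes.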

If you can't use a coding declaration, or can't trust your text editor to handle UTF-8, you can still use Unicode strings, just escape the characters with their Unicode code points:

import re

pattern = re.compile(ur'Nombre vern[a\xe1]culo')
text = u'Nombre vern\xe1culo'
match = pattern.search(text)
print match

If you have to use str, then you do have to manually encode to UTF-8 and escape the individual bytes, as you were trying to do. But now you're not trying to match a single character, but a multi-character sequence, \xc3\xa1. So you can't use a character class. Instead, you have to write it out explicitly as a group with alternation:

pattern = re.compile(r'Nombre vern(?:a|\xc3\xa1)culo')
text = 'Nombre vern\xc3\xa1culo'
match = pattern.search(text)
print match
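To see why the character class falls short on byte strings, the sketch below compares the two patterns side by side. It uses b-prefixed literals so that it behaves the same on Python 2.7 (where b'...' is just str) and on Python 3; the variable names are only illustrative:

```python
import re

# UTF-8 encodes a-acute as the two bytes \xc3\xa1.
text = b'Nombre vern\xc3\xa1culo'

# A character class matches exactly ONE byte: 'a', '\xc3' or '\xa1'.
char_class = re.compile(b'Nombre vern[a\xc3\xa1]culo')
# Alternation can match the single byte 'a' or the two-byte sequence.
alternation = re.compile(b'Nombre vern(?:a|\xc3\xa1)culo')

# No match: two bytes sit where the class allows only one.
class_match = char_class.search(text)
# Matches the accented spelling.
alt_match = alternation.search(text)
# Also matches the unaccented spelling.
plain_match = alternation.search(b'Nombre vernaculo')

print(class_match, alt_match is not None, plain_match is not None)
```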

Thank you. The second option works perfectly. You helped me so much. — Angela Chek, Jun 14 '18 at 17:10

score -1 · Answer 2 · answered Jun 14 '18 at 00:33

import re
r1 = re.compile(r'(Nombre vernáculo)')
x = 'Nombre vernáculo registrado en la zona de'
match = r1.search(x)
print(match.group(1))

with python 2:

/tmp> python2 test.py
  File "test.py", line 5
SyntaxError: Non-ASCII character '\xc3' in file test.py on line 5, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

with python 3:

/tmp> python3 test.py 
Nombre vernáculo

Jonah

How is it helpful to give someone code that raises a `SyntaxError`? – abarnert Jun 14 '18 at 00:42
@Jonah Caplan Thanks, I don't know why, but I didn't get the error you wrote. It also works in my Python version for finding the words. Thanks. – Angela Chek Jun 14 '18 at 17:14
