I have a text file in Spanish containing thousands of words, some of them with accents. I'm using the re module to extract some words, but in the resulting list some words are truncated.
This is the first part of my code:
import re

projectsinline = open('projectsinline.txt', 'r')
for lines in projectsinline:
    # match whole words made of exactly six ASCII letters
    pattern = r'\b[a-zA-Z]{6}\b'
    words = re.findall(pattern, lines)
    print words
This is an example of the output:
['creaci', 'Estado', 'relaci', 'Regula', 'estado', 'comisi', 'delito']
It should be like this:
['creación', 'Estado', 'relación', 'Regula', 'estado', 'comisión', 'delito']
I found this answer: Encode Python list to UTF-8, but it wasn't helpful because my text comes from a text file, so I couldn't use its code as-is. This is what I tried:
import re
import codecs
import sys

# wrap stdout so unicode strings are written out as UTF-8
sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)

projectsinline = open('projectsinline.txt', 'r')
for lines in projectsinline:
    pattern = ur'\b[a-zA-Z]{6}\b'
    unicode_pattern = re.compile(pattern, re.UNICODE)
    result = unicode_pattern.findall(lines)
    print result
Now the output skips words that have accents.
Any suggestions to solve the problem are appreciated.
Thanks!
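One possible direction, as a minimal sketch rather than a definitive fix: decode the file while reading it, and match Unicode letters instead of `[a-zA-Z]`. This assumes the file is UTF-8 encoded, which the question does not state. The names `LETTER`, `find_words` and `find_words_in_file` are hypothetical helpers introduced only for illustration. The sketch is written for Python 3; `io.open` and `re.UNICODE` also exist on Python 2, where literals containing accents would need a `u` prefix.

```python
import io
import re

# [^\W\d_] reads as "a word character that is neither a digit nor an
# underscore", i.e. a letter. On decoded text this includes accented letters.
LETTER = r'[^\W\d_]'


def find_words(text, min_len=6, max_len=6):
    """Return the whole words in `text` whose length is within the bounds."""
    pattern = re.compile(
        r'\b%s{%d,%d}\b' % (LETTER, min_len, max_len), re.UNICODE)
    return pattern.findall(text)


def find_words_in_file(path, encoding='utf-8', **bounds):
    """Decode the file while reading so every line is Unicode text."""
    words = []
    with io.open(path, 'r', encoding=encoding) as handle:
        for line in handle:
            words.extend(find_words(line, **bounds))
    return words


sample = 'La creación del Estado regula la relación y el delito'
print(find_words(sample))        # exactly six letters: Estado, regula, delito
print(find_words(sample, 6, 8))  # six to eight letters: also creación, relación
```

Note that "creación", "relación" and "comisión" each have eight letters, so a strict `{6}` pattern does not match them once the accented letter counts as part of the word. In the first attempt they appear truncated because `ó` is not in `[a-zA-Z]`, so `\b` finds a boundary right after "creaci". To get them in full, the length bounds have to be widened, as in the second call.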