I have a text file in Spanish containing thousands of words, some of them with accents. I'm using the re module to extract some words, but in the resulting list some words are incomplete.

This is the first part of my code:

projectsinline = open('projectsinline.txt', 'r')

for lines in projectsinline:

    pattern = r'\b[a-zA-Z]{6}\b'
    words = re.findall(pattern, lines)

    print words

This is an example of the output:

['creaci', 'Estado', 'relaci', 'Regula', 'estado', 'comisi', 'delito']

It should be like this:

['creación', 'Estado', 'relación', 'Regula', 'estado', 'comisión', 'delito']

I found this answer: Encode Python list to UTF-8, but it wasn't helpful because my text comes from a text file. This is the code I tried:

import re
import codecs
import sys

sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)

projectsinline = open('projectsinline.txt', 'r')

for lines in projectsinline:

    pattern = ur'\b[a-zA-Z]{6}\b'
    unicode_pattern = re.compile(pattern, re.UNICODE)
    result = unicode_pattern.findall(lines)
    print result

Now the output skips words that have accents.

Any suggestions to solve the problem are appreciated.

Thanks!

estebanpdl
  • What are you trying to do with the `{6}` in your regex pattern? – happydave Mar 02 '16 at 01:27
  • Does `re.compile(r"\w+", re.UNICODE)` work for your case? – univerio Mar 02 '16 at 01:30
  • {6} finds words with 6 letters only – estebanpdl Mar 02 '16 at 01:37
  • I feel like I must be missing something. Why then does your "It should be like this" list include a bunch of entries with more than 6 letters? – happydave Mar 02 '16 at 01:40
  • @univerio if I use `re.compile(r"\w+", re.UNICODE)` it doesn't work either, and I also get alphanumeric tags, which I do not need. – estebanpdl Mar 02 '16 at 01:48
  • @happydave because when re module finds an accent it breaks the word, that's why the output shows incomplete words. For example, it didn't find `[ '...', 'código', '...' ]` which contains 6 letters. – estebanpdl Mar 02 '16 at 01:52
  • So let me get this straight. You want to find words that have six consecutive *non-accented* letters in them, regardless of how many *accented* letters they have? – univerio Mar 02 '16 at 01:58
  • @univerio, I want to find specific words in the text file, regardless of how many _accented_ or _non-accented_ letters they have. – estebanpdl Mar 02 '16 at 02:00
  • @estebanpdl So there's absolutely no reason you put `{6}` in your pattern? – univerio Mar 02 '16 at 02:12
  • @univerio, In a way it makes sense, because I'm actually using this pattern: `{4,20}`. What I want is to skip connectors or words in Spanish similar to _the_, _as_, _or_, _and_, _if_, _is_, etc. – estebanpdl Mar 02 '16 at 02:18

1 Answer

You are picking words with exactly 6 letters by using r'\b[a-zA-Z]{6}\b'. Some of the words in your example have more letters, and those letters get cut off because the accented characters are not in [a-zA-Z] and are not treated as word characters, so a word boundary matches right before them.

I would use \w together with re.UNICODE instead if you want all words with 6 letters.

With re.UNICODE set, \w will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.
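To make the cut-off concrete, here is a small sketch (not part of the original answer) that contrasts the ASCII class on raw UTF-8 bytes with \w on decoded text. The sample word and the variable names are only illustrative, and the snippet is written so that it should run on both Python 2.7 and Python 3.

```python
import re

text = u'creaci\u00f3n'      # "creación" as decoded (unicode) text
raw = text.encode('utf-8')   # the bytes a plain open(..., 'r') yields in Python 2

# On raw bytes the accented letter is two non-ASCII bytes, which are not word
# characters, so \b matches right after "creaci" and the word is truncated.
cut_off = re.findall(br'\b[a-zA-Z]{6}\b', raw)

# On decoded text the ASCII class still cannot match the accented letter, but
# now it counts as a word character, so the whole word is skipped instead.
skipped = re.findall(r'\b[a-zA-Z]{6}\b', text, re.UNICODE)

# With \w and re.UNICODE the accented letter is matched and the word stays whole.
whole = re.findall(r'\b\w{8}\b', text, re.UNICODE)
```

The first two results correspond to the two symptoms described in the question: truncated words when matching bytes, and missing words when matching decoded text with [a-zA-Z].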

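import re
import codecs

# Compile once; re.UNICODE makes \w and \b treat accented letters as word characters.
unicode_pattern = re.compile(r'\b\w{6}\b', re.UNICODE)

# Decode the file as UTF-8 so each line is unicode text rather than raw bytes.
with codecs.open('projectsinline.txt', 'r', encoding="utf-8") as f:
    for line in f:
        for word in unicode_pattern.findall(line):
            print(word)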

Example string:

creación, longstring, lación, Regula, estado, misión

Output:

lación
Regula
estado
misión
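Since \w also matches digits and underscores, a variant that accepts letters only may be closer to what the comments on the question ask for (no alphanumeric tags, and a length range such as {4,20} to skip short connectors). The following is a sketch under those assumptions, not the answer's original code; the class [^\W\d_] means "a word character that is neither a digit nor an underscore", and the sample sentence is made up for illustration.

```python
import re

# Letters only (accented ones included), between 4 and 20 characters long.
letters_only = re.compile(r'\b[^\W\d_]{4,20}\b', re.UNICODE)

# Made-up sample: short connectors, a number, and a tag with digits/underscore.
sample = u'El c\u00f3digo fiscal de 2016 y clave_12 del Estado'

words = letters_only.findall(sample)
```

Here "clave_12" is rejected as a whole because the underscore is a word character, so there is no word boundary after "clave".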
midori
  • Exactly. My bad. The output shows incomplete words, but it shouldn't show them, it should find words like: `[ '...', 'código', '...' ]` for example. – estebanpdl Mar 02 '16 at 01:58
  • Thanks a lot, @minitoto. It works, but the output looks like this: `[...'T\xedtulo', '\xfaltimo', 'C\xf3digo', 'Fiscal', 'emitir', 'Fiscal', 'C\xf3digo'...]` – estebanpdl Mar 02 '16 at 02:05
  • i added some changes, you'll have unicode elements in the list – midori Mar 02 '16 at 02:27
  • Thanks for your help, @minitoto. I tried the _code_ but something is wrong, I got this error: `UnicodeDecodeError: 'utf8' codec can't decode byte 0xd3 in position 22: invalid continuation byte` – estebanpdl Mar 02 '16 at 02:34
  • i would need your input file, or part of it where you're having difficulties in order to help you – midori Mar 02 '16 at 02:35
  • You mean the text file? I upload projectsinline.txt to GitHub: https://github.com/estebanpdl/programaeva/blob/master/files/projectsinline.txt – estebanpdl Mar 02 '16 at 02:46
  • checked your file with last updated code and it works without any errors – midori Mar 02 '16 at 02:49
  • Thanks for your support, @minitoto. I got it. The text file was ANSI-encoded. Now it's ok and it works :) – estebanpdl Mar 02 '16 at 03:02
  • You are welcome. If the answer helped you, you can accept it by clicking v – midori Mar 02 '16 at 03:08
  • It helps me a lot. I did vote, but I'm new here and it says I need 15 reputation. I have 6 for now :( – estebanpdl Mar 02 '16 at 03:18
  • you don't need reputation to accept an answer, there is a v button under the answer's score, just click it – midori Mar 02 '16 at 03:19
  • Sorry :S Thanks, again! – estebanpdl Mar 02 '16 at 03:22
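
The UnicodeDecodeError discussed in the comments above is what UTF-8 decoding raises when the file was actually saved in a different encoding. As a hedged sketch, the helper below (read_text is a hypothetical name, not something from the thread) tries UTF-8 first and falls back to Windows-1252. That fallback is an assumption about what "ANSI" meant on the asker's machine; re-saving the file as UTF-8 works just as well.

```python
import io


def read_text(path, encodings=('utf-8', 'cp1252')):
    """Hypothetical helper: decode a file with the first encoding that works."""
    for encoding in encodings:
        try:
            with io.open(path, 'r', encoding=encoding) as f:
                return f.read()
        except UnicodeDecodeError:
            # This encoding cannot decode the file; try the next candidate.
            continue
    raise ValueError('none of the encodings could decode %s' % path)
```

The returned value is decoded text, so it can be passed straight to a pattern compiled with re.UNICODE.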