The words in "wordslist" and the text I'm searching are in Cyrillic. The text is encoded in UTF-8 (as set in Notepad++). I need Python to match a word in the text and return everything after the word up to a full stop followed by a newline.

EDIT

with open('C:\....txt', 'rb') as f:
    wordslist = []
    for line in f:
        wordslist.append(line) 

wordslist = map(str.strip, wordslist)

/EDIT

for i in wordslist:
    print i #so far, so good, I get Cyrillic
    wantedtext = re.findall(i+".*\.\r\n", open('C:\....txt', 'rb').read())
    wantedtext = str(wantedtext)
    print wantedtext

"Wantedtext" shows and saves as "\xd0\xb2" (etc.).

What I tried:

The question "Convert bytes to a python string" is different, because no variable is involved there. Also, the solution from the accepted answer

wantedtext.decode('utf-8')

didn't work, the result was the same. The solution from here didn't help either.

EDIT: Revised code, returning "[]".

with io.open('C:....txt', 'r', encoding='utf-8') as f:
    wordslist = f.read().splitlines() 

for i in wordslist:
    print i
    with io.open('C:....txt', 'r', encoding='utf-8') as my_file:
        my_file_test = my_file.read()
        print my_file_test #works, prints cyrillic characters, but...


        wantedtext = re.findall(i+".*\.\r\n", my_file_test)
        wantedtext = str(wantedtext)

        print wantedtext #returns []

(Added after a comment below: This code works if you erase \r from the regular expression.)

Tag
1 Answer

Python 2.x only

Your find is probably not working because you're mixing byte strs and Unicode strs, or strs containing different encodings. If you don't know the difference between a Unicode str and a str, see: https://stackoverflow.com/a/35444608/1554386
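To make the distinction concrete, here is a minimal sketch (not taken from the question's files; the single Cyrillic letter is a made-up sample). It shows that one character occupies two bytes in UTF-8, and that calling str() on a list prints the repr of each element, so escapes such as \xd0\xb2 show up even when the underlying data is intact:

```python
from __future__ import print_function

# Sample data (an assumption): the Cyrillic letter "ve", U+0432.
raw = u"\u0432".encode("utf-8")   # byte string holding two UTF-8 bytes
text = raw.decode("utf-8")        # Unicode string holding one character

print(len(raw), len(text))

# str() of a list uses repr() on each element, so the bytes are
# displayed as backslash escapes rather than as a Cyrillic letter.
listed = str([raw])
print(listed)
```

Printing the individual matches, rather than the stringified list, avoids the escaped display.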

Don't start decoding stuff unless you know what you're doing. It's not voodoo :)

You need to get all your text into Unicode objects first.

  1. Put the file read on its own line; it's easier to follow.
  2. Decode your text file. Use io.open(), which supports Python 3-style decoding. I'm going to assume your text file is UTF-8 (we'll soon find out if it's not):

    with io.open('C:\....txt', 'r', encoding='utf-8') as my_file:
        my_file_test = my_file.read()
    

    my_file_test is now a Unicode str

  3. Now you can do:

    # finds lines beginning with i and ending in "."
    # (.*? is a non-greedy match for the rest of the line)
    regex = u'^{i}.*?\.$'.format(i=i)
    wantedtext = re.findall(regex, my_file_test, re.M)
    
  4. Look at wordslist. You don't say what you do with it, but you need to make sure it contains Unicode strs too. If you read it from a file, use the same io.open as above.
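As a rough illustration of step 3, the sketch below runs a multiline, non-greedy pattern of the form ^word.*?\.$ over an in-memory Unicode string. The sample sentences and the search word are invented for the example, and passing the word through re.escape is an extra precaution that is not part of the steps above:

```python
from __future__ import print_function
import re

# Invented sample text: three lines, two of which start with the word
# and end in a full stop. Real text would come from io.open(...).read().
my_file_test = (
    u"\u0434\u043e\u043c \u0441\u0442\u043e\u0438\u0442.\n"   # "dom stoit."
    u"\u043b\u0435\u0441 \u0448\u0443\u043c\u0438\u0442\n"    # "les shumit" (no full stop)
    u"\u0434\u043e\u043c \u043f\u0443\u0441\u0442.\n"         # "dom pust."
)
i = u"\u0434\u043e\u043c"                                      # "dom"

# ^ and $ match at line boundaries because of re.M;
# .*? consumes the rest of the line up to the final full stop.
regex = u"^{i}.*?\\.$".format(i=re.escape(i))
wantedtext = re.findall(regex, my_file_test, re.M)

# Print each match on its own instead of the stringified list.
for match in wantedtext:
    print(match)
```

Only the first and third lines are returned, since the second neither starts with the word nor ends in a full stop.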

Edit:

For wordslist, you can decode and read the file into a list while removing line feeds in one go:

with io.open('C:\....txt', 'r', encoding='utf-8') as f:
    wordslist = f.read().splitlines() 
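Putting the pieces together, here is a hedged end-to-end sketch. The two temporary files are hypothetical stand-ins for the elided C:\....txt paths, and their contents are made up. The files are written with Windows line endings on purpose: io.open in text mode translates \r\n to \n on read, which is why a pattern containing \r\n finds nothing while a pattern using $ with re.M does.

```python
from __future__ import print_function
import io
import os
import re
import shutil
import tempfile

# Hypothetical stand-ins for the word list file and the text file.
tmpdir = tempfile.mkdtemp()
words_path = os.path.join(tmpdir, "words.txt")
text_path = os.path.join(tmpdir, "text.txt")

# Write UTF-8 bytes with \r\n line endings, as a Windows editor might.
with io.open(words_path, "wb") as f:
    f.write(u"\u0434\u043e\u043c\r\n\u043b\u0435\u0441\r\n".encode("utf-8"))
with io.open(text_path, "wb") as f:
    f.write(
        u"\u0434\u043e\u043c \u0441\u0442\u043e\u0438\u0442.\r\n"
        u"\u043b\u0435\u0441 \u0448\u0443\u043c\u0438\u0442\r\n".encode("utf-8")
    )

# Read both files as Unicode text.
with io.open(words_path, "r", encoding="utf-8") as f:
    wordslist = f.read().splitlines()
with io.open(text_path, "r", encoding="utf-8") as f:
    my_file_test = f.read()

# Text mode has already turned \r\n into \n, so no \r is left to match.
has_cr = u"\r" in my_file_test

# Search once per word; the text file is read only once, outside the loop.
results = {}
for word in wordslist:
    pattern = u"^{0}.*?\\.$".format(re.escape(word))
    results[word] = re.findall(pattern, my_file_test, re.M)

shutil.rmtree(tmpdir)  # remove the temporary files
```

Reading the text file once before the loop, rather than reopening it for every word, is a design choice of this sketch and not something the answer prescribes.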
Alastair McCormack
  • I get NameError: name 'io' is not defined. Do I need a package to use io.open? – Tag Feb 13 '17 at 18:17
  • yes, it's often implied on Stack Overflow when you see an unqualified package. `import io` – Alastair McCormack Feb 13 '17 at 18:22
  • The program is running, it prints a word from a list, and then no result - []. – Tag Feb 13 '17 at 18:46
  • sorry, I screwed the `wordlist` logic during my edit. Please see update with `io.open()`. If it still doesn't work, then you'll have to provide more information – Alastair McCormack Feb 13 '17 at 18:50
  • It looks like you're trying to find all lines which start with a word and end with `.`. Right? Because you're now using a file reader that supports universal newlines, you need to use proper regex line markers – Alastair McCormack Feb 13 '17 at 20:18
  • Good stuff. Check out my latest changes though - It uses the `re.M` flag for multiline and a non-greedy search – Alastair McCormack Feb 13 '17 at 20:35