
I'm writing a script to read from a corpus file and find suffixes. Since there are Persian words in the corpus, it is UTF-8 encoded. However, when I search using Persian suffixes I get no results; English searches, on the other hand, come back fine.

from __future__ import unicode_literals
import nltk
import sys


for line in open("corpus.txt"):
    for word in line.split():
        if word.endswith('ب'):
            print(word)
Andrew Ravus
  • What do you mean by *I get no results*? – Mazdak May 07 '15 at 15:15
  • And what's your Python version? It seems that you are on Python 3, but I need to be sure! – Mazdak May 07 '15 at 15:22
  • I'm using Python 3.4, and I actually get no results in the shell, as if there are no words in the corpus. @Kasra – Andrew Ravus May 07 '15 at 15:34
  • In Python 3 you don't need `from __future__ import unicode_literals`, and your code will work fine! But do you have any word in your file that ends with `ب`? – Mazdak May 07 '15 at 15:41
  • I had imported `from __future__ import unicode_literals` but it didn't work, and I do have words ending in "ب". Anyway, opening the file as UTF-8 with `with open("corpus.txt", encoding="utf-8") as fp:` worked for me. – Andrew Ravus May 07 '15 at 16:13
  • check [this](http://stackoverflow.com/q/39528462/5284370) out. – Soorena Sep 18 '16 at 13:19

1 Answer


In Python 3, you can just pass `encoding="utf-8"` to `open`:

with open("corpus.txt", encoding="utf-8") as fp:
    for line in fp:
        for word in line.split():
            process(word)

In Python 2, you'll need to do something like this:

import codecs
with codecs.open("corpus.txt", encoding="utf-8") as fp:
    for line in fp:
        for word in line.split():
            process(word)
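Putting this together with the suffix search from the question, a minimal end-to-end sketch might look like the following. The sample corpus lines and the scratch `corpus.txt` written to the working directory are assumptions for illustration; `io.open` is used so the same code runs on both Python 2 and 3.

```python
# -*- coding: utf-8 -*-
import io

# Hypothetical sample corpus; 'کتاب' and 'آب' end with the suffix 'ب'.
corpus = u"کتاب روی میز است\nآب سرد بود\nhello world\n"

# Write the sample file as UTF-8 for the demo.
with io.open("corpus.txt", "w", encoding="utf-8") as fp:
    fp.write(corpus)

# Re-open it with encoding="utf-8", mirroring the fix above: the file now
# decodes into text (str/unicode) objects, so endswith() can match
# a Persian suffix instead of silently failing on raw bytes.
matches = []
with io.open("corpus.txt", encoding="utf-8") as fp:
    for line in fp:
        for word in line.split():
            if word.endswith(u'ب'):
                matches.append(word)

print(matches)  # the two words ending in 'ب'
```

Without the `encoding` argument, Python 2 would hand back byte strings, and comparing the last byte of a multi-byte UTF-8 sequence against a unicode suffix never matches, which is exactly the "no results" symptom in the question.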
Benjamin Peterson