
I'm writing a script to read from a corpus file and find suffixes. Since there are Persian words in the corpus, it is UTF-8 encoded. However, when I search using Persian suffixes I get no results; English searches, on the other hand, come back fine.

from __future__ import unicode_literals
import nltk
import sys


for line in open("corpus.txt"):
    for word in line.split():
        if word.endswith('ب'):
            print(word)
Andrew Ravus
  • What do you mean by *I get no results*? – Mazdak May 07 '15 at 15:15
  • And what's your Python version? It seems that you are on Python 3, but I need to be sure! – Mazdak May 07 '15 at 15:22
  • I'm using Python 3.4, and I actually get no results in the shell, as if there are no words in the corpus. @Kasra – Andrew Ravus May 07 '15 at 15:34
  • In Python 3 you don't need `from __future__ import unicode_literals`, and your code will work fine! But do you have any word in your file that ends with `ب`? – Mazdak May 07 '15 at 15:41
  • I had imported `from __future__ import unicode_literals` but it didn't work, and I do have words ending in "ب". Anyway, opening the file as UTF-8 with `with open("corpus.txt", encoding="utf-8") as fp:` worked for me. – Andrew Ravus May 07 '15 at 16:13
  • check [this](http://stackoverflow.com/q/39528462/5284370) out. – Soorena Sep 18 '16 at 13:19

1 Answer


In Python 3, you can just pass `encoding="utf-8"` to `open`:

with open("corpus.txt", encoding="utf-8") as fp:
    for line in fp:
        for word in line.split():
            process(word)

In Python 2, you'll need to do something like this:

import codecs
with codecs.open("corpus.txt", encoding="utf-8") as fp:
    for line in fp:
        for word in line.split():
            process(word)
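Putting this together with the suffix search from the question, a minimal end-to-end sketch might look like the following. The sample corpus lines and the scratch `corpus.txt` written to the working directory are assumptions for illustration; `io.open` is used so the same code runs on both Python 2 and 3.

```python
# -*- coding: utf-8 -*-
import io

# Hypothetical sample corpus; 'کتاب' and 'آب' end with the suffix 'ب'.
corpus = u"کتاب روی میز است\nآب سرد بود\nhello world\n"

# Write the sample file as UTF-8 for the demo.
with io.open("corpus.txt", "w", encoding="utf-8") as fp:
    fp.write(corpus)

# Re-open it with encoding="utf-8", mirroring the fix above: the file now
# decodes into text (str/unicode) objects, so endswith() can match
# a Persian suffix instead of silently failing on raw bytes.
matches = []
with io.open("corpus.txt", encoding="utf-8") as fp:
    for line in fp:
        for word in line.split():
            if word.endswith(u'ب'):
                matches.append(word)

print(matches)  # the two words ending in 'ب'
```

Without the `encoding` argument, Python 2 would hand back byte strings, and comparing the last byte of a multi-byte UTF-8 sequence against a unicode suffix never matches, which is exactly the "no results" symptom in the question.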
Benjamin Peterson