Why don't I find the words in their original source list?

Question

I am trying to find Chinese words in two different files, but it didn't work. So I tried searching for the words in the same file I got them from, and it doesn't find them there either. How is that possible?

import codecs
chin_split = codecs.open("CHIN_split.txt","r+",encoding="utf-8")

I used this for the regex code:

import re
for n in re.findall(ur'[\u4e00-\u9fff]+',chin_split.read()):
    print n in re.findall(ur'[\u4e00-\u9fff]+',chin_split.read())

How come I only get False printed?

FYI, I tried this and it works:

for x in [1,2,3,4,5,6,6]:
    print x in [1,2,3,4,5,6,6]

BTW, chin_split contains words in English, Hebrew and Chinese.

Some lines from chin_split.txt:

 he daodan   核导弹     טיל גרעיני     
 hedantou    核弹头     ראש חץ גרעיני      
 helu    阖庐  "ביתו, מעונו 
 helu    阖庐   שם מלך וו בתקופת ה'אביב והסתיו'"      
 huiwu   会晤  להיפגש עם

If you could, switch to Python 3, which has better support for Unicode. — nhahtdh, Aug 25 '12 at 11:56

score 3 · Accepted Answer · answered Aug 25 '12 at 11:58

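You are calling read() on the same file object several times, and that is the problem.

The first chin_split.read() returns the entire content and leaves the file position at the end, so the other calls (inside the loop) just get an empty string. re.findall on an empty string returns an empty list, and `n in []` is always False.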
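The behaviour described above can be reproduced without the original file. This is only a sketch in Python 3 syntax: io.StringIO stands in for the opened CHIN_split.txt, and the sample text is made up for illustration.

```python
import io

# Hypothetical stand-in for the opened file; any file object behaves the same way.
f = io.StringIO("核导弹 会晤")

first = f.read()    # reads everything and moves the position to the end
second = f.read()   # nothing left to read, so this is an empty string

f.seek(0)           # rewind to the start of the file
third = f.read()    # the full content is available again

print(repr(first), repr(second), repr(third))
```

Rewinding with seek(0) before each read would also make the original loop work, but it re-reads the whole file on every iteration.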

That loop makes no sense, but if you want to keep it, save the file content in a variable first.
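A minimal sketch of that suggestion, in Python 3 syntax. The io.StringIO object and its sample lines are assumptions standing in for CHIN_split.txt so the snippet runs on its own; with the real file you would open it and call read() once in the same way.

```python
import io
import re

# Hypothetical sample data in place of CHIN_split.txt.
chin_split = io.StringIO("he daodan 核导弹\nhedantou 核弹头\nhuiwu 会晤\n")

# Same character range as in the question.
han = re.compile("[\u4e00-\u9fff]+")

content = chin_split.read()   # read the file exactly once
words = han.findall(content)  # run the regex once and keep the result

# Each word is now looked up in the saved list instead of a fresh, empty read.
results = [w in words for w in words]
for found in results:
    print(found)
```

If the goal is to compare against a second file, the same pattern applies: read each file once, build the two word lists, then test membership (a set makes the lookups faster for large files).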
