python UnicodeWarning: Unicode equal comparison. How to solve this error?

Question

As in here and here, I run this code:

with open(fin,'r') as inFile, open(fout,'w') as outFile:
  for line in inFile:
     line = line.replace('."</documents', '"').replace('. ', ' ')
     print(' '.join([word for word in line.lower().split() if len(word) >=3 and word not in stopwords.words('english')]), file = outFile)

and I get the following warning:

UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  print(' '.join([word for word in line.lower().split() if len(word) >=3 and word not in stopwords.words('english')]), file = outFile)

How can I solve this?

Martijn Pieters · Accepted Answer · 2015-01-19T11:59:21.560

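word not in stopwords.words('english') compares word for equality against each entry in the list. Either word or at least one of the values in stopwords.words('english') is not a Unicode value; it is a byte string containing non-ASCII bytes, so Python 2 cannot implicitly decode it for the comparison.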

Since you are reading from a file, the most likely candidate here is word; decode it, or use a file object that decodes data as it is being read:

print(' '.join([word for word in line.lower().split()
                if len(word) >=3 and
                   word.decode('utf8') not in stopwords.words('english')]),
      file = outFile)

or

import io

with io.open(fin,'r', encoding='utf8') as inFile,\
        io.open(fout,'w', encoding='utf8') as outFile:

where the io.open() function gives you a file object in text mode that encodes or decodes as required.
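As a rough end-to-end sketch of the second approach: the snippet below reads and writes through io.open() so every word is already Unicode when it is compared. The STOPWORDS set and the filter_line helper are hypothetical stand-ins of mine (in the real code you would build the set once from stopwords.words('english')), the temporary files are only there to make the example runnable, and the replace() calls from the question are left out for brevity.

```python
import io
import os
import tempfile

# Hypothetical stand-in for set(stopwords.words('english')).
STOPWORDS = {u'the', u'and', u'was'}

def filter_line(line, stopwords=STOPWORDS):
    """Keep lowercase words of 3+ characters that are not stopwords."""
    words = line.lower().split()
    return u' '.join(w for w in words if len(w) >= 3 and w not in stopwords)

# Temporary input/output paths, for illustration only.
tmpdir = tempfile.mkdtemp()
fin = os.path.join(tmpdir, 'in.txt')
fout = os.path.join(tmpdir, 'out.txt')

with io.open(fin, 'w', encoding='utf8') as f:
    f.write(u'The caf\xe9 was na\xefve and quiet\n')

# Both files are opened in text mode: reads yield Unicode, writes accept Unicode.
with io.open(fin, 'r', encoding='utf8') as inFile, \
        io.open(fout, 'w', encoding='utf8') as outFile:
    for line in inFile:
        outFile.write(filter_line(line) + u'\n')

with io.open(fout, 'r', encoding='utf8') as f:
    result = f.read()
```

Building the stopword collection once, as a set, also avoids calling stopwords.words('english') for every word, which the original list comprehension does.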

The latter is less error-prone. For example, you test the length of word, but what you are really testing there is the number of bytes. Any word containing characters outside of the ASCII codepoint range will result in more than one UTF-8 byte per character, so len(word) is not the same thing as len(word.decode('utf8')).
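To make the bytes-versus-characters point concrete, here is a small sketch; the sample word is my own, not taken from the question. A two-character word with one accented letter encodes to three UTF-8 bytes, so a byte-based len(word) >= 3 check would keep it even though it is shorter than three characters.

```python
word_text = u'o\xf9'                    # two characters (French "où")
word_bytes = word_text.encode('utf8')   # what a byte-mode file read would give you

byte_len = len(word_bytes)                  # counts UTF-8 bytes
char_len = len(word_bytes.decode('utf8'))   # counts characters

passes_on_bytes = byte_len >= 3
passes_on_text = char_len >= 3
```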

thank you @martijn-pieters, `word.decode('utf8')` works well! — bass, Jan 19 '15 at 11:57
@user275832: I'd use the second method; deal directly with Unicode values, not with the UTF-8 bytes. — Martijn Pieters, Jan 19 '15 at 12:04
