
As in similar questions, I run this code:

with open(fin,'r') as inFile, open(fout,'w') as outFile:
  for line in inFile:
     line = line.replace('."</documents', '"').replace('. ', ' ')
     print(' '.join([word for word in line.lower().split() if len(word) >=3 and word not in stopwords.words('english')]), file = outFile)

and I get the following warning:

UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  print(' '.join([word for word in line.lower().split() if len(word) >=3 and word not in stopwords.words('english')]), file = outFile)

How can I solve this?

bass

1 Answer


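The test word not in stopwords.words('english') compares word for equality against each entry in the list. Python 2 issues this warning when it compares a Unicode value with a byte string that it cannot decode as ASCII, so either word or at least one of the values in stopwords.words('english') is not a Unicode value.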

Since you are reading from a file, the most likely candidate here is word; decode it, or use a file object that decodes data as it is being read:

print(' '.join([word for word in line.lower().split()
                if len(word) >=3 and
                   word.decode('utf8') not in stopwords.words('english')]),
      file = outFile)

or

import io

with io.open(fin,'r', encoding='utf8') as inFile,\
        io.open(fout,'w', encoding='utf8') as outFile:

where the io.open() function gives you a file object in text mode that encodes or decodes as required.
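For illustration, here is a minimal sketch of the whole loop rewritten around io.open(), written to run on both Python 2 and Python 3. The filter_file helper, the file names and the tiny STOPWORDS set are assumptions made up for this example; the set stands in for stopwords.words('english') so the sketch runs without the NLTK corpus.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

import io
import os
import shutil
import tempfile

# Hypothetical stand-in for set(stopwords.words('english')), so the sketch
# does not need the NLTK corpus. Building the set once also avoids calling
# stopwords.words() for every word.
STOPWORDS = {'the', 'and', 'was'}


def filter_file(fin, fout, stopwords=STOPWORDS):
    """Copy fin to fout, dropping stopwords and words under 3 characters."""
    with io.open(fin, 'r', encoding='utf8') as in_file, \
            io.open(fout, 'w', encoding='utf8') as out_file:
        for line in in_file:
            # line is already Unicode text here, so len() counts characters
            # and the membership test compares text with text.
            words = [word for word in line.lower().split()
                     if len(word) >= 3 and word not in stopwords]
            out_file.write(' '.join(words) + '\n')


# Usage on a throwaway file containing non-ASCII words.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, 'in.txt')
dst = os.path.join(tmp_dir, 'out.txt')
with io.open(src, 'w', encoding='utf8') as f:
    f.write('The caf\u00e9 was n\u00e9 and open\n')
filter_file(src, dst)
with io.open(dst, 'r', encoding='utf8') as f:
    result = f.read()
shutil.rmtree(tmp_dir)
print(repr(result))
```

The two-character word in the sample line is dropped because its length is measured in characters; read as raw bytes it would be three bytes long and pass the len(word) >= 3 test.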

The latter is less error-prone. For example, you test the length of word, but what you are really testing there is the number of bytes. Any word containing characters outside of the ASCII codepoint range will result in more than one UTF-8 byte per character, so len(word) is not the same thing as len(word.decode('utf8')).
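A small sketch of that difference, using a made-up sample word; it behaves the same on Python 2 and Python 3:

```python
# u'n\xe9' is a two-character word whose second character is non-ASCII.
word = u'n\xe9'.encode('utf8')        # UTF-8 byte string, as read from a file in byte mode

byte_len = len(word)                  # counts UTF-8 bytes
char_len = len(word.decode('utf8'))   # counts characters

print(byte_len, char_len)
print(byte_len >= 3, char_len >= 3)   # the length filter gives different answers
```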

Martijn Pieters