NLTK remove stop words from CSV

Question

Though this is a common question, I couldn't find a solution that works for my case. I have comma-separated data like the following.

['my scientific','data']['is comma-separated','frequency']

I'm trying to remove stop words using

from nltk.corpus import stopwords
stopword = stopwords.words('english')
mynewtext = [w for w in transposed if w not in stopword]
out_file.writerow(w)

But it gives me an error saying 'UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal'. I'm not sure where I'm making a mistake. I want my output in a CSV file to look like this:

scientific,data
comma-separated,frequency

Also, I want it to work for both upper and lower case. casefold doesn't work in my Python version, 2.7.

score 3 · Answer 1 · edited May 23 '17 at 11:44

I think you are comparing a str object to a unicode object in the above code.

I suggest taking a look at the linked question, Python unicode equal comparison failed.

>>> s1 = u'Hello'
>>> s2 = unicode("Hello")
>>> type(s1), type(s2)
(<type 'unicode'>, <type 'unicode'>)
>>> s1==s2
True
>>> 
>>> s3='Hello'.decode('utf-8')
>>> type(s3)
<type 'unicode'>
>>> s1==s3
True
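
Applied to the data in the question, one way to avoid the mixed str/unicode comparison is to decode every cell and every stop word to unicode before comparing, and to lowercase both sides so that case is ignored. The following is a minimal sketch under stated assumptions, not a definitive fix: the helper names to_text and strip_stopwords are hypothetical, and SAMPLE_STOPWORDS is a small stand-in list so that the example runs without NLTK. In practice you would presumably pass stopwords.words('english') instead.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

# Stand-in stop word list (assumption, used only for illustration).
# With NLTK installed you would pass stopwords.words('english') instead.
SAMPLE_STOPWORDS = ["my", "is", "the", "a"]


def to_text(value, encoding="utf-8"):
    """Return value as a unicode string, decoding byte strings if needed."""
    if isinstance(value, bytes):  # on Python 2, bytes is the plain str type
        return value.decode(encoding)
    return value


def strip_stopwords(rows, stop_words):
    """Remove stop words from every cell of every row, ignoring case."""
    stop_set = set(to_text(w).lower() for w in stop_words)
    cleaned = []
    for row in rows:
        new_row = []
        for cell in row:
            words = to_text(cell).split()
            # Compare lowercased unicode against lowercased unicode.
            kept = [w for w in words if w.lower() not in stop_set]
            new_row.append(" ".join(kept))
        cleaned.append(new_row)
    return cleaned


rows = [["my scientific", "data"], ["is comma-separated", "frequency"]]
cleaned = strip_stopwords(rows, SAMPLE_STOPWORDS)
```

Each row in cleaned can then be passed to writerow. On Python 2.7 the csv module expects byte strings, so the cells would need to be encoded first, for example with cell.encode('utf-8').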

answered Nov 21 '14 at 19:31

Ganesh Pandey

Thanks for the response. I'm not sure if the way I'm doing it is right, my data is in the variable 'transposed', so according to your answer, I used unicode(transposed) and kept the rest same. Now my output csv file is split into individual letters. – abn Nov 21 '14 at 19:38
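
A likely explanation for the individual letters, judging only from the description in this comment: calling unicode() on the list itself converts the whole list into one string, and iterating over a string yields one character at a time. Converting each element separately keeps the cells intact. This is a small sketch, with text_type as a hypothetical alias so that the same lines run on Python 2 (where it is unicode) and Python 3 (where it is str).

```python
# text_type is unicode on Python 2 and str on Python 3 (illustrative alias).
text_type = type(u"")

transposed = ["my scientific", "data"]

# Converting the list as a whole produces ONE string holding its repr.
as_one_string = text_type(transposed)

# Iterating over that string yields single characters, which is what
# ends up in the CSV when such a string is handed to writerow.
chars = [c for c in as_one_string]

# Converting element by element keeps one string per cell.
per_cell = [text_type(cell) for cell in transposed]
```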

score 2 · Accepted Answer · answered Nov 24 '14 at 20:22

Try

# -*- coding: utf-8 -*-

at the top of your source file (it must be on the first or second line).

It tells Python that the source file you've saved is encoded as UTF-8. The default for Python 2 is ASCII (for Python 3 it's UTF-8). This only affects how the interpreter reads the characters in the source file itself.
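
To illustrate that scope, here is a short sketch (the variable names are made up for the example). The declaration lets a non-ASCII unicode literal be read correctly from the source file, but bytes that arrive from elsewhere, such as a CSV file, still have to be decoded explicitly before they compare equal to it.

```python
# -*- coding: utf-8 -*-
# The declaration above governs only how this source file is decoded.

literal = u"café"      # read from the source using the declared encoding
raw = b"caf\xc3\xa9"   # UTF-8 bytes, as they might arrive from a CSV file

# Comparing undecoded bytes with unicode is False on Python 2 and 3 alike;
# Python 2 additionally emits the UnicodeWarning quoted in the question.
mismatch = (raw == literal)

# After an explicit decode, both sides are unicode and compare equal.
decoded = raw.decode("utf-8")
same = (decoded == literal)
```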
