I am new to Python and NLTK. I have a file called reviews.csv that contains comments extracted from Amazon. I have tokenized the contents of this CSV file and written the tokens to a file called csvfile.csv. Here's the code:

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer
import csv  # comma-separated values
from nltk.corpus import stopwords
ps = PorterStemmer()
stop_words = set(stopwords.words("english"))
with open ('reviews.csv') as csvfile:
    readCSV = csv.reader(csvfile,delimiter='.')    
    for lines in readCSV:
        word1 = word_tokenize(str(lines))
        print(word1)
    with open('csvfile.csv','a') as file:
        for word in word1:
            file.write(word)
            file.write('\n')
    with open ('csvfile.csv') as csvfile:
        readCSV1 = csv.reader(csvfile)
    for w in readCSV1:
        if w not in stopwords:
            print(w)

I am trying to perform stemming on csvfile.csv, but I get this error:

  Traceback (most recent call last):
    File "/home/aarushi/test.py", line 25, in <module>
      if w not in stopwords:
  TypeError: argument of type 'WordListCorpusReader' is not iterable
OneCricketeer
Aarushi Aiyyar
  • Typo... Use `stop_words`, not `stopwords` – OneCricketeer Oct 28 '17 at 05:42
  • Also, is there really a need to write a file, only to read it back and print it? – OneCricketeer Oct 28 '17 at 05:43
  • 1. I wrote stop_words instead of stopwords. Now I have another error: TypeError: unhashable type: 'list'. 2. I wanted the word_tokenized file. That's why I did that. – Aarushi Aiyyar Oct 28 '17 at 07:10
  • 1. Each Stack Overflow question should be about one problem. When you move on to the next problem, ask a new question. 2. How could anyone guess where your new error came from? You haven't posted the code. (But don't edit the question or post it in a comment: ask a new question if you are still stuck.) My guess is you're trying to create a set from the wrong kind of data... – alexis Oct 28 '17 at 12:41
  • You should also make a minimal, complete, and verifiable example... Your code errors on the first line that isn't importing or declaring a class from a third-party library. In other words, you seem not to be testing your code each time you add functionality. You ask about the stemming, but the error is on the stop words. See here for what you're trying to do at the start: https://stackoverflow.com/a/19133088/2308683 – OneCricketeer Oct 28 '17 at 18:48

1 Answer

When you did

from nltk.corpus import stopwords

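the name stopwords refers to a corpus reader object in NLTK (a WordListCorpusReader, according to your traceback), not to a collection of words. That object does not support the in operator, which is why the membership test fails.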

The actual collection of stopwords you're looking for (here, a set of strings) is created when you do:

stop_words = set(stopwords.words("english"))
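
As a rough illustration of the difference, here is a minimal sketch that runs without NLTK or its data. WordListReader is a hypothetical stand-in written for this example, not NLTK's actual class, and its three words are made up; the only assumption carried over from the traceback is that the reader object does not support the in operator.

```python
class WordListReader:
    """Hypothetical stand-in for NLTK's corpus reader: it exposes words()
    but defines none of __contains__, __iter__, or __getitem__."""

    def words(self, fileid):
        # A tiny hard-coded sample, not NLTK's real English stopword list.
        return ["the", "is", "a"]


stopwords = WordListReader()

try:
    "the" in stopwords          # membership test on the reader itself
except TypeError as err:
    reader_error = err          # same kind of error as in the traceback

# Calling words() and wrapping the result in a set gives a real container.
stop_words = set(stopwords.words("english"))
print("the" in stop_words)      # True
print("review" in stop_words)   # False
```

Python's in operator needs the right-hand object to be a container or an iterable. The reader is neither, while the set built from words() is.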

So when checking whether a word in your list of tokens is a stopword, test against stop_words instead:

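from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
tokenized_sent = word_tokenize("This is an example review.")  # any list of tokens
for w in tokenized_sent:
    # NLTK's English stopwords are lowercase, so compare in lowercase.
    if w.lower() not in stop_words:
        pass # Do something.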

To avoid confusing the two names, I usually call the actual set of stopwords stoplist:

from nltk.corpus import stopwords
stoplist = set(stopwords.words("english"))
# tokenized_sent is a list of token strings, e.g. the output of word_tokenize().
for w in tokenized_sent:
    if w not in stoplist:
        pass # Do something.
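
The follow-up error mentioned in the comments (TypeError: unhashable type: 'list') was never posted with code, so its cause is a guess. One plausible source is that csv.reader yields each row as a list of strings, and a list cannot be looked up in a set. A minimal sketch of that situation, using only the standard library: the four-word stoplist and the in-memory file are stand-ins chosen for this example, not the real NLTK list or the real csvfile.csv.

```python
import csv
import io

# Stand-in for set(stopwords.words("english")); a tiny hypothetical sample.
stoplist = {"the", "is", "a", "this"}

# Stand-in for csvfile.csv: one token per line, as the question's code writes it.
token_file = io.StringIO("This\nis\na\ngreat\nproduct\n")

kept = []
for row in csv.reader(token_file):
    # `row` is a list such as ['This']; testing `row not in stoplist`
    # would raise TypeError: unhashable type: 'list'.
    for token in row:
        if token.lower() not in stoplist:
            kept.append(token)

print(kept)  # ['great', 'product']
```

With NLTK installed, stoplist would come from stopwords.words("english") as above, and each kept token could then be passed to PorterStemmer().stem() for the stemming step the question asks about.
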
alvas