Train corpus of Tweets for Sentiment Analysis, using NLTK for Python

Question

I'm trying to train a classifier on my own corpus for sentiment analysis, using NLTK for Python. I have two text files: one contains 25K positive tweets, one per line, and the other contains 25K negative tweets in the same format.

I am following method 2 from this Stack Overflow article.

When I run this code to create corpora:

import string
from itertools import chain

from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier as nbc
from nltk.corpus import CategorizedPlaintextCorpusReader
import nltk

mydir = 'C:\Users\gerbuiker\Desktop\Sentiment Analyse\my_movie_reviews'

mr = CategorizedPlaintextCorpusReader(mydir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
stop = stopwords.words('english')
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]

word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = word_features.keys()[:100]

numtrain = int(len(documents) * 90 / 100)
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag  in documents[numtrain:]]

classifier = nbc.train(train_set)
print nltk.classify.accuracy(classifier, test_set)
classifier.show_most_informative_features(5)

I receive this error message:

C:\Users\gerbuiker\Anaconda\python.exe "C:/Users/gerbuiker/Desktop/Sentiment Analyse/CORPUS_POS_NEG/CreateCorpus.py"
Traceback (most recent call last):
  File "C:/Users/gerbuiker/Desktop/Sentiment Analyse/CORPUS_POS_NEG/CreateCorpus.py", line 23, in <module>
    documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
  File "C:\Users\gerbuiker\AppData\Roaming\Python\Python27\site-packages\nltk\corpus\reader\util.py", line 336, in iterate_from
    assert self._len is not None
AssertionError

Process finished with exit code 1

Does anyone know how to fix this?

What is the structure of your directory/folder? Can you run `dir C:/Users/gerbuiker/Desktop/Sentiment Analyse`? What is the output of the dir command? — alvas, Apr 22 '15 at 14:09
The folder C:\Users\gerbuiker\Desktop\Sentiment Analyse\my_movie_reviews contains a file 'README.txt' and two subfolders: 'pos', which contains pos.txt (25K lines of positive tweets), and 'neg', which contains neg.txt (25K lines of negative tweets). So it is the same setup as the original movie_reviews folder, except that each of my two subfolders contains one large text file, whereas the original movie_reviews contains a large number of smaller text files with reviews. — mvh, Apr 22 '15 at 14:36
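With the layout described in this comment, the list comprehension in the question builds one entry per file rather than one per tweet, because it iterates over mr.fileids(). A minimal sketch of building one document per line without the corpus reader might look like the following. The helper name load_line_documents, the whitespace tokenization, and the UTF-8 encoding are assumptions made for illustration and are not part of the original code.

```python
import io
import os


def load_line_documents(root, labels=('pos', 'neg')):
    """Build (tokens, label) pairs, one per non-empty line.

    Assumes each label has a single file at <root>/<label>/<label>.txt,
    as described in the comment above.
    """
    documents = []
    for label in labels:
        path = os.path.join(root, label, label + '.txt')
        # Encoding is an assumption; adjust it to match the actual files.
        with io.open(path, encoding='utf-8') as handle:
            for line in handle:
                tokens = line.split()  # naive whitespace tokenization
                if tokens:  # skip blank lines
                    documents.append((tokens, label))
    return documents
```

Each returned pair has the same (tokens, tag) shape that the question's train_set and test_set comprehensions expect, so the list could in principle be shuffled and split 90/10 in the same way.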

score 1 · Answer 1 · answered Apr 22 '15 at 20:22

I'm not 100% positive, as I'm not on a Windows machine to test this at the moment, but I think what may be tripping you up is the difference in path slash direction between @alvas's original example and your adaptation to Windows.

Specifically, you use: 'C:\Users\gerbuiker\Desktop\Sentiment Analyse\my_movie_reviews' while his example uses '/home/alvas/my_movie_reviews'. For the most part this is fine, but you attempt to re-use his cat_pattern regex: r'(neg|pos)/.*' which will match the slash in his paths but reject the one in yours.
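To illustrate the point about the slash, the sketch below applies the same cat_pattern with Python's re module to two made-up file ids. Whether NLTK actually passes a backslash-separated id to the pattern on Windows is an assumption here, not something confirmed. The example ids and the helper category_of are hypothetical.

```python
import re

# The category pattern reused from the original example.
cat_pattern = r'(neg|pos)/.*'


def category_of(fileid):
    """Return the category captured by cat_pattern, or None if there is no match."""
    match = re.match(cat_pattern, fileid)
    return match.group(1) if match else None


# Hypothetical file ids: forward-slash style and Windows backslash style.
print(category_of('pos/pos.txt'))     # pos
print(category_of('pos\\pos.txt'))    # None

# Normalizing the separator first makes the pattern match again.
print(category_of('pos\\pos.txt'.replace('\\', '/')))  # pos
```

If the ids do turn out to contain backslashes, a pattern such as r'(neg|pos)[/\\].*' would accept either separator.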

Since I can't test it at the moment, it's also possible that NLTK normalizes the paths somewhere, which would render the difference moot, but hopefully this is the issue. — abathur, Apr 22 '15 at 20:23
