I have two text files: one containing 25K positive tweets, one tweet per line, and a second one containing 25K negative tweets, also one tweet per line.
How can I use these two text files to create a corpus, in order to classify a new tweet as positive or negative? I want to use the NLTK module for Python.
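To make the goal concrete, this is roughly the structure I want to end up with: a minimal sketch that reads the two flat files directly into labelled documents (the file names positive_tweets.txt and negative_tweets.txt are placeholders for my actual files):

from nltk.tokenize import word_tokenize

# build a list of (token_list, label) pairs, one entry per tweet
documents = []
for path, label in [('positive_tweets.txt', 'pos'), ('negative_tweets.txt', 'neg')]:
    with open(path) as f:
        for line in f:
            documents.append((word_tokenize(line.strip()), label))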
Edit
The difference from Using my own corpus instead of movie_reviews corpus for Classification in NLTK
is that my data consists of two text files: one with 25K positive tweets, one tweet per line, and the second with 25K negative tweets, one tweet per line.
If I use the technique described in the question linked above, it doesn't work for me.
When I run this code:
import string
from nltk.corpus import stopwords
from nltk.corpus import CategorizedPlaintextCorpusReader
import traceback
import sys

try:
    mr = CategorizedPlaintextCorpusReader('C:\Users\gerbuiker\Desktop\Sentiment Analyse\my_movie_reviews', r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
    stop = stopwords.words('english')
    documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
    for doc in documents:
        print doc
except Exception, err:
    print traceback.format_exc()
    # or
    print sys.exc_info()[0]
I get the following error message:
C:\Users\gerbuiker\Anaconda\python.exe "C:/Users/gerbuiker/Desktop/Sentiment Analyse/CORPUS_POS_NEG/haha.py"
Traceback (most recent call last):
  File "C:/Users/gerbuiker/Desktop/Sentiment Analyse/CORPUS_POS_NEG/haha.py", line 17, in <module>
    documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
  File "C:\Users\gerbuiker\AppData\Roaming\Python\Python27\site-packages\nltk\corpus\reader\util.py", line 336, in iterate_from
    assert self._len is not None
AssertionError
<type 'exceptions.AssertionError'>
Does anyone know how to solve this?