
I have two text files: one contains 25K positive tweets, one per line, and the other contains 25K negative tweets, also one per line.

How can I use these two text files to create a corpus, in order to classify a new tweet as positive or negative? I want to use the NLTK module for Python.

Edit

The difference from "Using my own corpus instead of movie_reviews corpus for Classification in NLTK" is that my data consists of two text files: one with 25K positive tweets, one per line, and a second with 25K negative tweets in the same format.

If I use the techniques mentioned in the link above, it doesn't work for me.

When I run this code:

import string
from nltk.corpus import stopwords
from nltk.corpus import CategorizedPlaintextCorpusReader
import traceback
import sys

try:
    mr = CategorizedPlaintextCorpusReader('C:\Users\gerbuiker\Desktop\Sentiment Analyse\my_movie_reviews', r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
    stop = stopwords.words('english')
    documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]

    for doc in documents:
        print doc
except Exception, err:
    print traceback.format_exc()
    #or
    print sys.exc_info()[0]

I receive error message:

C:\Users\gerbuiker\Anaconda\python.exe "C:/Users/gerbuiker/Desktop/Sentiment Analyse/CORPUS_POS_NEG/haha.py"
    Traceback (most recent call last):
      File "C:/Users/gerbuiker/Desktop/Sentiment Analyse/CORPUS_POS_NEG/haha.py", line 17, in <module>
        documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
      File "C:\Users\gerbuiker\AppData\Roaming\Python\Python27\site-packages\nltk\corpus\reader\util.py", line 336, in iterate_from
        assert self._len is not None
    AssertionError

    <type 'exceptions.AssertionError'>

Does anyone know how to solve this?

mvh
  • There are a lot of tutorials on using the Natural Language Toolkit. What have you tried? Which parts are you having trouble with? Have you Googled for [similar problems and questions](http://stackoverflow.com/questions/4951751/creating-a-new-corpus-with-nltk), and how are these not helping? – LinkBerest Apr 21 '15 at 14:54
  • I can't really find out how to train a corpus for which one of the text files is used for the POS class & the other text file for the NEG class – mvh Apr 21 '15 at 15:38
  • Also check out: http://stackoverflow.com/questions/29275614/using-my-own-corpus-instead-of-movie-reviews-corpus-for-classification-in-nltk/29281180#29281180 . Hint: use `CategorizedPlaintextCorpusReader` – alvas Apr 21 '15 at 15:56
  • Edited my question. Even if I follow the steps in alvas' link, I don't get it working. – mvh Apr 22 '15 at 13:48
  • Print out `documents` and check what it contains (the error means an empty set was encountered). Also try printing a [full traceback](http://stackoverflow.com/questions/3702675/print-the-full-traceback-in-python-without-halting-the-program) to see more clearly what is causing the problem. – LinkBerest Apr 22 '15 at 14:42
  • @JGreenwell Did what you suggested, see my edit. – mvh Apr 22 '15 at 14:53
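
For the layout described in the question (one tweet per line, one file per class), the labelled training data can also be built without a corpus reader. The following is only a minimal sketch under stated assumptions, not a fix for the AssertionError above: the function names and the file names `pos_tweets.txt` and `neg_tweets.txt` are hypothetical, the files are assumed to be UTF-8 encoded, and tokenisation is a plain whitespace split rather than an NLTK tokenizer.

```python
import io
import string


def load_labeled_tweets(path, label, encoding="utf-8"):
    """Read one tweet per line and pair every non-empty line with `label`.

    The UTF-8 default is an assumption; use whatever encoding the files have.
    """
    with io.open(path, encoding=encoding) as handle:
        return [(line.strip(), label) for line in handle if line.strip()]


def bag_of_words(text, stopwords=frozenset()):
    """Turn a tweet into a {token: True} feature dict.

    Tokens are lower-cased, stripped of leading and trailing punctuation,
    and dropped if they end up empty or are listed in `stopwords`.
    """
    tokens = (tok.strip(string.punctuation).lower() for tok in text.split())
    return {tok: True for tok in tokens if tok and tok not in stopwords}


def build_featuresets(pos_path, neg_path, stopwords=frozenset()):
    """Return a list of (feature dict, label) pairs covering both files."""
    tweets = (load_labeled_tweets(pos_path, "pos")
              + load_labeled_tweets(neg_path, "neg"))
    return [(bag_of_words(text, stopwords), label) for text, label in tweets]


# Hypothetical usage (file names are placeholders, not from the question):
# featuresets = build_featuresets("pos_tweets.txt", "neg_tweets.txt")
```

A list of `(featureset, label)` pairs in this shape is what `nltk.NaiveBayesClassifier.train` accepts. A new tweet would go through the same `bag_of_words` function before being passed to the trained classifier's `classify` method, and the result of `stopwords.words('english')` can be supplied as the `stopwords` argument.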
