I am trying to train a Naive Bayes classifier on my own training data: tweets that I have manually labeled as positive or negative.

I have found plenty of code that trains on the movie_reviews corpus or a similar dataset, but none for the case where there are only two files, one negative and one positive.

Example code I found:

    import string
    from nltk.corpus import LazyCorpusLoader, CategorizedPlaintextCorpusReader
    from nltk.corpus import stopwords
    # Lazily load a categorized corpus; the category comes from the directory name.
    my_movie_reviews = LazyCorpusLoader(
        'my_movie_reviews', CategorizedPlaintextCorpusReader,
        r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
    mr = my_movie_reviews
    stop = stopwords.words('english')
    # Build one (filtered words, category) pair per file id.
    documents = [([w for w in mr.words(i)
                   if w.lower() not in stop and w.lower() not in string.punctuation],
                  i.split('/')[0])
                 for i in mr.fileids()]
    for i in documents:
        print(i)

My problem is with the one-line list comprehension that builds `documents`. I don't need to deal with file ids in my program, since I have only one file in each category. How can I change that statement?

My corpus:

- nltk.data/corpora/my_corpus/negative/negative_tweets.txt (category 1)
- nltk.data/corpora/my_corpus/positive/positive_tweets.txt (category 2)

  • I don't think you need to use the nltk corpus reader for this task. Simply read the files and create your labels. – alvas Mar 31 '18 at 10:53
  • Also, take a look at https://stackoverflow.com/questions/29275614/using-my-own-corpus-instead-of-movie-reviews-corpus-for-classification-in-nltk – alvas Mar 31 '18 at 14:41
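
The suggestion in the first comment, reading the two files directly and attaching the labels yourself, could look roughly like the sketch below. It rests on several assumptions that are not stated in the question: each file holds one tweet per line, plain whitespace splitting is good enough as a tokenizer, and a bag-of-words dict is the desired feature format. The helper name `load_labeled_tweets`, the sample tweets, and the tiny stop-word set are all hypothetical and exist only to make the example runnable without the real corpus.

```python
import string
import tempfile
from pathlib import Path


def load_labeled_tweets(path, label, stop_words=frozenset()):
    """Return one (features, label) pair per non-empty line of `path`.

    Assumes one tweet per line (an assumption, not stated in the question).
    """
    labeled = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            # Naive tokenization: split on whitespace, trim punctuation, lowercase.
            tokens = [t.strip(string.punctuation).lower() for t in line.split()]
            tokens = [t for t in tokens if t and t not in stop_words]
            if tokens:
                # Bag-of-words features: each remaining token maps to True.
                labeled.append(({t: True for t in tokens}, label))
    return labeled


# Demo with throwaway files standing in for negative_tweets.txt and
# positive_tweets.txt; the tweets below are made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    neg_path = Path(tmp) / "negative_tweets.txt"
    pos_path = Path(tmp) / "positive_tweets.txt"
    neg_path.write_text("I hate waiting in line.\nWorst service ever!\n",
                        encoding="utf-8")
    pos_path.write_text("I love this phone!\nGreat day with friends.\n",
                        encoding="utf-8")

    # Tiny stand-in for stopwords.words('english').
    stop = {"i", "in", "this", "with"}
    train_set = (load_labeled_tweets(neg_path, "negative", stop)
                 + load_labeled_tweets(pos_path, "positive", stop))

for features, label in train_set:
    print(label, sorted(features))
```

With this approach there is no file id to handle at all: the label comes from which file was read, not from the path. Assuming NLTK is installed, a list shaped like `train_set` (pairs of feature dict and label) is the input that `nltk.NaiveBayesClassifier.train` expects.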
