I am trying to use the Naive Bayes algorithm to do sentiment analysis and have been going through a few articles. As mentioned in almost every article, I need to train the Naive Bayes algorithm with some pre-labelled sentiment data.
Now, I have a piece of code using the movie_reviews corpus provided with NLTK. The code is:
import nltk
import random
from nltk.corpus import movie_reviews

# build one (word list, category) pair per review file
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

# frequency distribution over every word in the corpus, lower-cased
all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())
all_words = nltk.FreqDist(all_words)

# use the 3000 most frequent words as features
# (FreqDist.keys() is not ordered by frequency, so most_common() is used here)
word_features = [w for (w, _count) in all_words.most_common(3000)]

def find_features(document):
    # one boolean feature per feature word: does the review contain it?
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features

featuresets = [(find_features(rev), category) for (rev, category) in documents]
training_set = featuresets[:1900]   # first 1900 reviews for training
testing_set = featuresets[1900:]    # remaining 100 reviews for testing

classifier = nltk.NaiveBayesClassifier.train(training_set)
print("Classifier accuracy percent:", (nltk.classify.accuracy(classifier, testing_set)) * 100)
So, in the above code I have a training_set and a testing_set. I checked the movie_reviews corpus, and inside it there are many small text files, each containing one review.
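For example, this minimal check (assuming the corpus was downloaded with nltk.download('movie_reviews')) shows how the corpus directory layout maps to categories and file ids:

from nltk.corpus import movie_reviews

# the categories are simply the subdirectory names
print(movie_reviews.categories())        # ['neg', 'pos']
# each fileid is a path relative to the corpus root, e.g. 'pos/cv000_29590.txt'
print(movie_reviews.fileids('pos')[:3])
# words() tokenizes the requested file(s) on demand
first_pos = movie_reviews.fileids('pos')[0]
print(movie_reviews.words(first_pos)[:10])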
- So, my question is: here we had the movie_reviews corpus, and we imported it, trained, and tested using it, but how can I do the same when I am using an external training data set and an external testing data set? (See the sketch below these questions.)
- Also, how does NLTK parse the movie_reviews directory, which has so many text files inside it? I will be using http://ai.stanford.edu/~amaas/data/sentiment/ as my training data set, so I need to understand how this is done.
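From the corpus layout above, my understanding is that NLTK's categorized corpus readers infer the category from each file's path, so my current guess for loading the Stanford data is something like the following minimal sketch. It assumes the archive is extracted to a local aclImdb/ directory and that the (pos|neg) regexes match its train/pos and train/neg subfolders; I have not verified this:

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# assumption: the Stanford archive is unpacked to ./aclImdb, which
# contains train/pos, train/neg, train/unsup, test/pos and test/neg
train_corpus = CategorizedPlaintextCorpusReader(
    root='aclImdb/train',
    fileids=r'(pos|neg)/.*\.txt',   # only the labelled reviews, skipping train/unsup
    cat_pattern=r'(pos|neg)/')      # category = the subdirectory name

documents = [(list(train_corpus.words(fileid)), category)
             for category in train_corpus.categories()
             for fileid in train_corpus.fileids(category)]

If this works the way I think it does, train_corpus.categories() should return ['neg', 'pos'] just like movie_reviews.categories(), and the rest of the pipeline (find_features, NaiveBayesClassifier.train) would stay the same, with a second reader rooted at aclImdb/test replacing the featuresets[1900:] split. Is this the right approach?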