I am trying to train a Naive Bayes classifier on my own training data, which has been manually classified into positive and negative tweets.
I have found plenty of code that trains on the movie_reviews corpus or a similar dataset, but none for the case where there are only two files, one negative and one positive.
Example code I found:
import string
from nltk.corpus import LazyCorpusLoader, CategorizedPlaintextCorpusReader
from nltk.corpus import stopwords
my_movie_reviews = LazyCorpusLoader('my_movie_reviews', CategorizedPlaintextCorpusReader, r'(?!\.).*\.txt',
                                    cat_pattern=r'(neg|pos)/.*', encoding='ascii')
mr = my_movie_reviews
stop = stopwords.words('english')
documents = [([w for w in mr.words(i) if w.lower() not in stop and
               w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
for i in documents:
    print(i)
My problem is with the one-line list comprehension that builds documents. I don't need to deal with file IDs in my program, since I have only one file in each category. How can I edit that statement?
My corpus:
nltk.data/corpora/my_corpus/negative/negative_tweets.txt - category 1
nltk.data/corpora/my_corpus/positive/positive_tweets.txt - category 2
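To make the goal concrete, here is a minimal sketch of the kind of result being asked for: one labelled document per tweet rather than one per file. It assumes each line of a file holds exactly one tweet, which the question does not state. The names `load_documents`, `word_features`, and the small `STOP` set are hypothetical and only for illustration; `stopwords.words('english')` could be passed in instead.

```python
import string

# Hypothetical stop word set for illustration only; NLTK's
# stopwords.words('english') could be passed in instead.
STOP = {"a", "an", "the", "is", "to", "and"}


def load_documents(path, label, stop=STOP):
    """Return one (tokens, label) pair per non-empty line of the file.

    Assumes one tweet per line (an assumption, not stated in the question).
    """
    documents = []
    with open(path) as f:
        for line in f:
            # Lowercase, split on whitespace, and trim surrounding punctuation.
            tokens = [w.strip(string.punctuation) for w in line.lower().split()]
            # Drop empty tokens and stop words.
            tokens = [w for w in tokens if w and w not in stop]
            if tokens:
                documents.append((tokens, label))
    return documents


def word_features(tokens):
    """Bag-of-words feature dict: every token present maps to True."""
    return {w: True for w in tokens}
```

With the two files from the corpus above, something like `documents = load_documents(neg_path, 'neg') + load_documents(pos_path, 'pos')` followed by `featuresets = [(word_features(t), label) for t, label in documents]` would give a list of `(dict, label)` pairs, which is the input shape that `nltk.NaiveBayesClassifier.train` accepts. Reading the files directly avoids file IDs altogether, at the cost of not using the corpus reader's tokenizer.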