My problem is to classify documents given two training files, good_reviews.txt and bad_reviews.txt. To start, I need to load and label my training data, where every line is a document in itself and corresponds to one review. My main task is then to classify the reviews (lines) in a given test set.
I found a way to load and label the names data as follows:
from nltk.corpus import names
names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])
What I want is something similar that labels lines rather than words.
I expect the code would look something like this, which of course doesn't work since .lines does not exist:
reviews = ([(review, 'good_review') for review in reviews.lines('good_reviews.txt')] +
           [(review, 'bad_review') for review in reviews.lines('bad_reviews.txt')])
I would like to get a result like this:
>>> reviews[0]
('This shampoo is very good blablabla...', 'good_review')
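One possible way to get a list of that shape is sketched below, using plain file reading rather than an NLTK corpus reader. The helper names load_labeled_lines and load_reviews are made up for illustration, and the UTF-8 encoding and the choice to skip blank lines are assumptions about the data, not something the files are known to require.

```python
from pathlib import Path


def load_labeled_lines(path, label, encoding="utf-8"):
    """Return (line, label) pairs, one per non-empty line of the file.

    Hypothetical helper: each line is treated as one document (review).
    """
    text = Path(path).read_text(encoding=encoding)
    # Strip surrounding whitespace and drop blank lines (an assumption).
    return [(line.strip(), label) for line in text.splitlines() if line.strip()]


def load_reviews(good_path="good_reviews.txt", bad_path="bad_reviews.txt"):
    """Combine both files into one labeled list, good reviews first."""
    return (load_labeled_lines(good_path, "good_review") +
            load_labeled_lines(bad_path, "bad_review"))
```

With both files in the working directory, reviews = load_reviews() would give a list whose first element is the first line of good_reviews.txt paired with 'good_review'.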