I'm trying to understand why adding stemming and stop-word removal leads to worse results in my naive Bayes classifier.

I have two files, positive and negative reviews. Each has around 200 lines, but the lines are long, possibly 5,000 words per line.

I have the following code that creates a bag of words. From it I build two feature sets for training and testing, then run them through the NLTK classifier:

# take 15,000 words from the frequency distribution as features
word_features = list(all_words.keys())[:15000]

# split the feature sets into training and testing portions
training_set = featuresets[:10000]
testing_set = featuresets[10000:]

nbclassifier = nltk.NaiveBayesClassifier.train(training_set)
print((nltk.classify.accuracy(nbclassifier, testing_set)) * 100)

nbclassifier.show_most_informative_features(30)

This produces around 45000 words and has an accuracy of 85%.
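For context, all_words and featuresets in the snippet above are built the usual bag-of-words way, roughly like this (a simplified sketch; pos_text, neg_text and documents stand in for my file-reading code):

import nltk
from nltk.tokenize import word_tokenize

# frequency distribution over every lowercased token in both files
all_words = nltk.FreqDist(w.lower() for w in word_tokenize(pos_text + neg_text))

def find_features(document):
    # one boolean feature per word in word_features
    words = set(word_tokenize(document))
    return {w: (w in words) for w in word_features}

# pair each review's features with its 'pos'/'neg' label
featuresets = [(find_features(text), label) for (text, label) in documents]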

I've looked at adding stemming (PorterStemmer) and removing stop words from my training data, but when I run the classifier again I now get only 205 words and 0% accuracy, and while testing other classifiers the script raises this error:

Traceback (most recent call last):
  File "foo.py", line 108, in <module>
    print((nltk.classify.accuracy(MNB_classifier, testing_set))*100)
  File "/Library/Python/2.7/site-packages/nltk/classify/util.py", line 87, in accuracy
    results = classifier.classify_many([fs for (fs, l) in gold])
  File "/Library/Python/2.7/site-packages/nltk/classify/scikitlearn.py", line 83, in classify_many
    X = self._vectorizer.transform(featuresets)
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/dict_vectorizer.py", line 293, in transform
    return self._transform(X, fitting=False)
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/dict_vectorizer.py", line 184, in _transform
    raise ValueError("Sample sequence X is empty.")
ValueError: Sample sequence X is empty.

I don't understand why adding stemming and/or removing stop words breaks the classifier.
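For reference, the filtering I added looks roughly like this (simplified; pos_text and neg_text again stand in for the raw file contents):

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()
stop_words = set(stopwords.words('english'))

# stem each token and drop stop words before building the frequency distribution
filtered_words = [ps.stem(w.lower())
                  for w in word_tokenize(pos_text + neg_text)
                  if w.lower() not in stop_words]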

user2075215
  • That sounds like a rather extreme difference, and it's hard to tell if there is a bug or if it's working as it should. But in general, stemming and stop-word removal do not guarantee (or even tend to imply) better performance. – juanpa.arrivillaga Dec 14 '16 at 20:57
  • Removing stopwords and word endings takes you from 45000 words to just 205 words? No way. Examine the filtered text data to figure out what is going wrong with your filtering. – alexis Dec 14 '16 at 23:28

1 Answer

Adding stemming or removing stop words shouldn't, by itself, cause your issue. I think the problem is further up in your code, in how you read the files. When I was following sentdex's tutorial on YouTube, I came across this same error. I was stuck on it for an hour, but I finally figured it out. If you follow his code you get this:

import nltk
from nltk.tokenize import word_tokenize

# read both review files in full
short_pos = open("short_reviews/positive.txt", "r").read()
short_neg = open("short_reviews/negative.txt", "r").read()

# one (review, label) pair per line
documents = []

for r in short_pos.split('\n'):
    documents.append((r, 'pos'))

for r in short_neg.split('\n'):
    documents.append((r, 'neg'))

# collect every token, lowercased
all_words = []

short_pos_words = word_tokenize(short_pos)
short_neg_words = word_tokenize(short_neg)

for w in short_pos_words:
    all_words.append(w.lower())

for w in short_neg_words:
    all_words.append(w.lower())

# frequency distribution over all tokens; take 5,000 words as features
all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())[:5000]

I kept running into this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 6056: invalid start byte. You get this error because there are non-UTF-8 characters in the files provided. I was able to get around it by changing the code to this:

fname = 'short_reviews/positive.txt'
pos_lines = []
with open(fname, 'r', encoding='utf-16') as f:
    for line in f:
        pos_lines.append(line)

Unfortunately, then I started getting this error: UnicodeError: UTF-16 stream does not start with BOM

I forget how, but I made this error go away too. Then I started getting the same error as in your original question: ValueError: Sample sequence X is empty. When I printed the length of featuresets, I saw it was only 2. Since there is one feature set per document, that meant each review file was being read in as a single "line" instead of one review per line.

print("Feature sets list length : ", len(featuresets))

After digging on this site, I found these two questions:

  1. Delete every non utf-8 symbols froms string
  2. 'str' object has no attribute 'decode' in Python3

The first one didn't really help, but the second one solved my problem (note: I'm using Python 3).

I'm not usually one for one-liners, but this worked for me:

pos_lines = [line.rstrip('\n') for line in open('short_reviews/positive.txt', 'r', encoding='ISO-8859-1')]
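The negative file needs the same treatment, after which documents can be rebuilt line by line as before (a sketch):

neg_lines = [line.rstrip('\n') for line in open('short_reviews/negative.txt', 'r', encoding='ISO-8859-1')]

# one (review, label) pair per line, as in the original code
documents = [(r, 'pos') for r in pos_lines] + [(r, 'neg') for r in neg_lines]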

I will update my GitHub repo later this week with the full code for the tutorial if you'd like to see the complete solution. I realize this answer probably comes two years too late, but hopefully it helps.