ValueError: too many values to unpack (NLTK classifier)

Question

I'm doing classification analysis using NLTK's Naive Bayes classifier. I load a TSV file containing records and labels.

But the classifier fails to train because of an error. Here's my Python code:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

dataset = pd.read_csv('tweets.txt', delimiter ='\t', quoting = 3)

dataset.isnull().any()

dataset = dataset.fillna(method='ffill')

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
corpus = []
for i in range(0,16004):
    tweet = re.sub('[^a-zA-Z]', ' ', dataset['tweet'][i])
    tweet = tweet.lower()
    tweet = tweet.split()
    ps = PorterStemmer()
    tweet = [ps.stem(word) for word in tweet if not word in 
    set(stopwords.words('english'))]
    tweet = ' '.join(tweet)
    corpus.append(tweet)

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 10000)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values




from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, 
random_state = 0)
train_set, test_set = X_train[500:], y_train[:500]

classifier = nltk.NaiveBayesClassifier.train(train_set)

The error is:

File "C:\Users\HSR\Anaconda2\lib\site-packages\nltk\classify\naivebayes.py", line 194, in train
for featureset, label in labeled_featuresets:

ValueError: too many values to unpack

It's a variable of the function NaiveBayesClassifier.train() — mustang, Mar 06 '18 at 04:33

Vivek Kumar · Accepted Answer · 2018-03-15T10:21:39.007


NLTK's NaiveBayesClassifier doesn't work like scikit-learn estimators. It requires X and y together in a single list, which is then passed to train().

But in your code you are supplying only X_train, so train() tries to unpack a (featureset, label) pair from each row of it, hence the error.
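To see why the unpacking fails, here is a minimal sketch that mimics the `for featureset, label in ...` loop from the traceback. It does not call NLTK; the helper `count_labels` and the toy data are made up for illustration.

```python
def count_labels(labeled_featuresets):
    """Mimic the unpacking loop from the traceback (not NLTK's actual code)."""
    counts = {}
    for featureset, label in labeled_featuresets:
        counts[label] = counts.get(label, 0) + 1
    return counts

# Toy stand-in for X_train: each row holds three word counts and no label.
rows_only = [[0, 1, 2], [1, 0, 0]]

try:
    count_labels(rows_only)
    error_message = None
except ValueError as exc:
    # Each row has three values, but the loop expects exactly two.
    error_message = str(exc)

# Pairing each row (as a dict) with its label gives the shape the loop expects.
labels = ["pos", "neg"]
pairs = [
    ({"w0": row[0], "w1": row[1], "w2": row[2]}, label)
    for row, label in zip(rows_only, labels)
]
label_counts = count_labels(pairs)
```

The first call fails because every row has more than two elements; the second succeeds because every element is a (feature dict, label) pair.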

The NaiveBayesClassifier requires the input to be a list of tuples: each element of the list is one training sample, and each tuple holds that sample's feature dictionary and its label. Something like:

X = [({feature1:'val11', feature2:'val12' .... }, class1),
     ({feature1:'val21', feature2:'val22' .... }, class2), 
     ...
     ...                                                  ]

You need to convert your input to this format:

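# Vocabulary terms, in the same order as the columns of X.
# (scikit-learn 1.0+ provides cv.get_feature_names_out(); get_feature_names() was removed in 1.2.)
feature_names = cv.get_feature_names()
train_set = []
for i, single_sample in enumerate(X):
    # Map each feature name to its count in this sample.
    single_feature_dict = {}
    for j, single_feature in enumerate(single_sample):
        single_feature_dict[feature_names[j]] = single_feature
    # NLTK expects (feature_dict, label) pairs.
    train_set.append((single_feature_dict, y[i]))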

Note: The above for loop can be shortened by using a dict comprehension, but I'm not that fluent there.
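For what it's worth, a dict-comprehension version might look like the sketch below. The function name `to_labeled_featuresets` and the toy data are assumptions for illustration; in the real code, `X`, `y` and `feature_names` would come from the CountVectorizer step.

```python
def to_labeled_featuresets(X, y, feature_names):
    """Pair each row of X, as a {feature_name: value} dict, with its label."""
    return [
        ({name: value for name, value in zip(feature_names, row)}, label)
        for row, label in zip(X, y)
    ]

# Toy data standing in for the vectorizer output and the labels.
toy_feature_names = ["good", "bad", "movie"]
toy_X = [[1, 0, 2], [0, 3, 1]]
toy_y = ["pos", "neg"]

toy_train_set = to_labeled_featuresets(toy_X, toy_y, toy_feature_names)
```

It builds the same list of (feature dict, label) tuples as the nested loop, just in one expression.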

Then you can do this:

nltk.NaiveBayesClassifier.train(train_set)

edited Mar 15 '18 at 10:21

answered Mar 05 '18 at 12:16


Thanks! I tried it, but I get a MemoryError. My dataset is too large. Is there any way to apply the loop in small increments instead of applying y[i] to the whole dataset in one go? – mustang Mar 06 '18 at 04:49
@SriReka y[i] should be a single value, the class label of that individual sample, so my code is already working incrementally, one sample at a time. Your system is holding X and y in memory, so the only way a MemoryError can occur here is when the data is duplicated from X into the train_set tuples. This can be avoided by first writing the tuple format to a file and then reading it back in a new program. That way X and train_set are never in memory at the same time. – Vivek Kumar Mar 06 '18 at 04:56
Thank you so much! I have successfully trained the classifier without any memory errors. But I have a doubt: in the last line of your code, train_set.append((single_dict, y[i])), single_dict is undefined. I tried it with single_feature_dict. Is that right? – mustang Mar 15 '18 at 10:08
@SriReka Yes, that was a typo. Fixed now. Thanks. – Vivek Kumar Mar 15 '18 at 10:20
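
The file-based approach suggested in the comments could look roughly like the sketch below. The comment does not name a file format; JSON Lines is an assumption here, and the helper names, file name and toy data are hypothetical.

```python
import json
import os
import tempfile

def write_featuresets(path, X, y, feature_names):
    """Write one JSON line per sample, in the form [feature_dict, label]."""
    with open(path, "w") as fh:
        for row, label in zip(X, y):
            # With NumPy data, convert each value with int() first,
            # because json cannot serialize NumPy integer types.
            feature_dict = dict(zip(feature_names, row))
            fh.write(json.dumps([feature_dict, label]) + "\n")

def read_featuresets(path):
    """Read the file back into a list of (feature_dict, label) tuples."""
    with open(path) as fh:
        return [tuple(json.loads(line)) for line in fh]

# Toy data; in practice the writer and the reader would run as separate programs.
toy_feature_names = ["good", "bad"]
toy_X = [[1, 0], [0, 2]]
toy_y = ["pos", "neg"]

toy_path = os.path.join(tempfile.mkdtemp(), "train_set.jsonl")
write_featuresets(toy_path, toy_X, toy_y, toy_feature_names)
restored = read_featuresets(toy_path)
```

The writer only ever holds one sample's dictionary at a time, and the reading program never needs to load X at all.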
