
I am implementing a Naive Bayes classifier with NLTK, but when I train the classifier with the extracted features it gives the error "too many values to unpack". I am just a beginner in Python. Here is the code. The program reads text from files and extracts features from those files.

import nltk.classify.util,os,sys;
from nltk.classify import NaiveBayesClassifier;
from nltk.corpus import stopwords;
from nltk.tokenize  import word_tokenize,RegexpTokenizer;
import re;
TAG_RE = re.compile(r'<[^>]+>')
def remove_tags(text):
 return TAG_RE.sub('', text)

def word_feats(words):
 return dict([(word,True) for word in words])

def feature_extractor(sentiment):
 path = "train/"+sentiment+"/"
 files = os.listdir(path);
 feats = {};
 i = 0;
 for file in files:
    f = open(path+file,"r", encoding='utf-8');
    review = f.read();
    review = remove_tags(review);
    stopWords = (stopwords.words("english"))
    tokenizer = RegexpTokenizer(r"\w+");
    tokens = tokenizer.tokenize(review);    
    features = word_feats(tokens);
    feats.update(features)
 return feats;

posative_feat = feature_extractor("pos");
p = open("posFeat.txt","w", encoding='utf-8');
p.write(str(posative_feat));  
negative_feat = feature_extractor("neg");
n = open("negFeat.txt","w", encoding='utf-8');
n.write(str(negative_feat));
plength = int(len(posative_feat)*3/4);
nlength = int(len(negative_feat)*3/4)
totalLength = plength+nlength;
trainFeatList = {}
testFeatList  = {}
i = 0
for items in posative_feat.items():
 i +=1;
 value = {items[0]:items[1]}
 if(i<plength):
    trainFeatList.update(value);
 else:  
    testFeatList.update(value);     

j = 0
for items in negative_feat.items():
  j +=1;
  value = {items[0]:items[1]}
  if(j<plength):
    trainFeatList.update(value);
  else:
    testFeatList.update(value);
  classifier = NaiveBayesClassifier.train(trainFeatList)
  print(nltk.classify.util.accuracy(classifier,testFeatList));
  classifier.show_most_informative_features();
Asad Raza
    Possible duplicate of [NLTK accuracy: "ValueError: too many values to unpack"](http://stackoverflow.com/questions/31920199/nltk-accuracy-valueerror-too-many-values-to-unpack) – Pierre Jan 04 '17 at 16:51

1 Answer


Looking at the NLTK book page http://www.nltk.org/book/ch06.html, it seems the data given to the NaiveBayesClassifier should be of type list(tuple(dict,str)), whereas the data you are passing to the classifier is of type list(dict).

If you represent the data in that manner, you will get the expected behaviour. Basically, it is a list of (feature dict, label) tuples.
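For example, a minimal training set in that shape (the words and labels here are purely illustrative) would look like:

```python
# Each training example is a (feature dict, label) tuple;
# NaiveBayesClassifier.train expects a list of these.
train_data = [
    ({"good": True, "great": True}, "pos"),
    ({"boring": True, "bad": True}, "neg"),
]
```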

There are multiple errors in your code:

  1. Python does not use semicolons as line endings
  2. The True boolean does not seem to serve a purpose on line 12
  3. trainFeatList and testFeatList should be lists
  4. each value in your feature list should be a tuple(dict,str)
  5. assign a label to each feature dict in the list (as in 4)
  6. take NaiveBayesClassifier.train, and any other use of classifier, out of the negative-features loop

If you fix the previous errors, the classifier will run, but without knowing what you are trying to achieve I cannot say whether it will predict well.

The main line you need to pay attention to is where you assign to your variable value.

for example:

value = {items[0]:items[1]}

should be something like:

value = ({feature_name:feature}, label)

Then call .append() on your lists to add each value, instead of .update().
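Applied to your positive-feature loop, that change would look roughly like this (sketched with stand-in data, since I don't have your files):

```python
# Stand-in for the dict returned by your feature_extractor("pos").
posative_feat = {"good": True, "great": True, "fun": True, "nice": True}
plength = int(len(posative_feat) * 3 / 4)

trainFeatList = []  # lists, not dicts
testFeatList = []

for i, (word, present) in enumerate(posative_feat.items(), start=1):
    value = ({word: present}, "pos")  # (feature dict, label) tuple
    if i < plength:
        trainFeatList.append(value)   # append, not update
    else:
        testFeatList.append(value)
```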

You can look at an example of your updated code, in a buggy but working state, at http://pastebin.com/91Zu59Cm but I would suggest thinking about the following:

  • How is the data supposed to be represented for the NaiveBayesClassifier class?
  • What features are you trying to capture?
  • What labels are associated with those features?
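One common way to answer those questions (a sketch, under the assumption that each review file becomes one feature dict, labelled by the folder it came from) is:

```python
def word_feats(words):
    # Presence features: each word in the review maps to True.
    return {word: True for word in words}

# Hypothetical tokenized reviews standing in for the files on disk.
pos_reviews = [["good", "fun", "great"], ["nice", "story"]]
neg_reviews = [["bad", "boring"], ["awful", "plot"]]

# One (features, label) pair per review, not per word.
labeled = ([(word_feats(r), "pos") for r in pos_reviews] +
           [(word_feats(r), "neg") for r in neg_reviews])
```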
Nathan McCoy
  • Thanks for your detail answer; I am correcting all errors. I have positive and negative text files and extracting word features from these files and label each word feature according to positive or negative text file. – Asad Raza Jan 04 '17 at 18:26
  • you may want to think about the features associated with each label, and potentially making multiple features – Nathan McCoy Jan 04 '17 at 18:30
  • I want to extract word features from positive file and then want to label each word as positive same like negative file. – Asad Raza Jan 04 '17 at 18:35
  • I think you are confusing labels and features. Labels are the class assigned to a data sample. Features describe data used to train a classifier. – Nathan McCoy Jan 05 '17 at 12:50
  • I have clear concept of class labels and features. I am giving class label to word features which are extracted from positive files and same for negative files. I have positive and negative sentiment movie reviews files. – Asad Raza Jan 05 '17 at 14:00
  • then what are the labels for the feature set in each data instance? – Nathan McCoy Jan 05 '17 at 14:01
  • I am giving TRUE like ("good",True) in word_feat function. Yes i think here is mistake. Please let me correct i should do this like ("good":"Pos") ? – Asad Raza Jan 05 '17 at 14:09
  • I don't think this is a feature, it is a label. Please explain what are you features for each label, and then I can help. – Nathan McCoy Jan 05 '17 at 14:10
  • I am reading positive and negative sentiment text files and label each word as positive if it extracted from positive text file and negative if it is extracted from negative file. – Asad Raza Jan 05 '17 at 14:15
  • in your code, you are not doing that, you are giving a feature "pos" to a word, and never make a label, hence the data structure is wrong. so what are your labels vs. features? – Nathan McCoy Jan 05 '17 at 19:56
  • "Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as 'Teachers' ". This is line from positive review text file. Please let me know how can i create feature vector with positive label from this line. – Asad Raza Jan 06 '17 at 11:29
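For that example line, one possible sketch (the regex mirrors the RegexpTokenizer(r"\w+") pattern from the question's code; the "pos" label is assumed to come from the review's source folder) is:

```python
import re

line = ("Bromwell High is a cartoon comedy. It ran at the same time as "
        "some other programs about school life, such as 'Teachers'")

# Same effect as RegexpTokenizer(r"\w+") used in the question's code.
tokens = re.findall(r"\w+", line.lower())

# The whole review becomes ONE feature dict, paired with its label.
sample = ({word: True for word in tokens}, "pos")
```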