I am trying to perform sentiment analysis on Uber reviews. I used a Naive Bayes classifier from sklearn, trained on review data from Kaggle. The test data is in an xlsx sheet, which I loaded into a pandas DataFrame:
import pandas as pd
test = pd.read_excel("uber.xlsx", sep="\t", encoding="ISO-8859-1")
test.head(3)
Since the column came back with dtype object, I converted it to a list like this:
test_text = []
for comments in comments_t:
    test_text.append(comments)
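For reference, here is a minimal sketch of how a text column can be turned into a list of plain strings with the empty cells left out. The column name `comments` and the sample rows are assumptions made up for illustration; they are not taken from the real sheet.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Excel sheet: one text column with an empty cell.
test = pd.DataFrame({"comments": ["Good ride", np.nan, "Driver was late"]})

# Empty cells are read in as NaN (a float), which is why the dtype is object.
# dropna() removes them, astype(str) guarantees every item is a string.
test_text = test["comments"].dropna().astype(str).tolist()

print(test_text)
```

Dropping the rows changes the length of the list, so this only fits when the predictions do not need to line up with the original row numbers.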
My code for classifying text based on training data:
# Training Phase
from sklearn.naive_bayes import BernoulliNB
classifier = BernoulliNB().fit(train_documents,labels)
def sentiment(word):
    return classifier.predict(count_vectorizer.transform([word]))
But when predicting, it raises this ValueError:
/anaconda3/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in transform(self, raw_documents)
1084
1085 # use the same matrix-building strategy as fit_transform
-> 1086 _, X = self._count_vocab(raw_documents, fixed_vocab=True)
1087 if self.binary:
1088 X.data.fill(1)
/anaconda3/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in _count_vocab(self, raw_documents, fixed_vocab)
940 for doc in raw_documents:
941 feature_counter = {}
--> 942 for feature in analyze(doc):
943 try:
944 feature_idx = vocabulary[feature]
/anaconda3/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in <lambda>(doc)
326 tokenize)
327 return lambda doc: self._word_ngrams(
--> 328 tokenize(preprocess(self.decode(doc))), stop_words)
329
330 else:
/anaconda3/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in decode(self, doc)
141
142 if doc is np.nan:
--> 143 raise ValueError("np.nan is an invalid document, expected byte or "
144 "unicode string.")
145
ValueError: np.nan is an invalid document, expected byte or unicode string.
I tried to solve it by following this question:
https://stackoverflow.com/questions/39303912/tfidfvectorizer-in-scikit-learn-valueerror-np-nan-is-an-invalid-document
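To make the situation reproducible without the Excel file, here is a small self-contained sketch. The training sentences, the labels, and the `comments` column are all invented for illustration, and the NaN guard inside `sentiment` is one possible workaround rather than a confirmed fix for the real data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Toy training data (hypothetical), 1 = positive, 0 = negative.
train_texts = [
    "great ride friendly driver",
    "terrible ride rude driver",
    "great service",
    "terrible service",
]
labels = [1, 0, 1, 0]

count_vectorizer = CountVectorizer(binary=True)
train_documents = count_vectorizer.fit_transform(train_texts)
classifier = BernoulliNB().fit(train_documents, labels)

# Toy test column with one empty cell, as an empty spreadsheet cell would produce.
test = pd.DataFrame({"comments": ["great driver", np.nan, "rude driver"]})

def sentiment(text):
    # Skip missing cells instead of passing NaN to the vectorizer.
    if pd.isna(text):
        return None
    return int(classifier.predict(count_vectorizer.transform([str(text)]))[0])

results = [sentiment(t) for t in test["comments"]]
print(results)
```

Returning `None` for missing cells keeps one result per row, so the output can still be attached to the DataFrame as a new column.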