np.nan is an invalid document, expected byte or unicode string in CountVectorizer

Question

I am trying to create dependency columns for each non-numeric attribute in the Adult data set from UCI and then eliminate the original non-numeric attributes. I am using CountVectorizer from sklearn.feature_extraction.text, but I got stuck: my program says "np.nan is an invalid document, expected byte or unicode string."

I just want to understand why I am getting that error. Can anyone help me out? Thank you.

Here is my code:

import pandas as pd
from sklearn.cross_validation import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

def check(ex):
    try:
        int(ex)
        return False
    except ValueError:
        return True

feature_cols = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'Target']

data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header=None, names = feature_cols)

feature_cols.remove('Target')
X = data[feature_cols]
y = data['Target']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1)

columns = X.columns

vect = CountVectorizer()

for each in columns:
    if check(X[each][1]):
        temp = X[each]
        X_dtm = pd.DataFrame(vect.fit_transform(temp).toarray(), columns = vect.get_feature_names())
        X = pd.merge(X, X_dtm, how='outer')
        X = X.drop(each, 1)

print X.columns

The error is:

Traceback (most recent call last):
  File "/home/amey/prog/pd.py", line 41, in <module>
    X_dtm = pd.DataFrame(vect.fit_transform(temp).toarray(), columns = vect.get_feature_names())
  File "/usr/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 817, in fit_transform
    self.fixed_vocabulary_)
  File "/usr/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 752, in _count_vocab
    for feature in analyze(doc):
  File "/usr/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 238, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "/usr/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 118, in decode
    raise ValueError("np.nan is an invalid document, expected byte or "
ValueError: np.nan is an invalid document, expected byte or unicode string.
[Finished in 3.3s with exit code 1]

Please refer to this answer: https://stackoverflow.com/a/39308809/1042586 — nextofsearch, Oct 27 '17 at 05:43

score 0 · Answer 1 · edited Oct 10 '21 at 05:32

Some of the values in the column you pass to CountVectorizer are NaN. CountVectorizer expects every document to be a byte or unicode string, so you need to replace those values before vectorizing.
Immediately after importing the data, use this:

c = ''  # replacement string of your choice; it can be left empty
some_variable = your_feature_data.fillna(c)
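
To illustrate the idea, here is a minimal sketch on a made-up column (the `workclass` values below are hypothetical toy data, not the real Adult data set). It counts the NaN entries, replaces them with an empty string, and only then calls CountVectorizer. The column names are read from `vocabulary_` so the sketch does not depend on which `get_feature_names` variant your scikit-learn version provides.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy column standing in for one non-numeric attribute.
workclass = pd.Series(["Private", np.nan, "State-gov", "Private"])

# Count the missing entries before vectorizing.
n_missing = int(workclass.isnull().sum())

# Replace NaN with an empty string so every entry is a str.
cleaned = workclass.fillna("")

vect = CountVectorizer()
dtm = vect.fit_transform(cleaned)

# vocabulary_ maps token -> column index; sorting by index gives the column order.
tokens = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
counts = pd.DataFrame(dtm.toarray(), columns=tokens)
print(counts)
```

A row that was NaN becomes an all-zero row in the resulting frame. If you would rather keep "missing" as its own feature, you could fill with a placeholder word such as 'unknown' instead of an empty string.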

edited Oct 10 '21 at 05:32

MD Mushfirat Mohaimin

answered Oct 09 '21 at 18:43

Lone Mohsin
