Reshape pandas.Df to use in GridSearch

Question

I am trying to use multiple feature columns in GridSearchCV with a Pipeline. I pass two columns that I want to run through a TfidfVectorizer, but I run into trouble when fitting the grid search.

Xs = training_data.loc[:,['text','path_contents']]
y = training_data['class_recoded'].astype('int32')

for col in Xs:
    print(Xs[col].shape)

print(Xs.shape)
print(y.shape)

# (2464L,)
# (2464L,)
# (2464, 2)
# (2464L,)

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([('vectorizer', TfidfVectorizer(encoding="cp1252", stop_words="english")), 
                     ('nb', MultinomialNB())])

parameters = {
    'vectorizer__max_df': (0.48, 0.5, 0.52,),
    'vectorizer__max_features': (None, 8500, 9000, 9500),
    'vectorizer__ngram_range': ((1, 3), (1, 4), (1, 5)),
    'vectorizer__use_idf': (False, True)
}

if __name__ == "__main__":
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=2)
    grid_search.fit(Xs, y) # <- error thrown here

    print("Best score: {0}".format(grid_search.best_score_))  
    print("Best parameters set:")  
    best_parameters = grid_search.best_estimator_.get_params()  
    for param_name in sorted(list(parameters.keys())):  
        print("\t{0}: {1}".format(param_name, best_parameters[param_name]))

Error: ValueError: Found input variables with inconsistent numbers of samples: [2, 1642]

I read a similar error here and here, and I tried both questions' suggestions but to no avail.

I tried selecting my data in a different way:

features = ['text', 'path_contents']
Xs = training_data[features]

I tried using .values instead as suggested here, like so:

grid_search.fit(Xs.values, y.values)

but that gave me the following error:

AttributeError: 'numpy.ndarray' object has no attribute 'lower'

So what's going on? I'm not sure how to continue from this.

`TfidfVectorizer.fit()` needs an iterable whose elements are strings, but `Xs` contains two columns, so each row of `Xs` is an array. When `TfidfVectorizer` calls the `lower()` method on those elements, it raises AttributeError: 'numpy.ndarray' object has no attribute 'lower'. So the question is why you are passing two columns to `TfidfVectorizer`. — HYRY, May 27 '17 at 12:13
As I suggested in my answer to your [previous question](https://stackoverflow.com/a/44212413/3374996), using multiple columns with TfidfVectorizer inside a pipeline is not that straightforward. — Vivek Kumar, May 27 '17 at 12:47

score 0 · Accepted Answer · answered May 27 '17 at 12:20

TfidfVectorizer expects a list of strings as input. That explains "AttributeError: 'numpy.ndarray' object has no attribute 'lower'": you passed a 2D array, which it treats as a list of arrays.

So you have two choices: either concatenate the two columns into one beforehand (in pandas) or, if you want to keep them separate, use a feature union in the pipeline (http://scikit-learn.org/stable/modules/pipeline.html#feature-union).
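Both choices can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy `training_data` frame stands in for the real one, and `ColumnSelector` and `text_branch` are hypothetical helpers written for this example, not part of scikit-learn.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline

# Toy stand-in for training_data (assumption; the real frame has 2464 rows).
training_data = pd.DataFrame({
    "text": ["cheap pills online", "meeting agenda attached",
             "win cash prizes", "quarterly report draft"],
    "path_contents": ["promo offers", "office documents",
                      "promo lottery", "office finance"],
    "class_recoded": [1, 0, 1, 0],
})
y = training_data["class_recoded"].astype("int32")

# Choice 1: merge both columns into a single Series of strings,
# so the vectorizer sees one document per row.
combined = training_data["text"] + " " + training_data["path_contents"]
single = Pipeline([("vectorizer", TfidfVectorizer(stop_words="english")),
                   ("nb", MultinomialNB())])
single.fit(combined, y)


# Choice 2: keep the columns separate and vectorize each in its own branch.
class ColumnSelector(TransformerMixin, BaseEstimator):
    """Hypothetical helper: pass one DataFrame column to the next step."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X[self.column]  # a Series of strings


def text_branch(column):
    """Hypothetical helper: select one column, then apply TF-IDF to it."""
    return Pipeline([("select", ColumnSelector(column)),
                     ("tfidf", TfidfVectorizer(stop_words="english"))])


union = Pipeline([
    ("features", FeatureUnion([("text", text_branch("text")),
                               ("path", text_branch("path_contents"))])),
    ("nb", MultinomialNB()),
])
Xs = training_data[["text", "path_contents"]]
union.fit(Xs, y)
```

With the second layout, grid-search parameter names have to follow the nesting, for example `features__text__tfidf__max_df` instead of `vectorizer__max_df`.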

As for the first exception, I guess it is caused by the interaction between pandas and sklearn. However, you cannot tell for sure because of the error above in the code.

If the first exception were "Error: ValueError: Found input variables with inconsistent numbers of samples: [2, **2464**]", I would say that it is also caused by the malformed input: Xs is a list of 2 columns and y is a Series of length 2464, which would produce the message [2, **2464**]. — THN, May 27 '17 at 12:34
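Both errors can be reproduced on a small frame, which may make the cause easier to see. The sketch below uses made-up data. It relies on the fact that iterating over a DataFrame yields its column labels, so the vectorizer ends up with two "documents". The 1642 in the question would then be the size of one cross-validation training split rather than the full 2464 rows, though that part is an inference and not something the traceback confirms.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up two-column frame standing in for Xs (assumption).
Xs = pd.DataFrame({
    "text": ["first document", "second document", "third document"],
    "path_contents": ["alpha", "beta", "gamma"],
})

# Iterating a DataFrame yields column labels, not rows.
docs_seen = list(Xs)

# The vectorizer therefore treats the two labels as two documents,
# and its output has 2 rows no matter how many rows Xs has.
matrix = TfidfVectorizer().fit_transform(Xs)
n_rows = matrix.shape[0]

# With .values each element is a NumPy row, which has no .lower() method.
try:
    TfidfVectorizer().fit_transform(Xs.values)
    error = None
except AttributeError as exc:
    error = str(exc)

print(docs_seen)
print(n_rows)
print(error)
```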
