Scikit Learn - Combining output of TfidfVectorizer and OneHotEncoder - dimensionality

Question

I am currently developing a machine learning algorithm for ticket classification that combines a Title, Description and Customer name together to predict what team a ticket should be assigned to but have been stuck for the past few days.

Title and description are both free text and so I am passing them through TfidfVectorizer. Customer name is a category, for this I am using OneHotEncoder. I want these to work within a pipeline so have them being joined with a column transformer where I can pass in an entire dataframe and have it be processed.

file = "train_data.csv"
train_data= pd.read_csv(train_file)
string_features = ['Title', 'Description']
string_transformer = Pipeline(steps=[('tfidf', TfidfVectorizer()))
categorical_features = ['Customer']
categorical_transformer = Pipeline(steps=[('OHE', preprocessing.OneHotEncoder()))
preprocessor = ColumnTransformer(transformers = [('str', string_transformer, string_features), ('cat', categorical_transformer, categorical_features)])
clf = Pipeline(steps=[('preprocessor', preprocessor),('clf', SGDClassifier())]
X_train = train_data.drop('Team', axis=1)
y_train = train_data['Team']
clf.fit(X_train, y_train)

However I get an error: all the input array dimensions except for the concatenation axis must match exactly.

After looking into it, print(OneHotEncoder().fit_transform(X_train['Customer'])) on its own returns an error: Expected 2d array got 1d array instead.

I believe that OneHotEncoder is failing as it is expecting an array of arrays (a pandas dataframe), each being length one containing the customer name. But instead is just getting a pandas series. By converting the series to a dataframe with .to_frame() the printed output now seems to match what is outputted by the TfidfVectorizer and the dimensions should match.

Is there a way I can modify OneHotEncoder in the pipeline so that it accepts the input as it is in 1 dimension? Or is there something I can add to the pipeline that will convert it before it's passed into OneHotEncoder? Am I right in that this is the reason for the error?

Thanks.

score 4 · Accepted Answer · answered Feb 22 '19 at 10:35

I believe the problem lies in the fact that you're giving two columns to the TfIdfVectorizer (which is thus converted to a DataFrame). This will not work: TfIdfVectorizer expects a list of strings. So an immediate solution (and therefore a check of whether this is in fact the source of the problem), is changing this line to: string_features = 'Description'. Note this is not a list, it just a string. Therefore the Series is passed to the TfIdfVectorizer, and not the DataFrame.

If you would like to combine both string columns, you could either

concatanenate the strings, so you keep one column (which is the easiest), or
Fit two different TfIdfVectorizers, which is more complex but might perform better. See for instance Computing separate tfidf scores for two different columns using sklearn

Should this not solve your problem, I would advise you to share some sample data so we can actually test what is happening.

I believe the difference between your perceived error and the actual pipeline lies in the fact that you're giving it X_train['Customer'] (again a Series), but in the actual pipeline you're giving it X_train[['Customer']] (a DataFrame).

Thank you so much you were absolutely correct. Thank you for your insight and quick response! — Colossal Harry, Feb 22 '19 at 11:04
Check this answer for a complete example https://stackoverflow.com/a/54704747/7358899 — HazimoRa3d, Sep 10 '19 at 13:19

Scikit Learn - Combining output of TfidfVectorizer and OneHotEncoder - dimensionality

1 Answers1