I am developing a machine learning model for ticket classification that combines a ticket's Title, Description and Customer name to predict which team the ticket should be assigned to, but I have been stuck for the past few days.

Title and Description are both free text, so I pass them through TfidfVectorizer. Customer name is categorical, so I use OneHotEncoder for it. I want these to work within a pipeline, so I join them with a ColumnTransformer; that way I can pass in an entire dataframe and have it processed.

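import pandas as pd
from sklearn import preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

train_file = "train_data.csv"
train_data = pd.read_csv(train_file)
# Free-text columns go through TF-IDF
string_features = ['Title', 'Description']
string_transformer = Pipeline(steps=[('tfidf', TfidfVectorizer())])
# Categorical column is one-hot encoded
categorical_features = ['Customer']
categorical_transformer = Pipeline(steps=[('OHE', preprocessing.OneHotEncoder())])
preprocessor = ColumnTransformer(transformers=[('str', string_transformer, string_features), ('cat', categorical_transformer, categorical_features)])
clf = Pipeline(steps=[('preprocessor', preprocessor), ('clf', SGDClassifier())])
X_train = train_data.drop('Team', axis=1)
y_train = train_data['Team']
clf.fit(X_train, y_train)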

However, I get an error: all the input array dimensions except for the concatenation axis must match exactly.

After looking into it, I found that print(OneHotEncoder().fit_transform(X_train['Customer'])) on its own raises a different error: Expected 2D array, got 1D array instead.

I believe OneHotEncoder is failing because it expects a 2-dimensional input (an array of arrays, such as a pandas dataframe), with each row having length one and containing the customer name, but instead it is just getting a pandas series. After converting the series to a dataframe with .to_frame(), the printed output seems to match what TfidfVectorizer outputs, so the dimensions should match.
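A minimal sketch of the 1-D versus 2-D behaviour described above. The customer names are made up for illustration, and this runs OneHotEncoder on its own rather than inside the pipeline.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the Customer column of the training data.
customers = pd.Series(["acme", "globex", "acme"], name="Customer")

# A Series is 1-dimensional, so OneHotEncoder rejects it with a ValueError.
try:
    OneHotEncoder().fit_transform(customers)
    series_error = None
except ValueError as exc:
    series_error = str(exc)

# .to_frame() turns the Series into a one-column DataFrame (2-dimensional),
# which OneHotEncoder accepts: one row per ticket, one column per customer.
encoded = OneHotEncoder().fit_transform(customers.to_frame())

print(series_error)
print(encoded.shape)
```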

Is there a way I can modify OneHotEncoder in the pipeline so that it accepts the input as it is in 1 dimension? Or is there something I can add to the pipeline that will convert it before it's passed into OneHotEncoder? Am I right in that this is the reason for the error?

Thanks.

1 Answer

I believe the problem lies in the fact that you're giving two columns to the TfidfVectorizer. Because string_features is a list, the ColumnTransformer passes those columns on as a DataFrame. This will not work: TfidfVectorizer expects an iterable of strings, such as a list or a Series. So an immediate solution (and therefore a check of whether this is in fact the source of the problem) is changing this line to: string_features = 'Description'. Note that this is not a list, it is just a string. The Series is therefore passed to the TfidfVectorizer, and not the DataFrame.
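To illustrate why the two-column DataFrame goes wrong, here is a small sketch with made-up ticket text. Iterating over a DataFrame yields its column names, so TfidfVectorizer treats the names 'Title' and 'Description' as the only two documents, whereas a Series yields one document per row.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample tickets, for illustration only.
X = pd.DataFrame({
    "Title": ["printer offline", "vpn drops", "password reset"],
    "Description": [
        "the office printer will not print",
        "vpn disconnects every hour",
        "user cannot log in",
    ],
})

# DataFrame input: the vectorizer iterates over the column names,
# so it sees two "documents" regardless of how many rows there are.
two_cols = TfidfVectorizer().fit_transform(X[["Title", "Description"]])

# Series input: one document per row, as intended.
one_col = TfidfVectorizer().fit_transform(X["Description"])

print(two_cols.shape[0], "rows from the DataFrame")
print(one_col.shape[0], "rows from the Series")
```

The row count of the first result no longer matches the one-hot encoded Customer column, which is consistent with the concatenation error in the question.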

If you would like to combine both string columns, you could either
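One common way to use both text columns, offered as a sketch rather than as the specific options the answer had in mind, is to give each column its own TfidfVectorizer entry in the ColumnTransformer, each selected with a plain string. The sample data, the step names and the handle_unknown and random_state settings are assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data with the same columns as in the question.
train_data = pd.DataFrame({
    "Title": ["printer offline", "vpn drops", "invoice wrong", "refund request"],
    "Description": [
        "the office printer will not print",
        "vpn disconnects every hour",
        "the invoice total is incorrect",
        "customer asks for a refund",
    ],
    "Customer": ["acme", "globex", "acme", "initech"],
    "Team": ["it", "it", "billing", "billing"],
})

preprocessor = ColumnTransformer(transformers=[
    # A string selector passes a 1-D Series, which TfidfVectorizer needs.
    ("title", TfidfVectorizer(), "Title"),
    ("desc", TfidfVectorizer(), "Description"),
    # A list selector passes a 2-D DataFrame, which OneHotEncoder needs.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Customer"]),
])

clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("clf", SGDClassifier(random_state=0)),
])

X_train = train_data.drop("Team", axis=1)
y_train = train_data["Team"]
clf.fit(X_train, y_train)

features = clf.named_steps["preprocessor"].transform(X_train)
predictions = clf.predict(X_train)
print(features.shape[0], len(predictions))
```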

Should this not solve your problem, I would advise you to share some sample data so we can actually test what is happening.

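I believe the difference between the error you see when testing OneHotEncoder on its own and what happens in the actual pipeline lies in the fact that in your test you're giving it X_train['Customer'] (again a Series), whereas in the actual pipeline the ColumnTransformer gives it X_train[['Customer']] (a DataFrame), because categorical_features is a list.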
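The Series versus DataFrame distinction can be checked with a small probe. This is a sketch only: make_probe and seen_ndim are hypothetical helpers that record the dimensionality of whatever the ColumnTransformer hands to each transformer, and the data is made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

# Hypothetical stand-in for the training frame.
X = pd.DataFrame({"Customer": ["acme", "globex", "acme"]})

seen_ndim = {}

def make_probe(label):
    """Build a transformer that records the ndim of its input."""
    def probe(data):
        seen_ndim[label] = data.ndim
        # Return a dummy 2-D block so the ColumnTransformer can stack it.
        return np.zeros((len(data), 1))
    return FunctionTransformer(probe)

ct = ColumnTransformer(transformers=[
    ("as_list", make_probe("list"), ["Customer"]),   # list selector
    ("as_scalar", make_probe("scalar"), "Customer"), # string selector
])
ct.fit_transform(X)

print(seen_ndim)
```

The list selector hands over a 2-dimensional DataFrame and the string selector a 1-dimensional Series, which is why OneHotEncoder works inside the pipeline but fails when called directly on X_train['Customer'].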

Jondiedoop