I am currently developing a machine learning algorithm for ticket classification that combines a Title, Description and Customer name together to predict what team a ticket should be assigned to but have been stuck for the past few days.
Title and description are both free text and so I am passing them through TfidfVectorizer. Customer name is a category, for this I am using OneHotEncoder. I want these to work within a pipeline so have them being joined with a column transformer where I can pass in an entire dataframe and have it be processed.
file = "train_data.csv"
train_data= pd.read_csv(train_file)
string_features = ['Title', 'Description']
string_transformer = Pipeline(steps=[('tfidf', TfidfVectorizer()))
categorical_features = ['Customer']
categorical_transformer = Pipeline(steps=[('OHE', preprocessing.OneHotEncoder()))
preprocessor = ColumnTransformer(transformers = [('str', string_transformer, string_features), ('cat', categorical_transformer, categorical_features)])
clf = Pipeline(steps=[('preprocessor', preprocessor),('clf', SGDClassifier())]
X_train = train_data.drop('Team', axis=1)
y_train = train_data['Team']
clf.fit(X_train, y_train)
However I get an error: all the input array dimensions except for the concatenation axis must match exactly.
After looking into it, print(OneHotEncoder().fit_transform(X_train['Customer']))
on its own returns an error: Expected 2d array got 1d array instead.
I believe that OneHotEncoder is failing as it is expecting an array of arrays (a pandas dataframe), each being length one containing the customer name. But instead is just getting a pandas series. By converting the series to a dataframe with .to_frame() the printed output now seems to match what is outputted by the TfidfVectorizer and the dimensions should match.
Is there a way I can modify OneHotEncoder in the pipeline so that it accepts the input as it is in 1 dimension? Or is there something I can add to the pipeline that will convert it before it's passed into OneHotEncoder? Am I right in that this is the reason for the error?
Thanks.