I'm currently working on a multilabel text classification problem with 4 labels, which are represented as 4 dummy (0/1) variables. I have tried several ways of transforming the data into a form that is suitable for multilabel classification.
Right now I'm using a pipeline, but as far as I can see this doesn't fit one model that includes all the labels; rather, it fits one model per label - do you agree with this?
I have tried to use MultiLabelBinarizer and LabelBinarizer, but with no luck.
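To show what I mean, my MultiLabelBinarizer attempt looked roughly like this (a simplified sketch with toy data, not my exact code):
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
# Toy version of the four dummy label columns (1 = label applies to the text)
labels = pd.DataFrame({'TV': [1, 0, 1], 'Internet': [0, 1, 1],
                       'Mobil': [0, 0, 1], 'Fastnet': [1, 0, 0]})
# Collect the label names per row, then binarize them back into an indicator matrix
label_lists = labels.apply(lambda row: [c for c in labels.columns if row[c] == 1], axis=1)
mlb = MultiLabelBinarizer(classes=list(labels.columns))
y = mlb.fit_transform(label_lists)
print(y)  # same 0/1 matrix as the dummy columns
As far as I can tell this just gives me back the same 0/1 matrix I already have from the dummy variables, so it doesn't get me any closer to a single model.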
Do you have a tip on how I can solve this in a way that includes all the labels in a single model, taking the different label combinations into account?
A subset of the data and my code are here:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
# Import data
df = import_data("product_data")
# Define dataframe to only include relevant columns
df = df.loc[:,['text','TV','Internet','Mobil','Fastnet']]
# Define dataframe with labels
df_labels = df.loc[:,['TV','Internet','Mobil','Fastnet']]
# Sum the number of labels per text
sum_column = df["TV"] + df["Internet"] + df["Mobil"] + df["Fastnet"]
df["label_sum"] = sum_column
# Remove texts with no labels
df.drop(df[df['label_sum'] == 0].index, inplace = True)
# Split dataset
train, test = train_test_split(df, random_state=42, test_size=0.2, shuffle=True)
X_train = train.text
X_test = test.text
categories = ['TV','Internet','Mobil','Fastnet']
# Model
LogReg_pipeline = Pipeline([
('tfidf', TfidfVectorizer(analyzer = 'word', max_df=0.20)),
('clf', LogisticRegression(solver='lbfgs', multi_class = 'ovr', class_weight = 'balanced', n_jobs=-1)),
])
for category in categories:
    print('... Processing {}'.format(category))
    # Fit a separate binary classifier for this label only and evaluate it
    LogReg_pipeline.fit(X_train, train[category])
    prediction = LogReg_pipeline.predict(X_test)
    print('Test accuracy is {}'.format(accuracy_score(test[category], prediction)))
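What I imagine the "all labels in one model" version would look like is fitting on the full 4-column label matrix in one go, for example by wrapping the classifier in OneVsRestClassifier or ClassifierChain (the latter because it is supposed to take label dependencies into account). The following is only a sketch of that idea based on my reading of the docs, not code I have working:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import ClassifierChain
y_train = train[categories]
y_test = test[categories]
# One fit over all four label columns (still independent binary models internally)
ovr_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer='word', max_df=0.20)),
    ('clf', OneVsRestClassifier(LogisticRegression(solver='lbfgs', class_weight='balanced'))),
])
ovr_pipeline.fit(X_train, y_train)
print('Subset accuracy: {}'.format(ovr_pipeline.score(X_test, y_test)))
# ClassifierChain feeds earlier label predictions into later ones,
# which is my understanding of how label combinations could be modelled
chain_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer='word', max_df=0.20)),
    ('clf', ClassifierChain(LogisticRegression(solver='lbfgs', class_weight='balanced'), random_state=42)),
])
chain_pipeline.fit(X_train, y_train)
print('Subset accuracy: {}'.format(chain_pipeline.score(X_test, y_test)))
Is something along these lines the right direction, or is there a better way to handle the label combinations in a single model?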