I'm currently working on a multilabel text classification problem with 4 labels, which are represented as 4 dummy variables. I have tried several ways of transforming the data so that it is suitable for multilabel classification (MLC).

Right now I'm using pipelines, but as far as I can see, this doesn't fit one model with all labels included; it fits one model per label. Do you agree with this?

I have tried to use MultiLabelBinarizer and LabelBinarizer, but with no luck.

Do you have a tip on how I can solve this so that a single model includes all the labels and takes the different label combinations into account?

A subset of the data and my code are here:

import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

# Import data
df = import_data("product_data")
# Define dataframe to only include relevant columns
df = df.loc[:,['text','TV','Internet','Mobil','Fastnet']]
# Define dataframe with labels
df_labels = df.loc[:,['TV','Internet','Mobil','Fastnet']]
# Sum the number of labels per text
sum_column = df["TV"] + df["Internet"] + df["Mobil"] + df["Fastnet"]
df["label_sum"] = sum_column
# Remove texts with no labels
df.drop(df[df['label_sum'] == 0].index, inplace = True)
# Split dataset
train, test = train_test_split(df, random_state=42, test_size=0.2, shuffle=True)
X_train = train.text
X_test = test.text

categories = ['TV','Internet','Mobil','Fastnet']

# Model
LogReg_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer='word', max_df=0.20)),
    ('clf', LogisticRegression(solver='lbfgs', multi_class='ovr', class_weight='balanced', n_jobs=-1)),
])
    
for category in categories:
    print('... Processing {}'.format(category))
    LogReg_pipeline.fit(X_train, train[category])
    prediction = LogReg_pipeline.predict(X_test)
    print('Test accuracy is {}'.format(accuracy_score(test[category], prediction)))

https://www.transfernow.net/dl/20210921NbWDt3eo

DV82XL
hideonbush

1 Answer

Code Analysis

The scikit-learn LogisticRegression classifier predicts a single target at a time; the OVR (one-vs-rest) setting only controls how a multiclass target is handled. Since your loop fits the pipeline on one label column per iteration, you produce one trained model per label. The algorithm is the same for all of them, but each is trained on a different target. Note that the loop re-fits the same pipeline object, so only the model for the last label is kept after the loop ends.

Multi-Output Classifier

  • A multi-output wrapper accepts multiple independent labels and generates one prediction for each target.
  • The output should be the same as what you have, but you only need to maintain a single estimator object and fit it once. Internally it still fits one copy of the base model per label, so label combinations are not modeled jointly.
  • To use this approach, wrap your LR model in a MultiOutputClassifier. (MultiOutputRegressor is the equivalent wrapper for regression models.)
  • Here is a good tutorial on multi-output regression models.

from sklearn.multioutput import MultiOutputClassifier

model = LogisticRegression(solver='lbfgs', multi_class='ovr', class_weight='balanced', n_jobs=-1)

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer='word', max_df=0.20)),
    ('clf', MultiOutputClassifier(model)),
])

# Take the labels from the same split as X_train so the rows line up
preds = pipeline.fit(X_train, train[categories]).predict(X_test)
df_preds = combine_data(X=X_test, Y=preds, y_cols=categories)

combine_data() merges all data into a single DataFrame for convenience:

import pandas as pd

def combine_data(X, Y, y_cols):
    """ X is a Series or DataFrame, Y is a np array, y_cols is a list of column names """
    df_out = pd.DataFrame(Y, columns=y_cols)
    df_out.index = X.index
    return pd.concat([X, df_out], axis=1).sort_index()
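
To make the "one model per label" point concrete, here is a minimal sketch on a handful of invented texts (the toy DataFrame is purely illustrative, not the real data). It assumes scikit-learn's MultiOutputClassifier as the wrapper and checks how many base estimators end up fitted.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

categories = ['TV', 'Internet', 'Mobil', 'Fastnet']

# Toy stand-in for the real data (invented for illustration only)
toy = pd.DataFrame({
    'text': [
        'tv channel package is missing',
        'internet router keeps dropping',
        'mobile phone has no signal',
        'landline phone is dead',
        'tv and internet bundle price',
        'mobile data and internet speed',
    ],
    'TV':       [1, 0, 0, 0, 1, 0],
    'Internet': [0, 1, 0, 0, 1, 1],
    'Mobil':    [0, 0, 1, 0, 0, 1],
    'Fastnet':  [0, 0, 0, 1, 0, 0],
})

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer='word')),
    ('clf', MultiOutputClassifier(LogisticRegression(class_weight='balanced'))),
])

# A single fit call on all four label columns at once
pipeline.fit(toy['text'], toy[categories])

fitted = pipeline.named_steps['clf'].estimators_
preds = pipeline.predict(toy['text'])

print(len(fitted))   # number of fitted base models, one per label column
print(preds.shape)   # (n_samples, n_labels)
```

The wrapper is mainly a convenience: each label still gets its own independently trained LogisticRegression, which is why the predictions can match the per-label loop from the question.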

Multinomial Logistic Regression

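  • To use a LogisticRegression classifier on all labels at once, set multi_class='multinomial'. This treats the labels as mutually exclusive classes, so it only applies when every text has exactly one label.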
  • The softmax function is used to find the predicted probability of a sample belonging to a class.
  • You'll need to reverse the one-hot encoding on the label to get back the categorical variable (answer here). If you have the original label before one-hot encoding, use that.
  • Here is a good tutorial on multinomial logistic regression.
label_col = "text_source"
clf = LogisticRegression(multi_class='multinomial', solver='lbfgs')
model = clf.fit(df_train[input_cols], df_train[label_col])

# Generate a table of probabilities, one column per class
# (X_test must hold the same input_cols features as the training data)
probs = model.predict_proba(X_test)
df_probs = combine_data(X=X_test, Y=probs, y_cols=model.classes_)

# Predict the class for a sample, i.e. the one with the highest probability
preds = model.predict(X_test)
df_preds = combine_data(X=X_test, Y=preds, y_cols=[label_col])
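
Reversing the one-hot encoding mentioned above could look roughly like this. It is a sketch on invented rows and assumes every row has exactly one label set, which is what the multinomial approach needs; rows with several labels cannot be collapsed this way.

```python
import pandas as pd

categories = ['TV', 'Internet', 'Mobil', 'Fastnet']

# Invented rows for illustration; each has exactly one label set
onehot = pd.DataFrame(
    [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0]],
    columns=categories,
)

# Guard: this conversion is only valid when every row has exactly one 1
assert (onehot.sum(axis=1) == 1).all()

# idxmax returns the name of the column holding each row's maximum, i.e. the 1
label = onehot.idxmax(axis=1)
print(label.tolist())  # ['TV', 'Mobil', 'Internet']
```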
DV82XL
  • Hi, thanks for the response, very helpful. I have tried to use your MultiLogReg code, but I have trouble with X_train having shape (42141,) while df_labels is (55194, 4). I need to make the rows in df_labels match the ones in X_train, so they both have 42141 rows, but I can't figure out how to do it – hideonbush Sep 21 '21 at 10:57
  • Right. It's because you're removing rows from `df` when sum is 0, but not from `df_labels`. You need to remove rows from both, e.g. when the dataframes are still merged. – DV82XL Sep 21 '21 at 11:03
  • I have tried that now, but it doesn't seem to fix it. To me it looks like the problem is that I split the df into train/test while df_labels is not split the same way... EDIT: I have fixed that part now – hideonbush Sep 21 '21 at 11:23
  • Good catch. It might be a good idea to do all operations on the data and labels in the same data frame, then extract what you need when you're done. – DV82XL Sep 21 '21 at 11:26
  • Your model seems to work correctly now! I get arrays out, so it should be working. Do you know how I can convert this array into a more easily interpretable output? array([[1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], ..., [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 1, 0]]) – hideonbush Sep 21 '21 at 11:30
  • I can't really figure out how to implement that into what I have in my code to be honest – hideonbush Sep 21 '21 at 11:49
  • I would like an output showing whether a label is 0 or 1 for each text in the test set if that is possible – hideonbush Sep 21 '21 at 11:54
  • This is the output. It seems like something is not working totally right, but it is going in the correct direction! https://snipboard.io/pflJM7.jpg – hideonbush Sep 21 '21 at 12:15
  • I have made it work now. This model actually gives the same output as the one I started with. But thanks for the help and guidance, appreciate that – hideonbush Sep 21 '21 at 12:34
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/237315/discussion-between-dv82xl-and-mads-emil-hvid-rasmussen). – DV82XL Sep 21 '21 at 12:36
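
For the question in the comments about turning the prediction array into something easier to read, one possible approach is sketched below. The texts, index values and predictions are invented stand-ins for X_test and the output of predict(); the 'predicted' column name is likewise just an illustrative choice.

```python
import numpy as np
import pandas as pd

categories = ['TV', 'Internet', 'Mobil', 'Fastnet']

# Invented stand-ins for X_test and the array returned by predict()
X_test = pd.Series(
    ['no tv signal', 'slow wifi', 'sim card and router'],
    index=[7, 2, 5],
    name='text',
)
preds = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 1, 0],
])

# Reuse the test index so each prediction row stays attached to its text
df_preds = pd.DataFrame(preds, columns=categories, index=X_test.index)
df_out = pd.concat([X_test, df_preds], axis=1)

# Optional: collapse the 0/1 columns into a list of predicted label names
df_out['predicted'] = [
    [name for name, flag in zip(categories, row) if flag == 1]
    for row in preds
]

print(df_out)
```

This gives one row per test text, with a 0/1 column for each label plus the list of label names that were predicted for it.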