-3

I am using sklearn's multilabelbinarizer() to train multiple columns in my machine learning which I use to train my model.

After using it I noticed it was mixing up my data when it inverse transforms it. I created a test set of random values where I fit the data, transform it, and inverse_transform the data to get back to the original data.

I ran a simple test in jupyter notebook to show the error:

In the inverse_transformed value it messes up in row 1 mixing up the state and month.

jupyter notebook code

First of all, is there an error in how I use the multilabelbinarizer? Is there a different way to achieve the same output?

EDIT: Thank you to @Nicolas M. for helping me solve my question. I ended up solving this issue like this.

Forgive the rough explanation, but it turned out to be more complicated than I originally thought. I switched to using the label_binarizer instead of the multi_label_binarizer because it

I ended up pickling the label_binarizer defaultdict so I can load it and use it in different modules for my machine learning project.

One thing that might not be trivial is me adding new headers to dataframe I make for each column. It was in the form of column_name + column number. I did this because I needed to inverse transform the data. To do that I searched for the columns that contained the original column name which separated the larger dataframe into the individual column chunks.

here some variables that I used and what they mean for reference:
lb_dict - default dict that stores the different label binarizers.
binarize_df - dataframe that stores the binarized data.
binarized_label - label binarizes one label in the column.
header - creates a new header form: column name + number column.

inverse_df - dataframe that stores the inverse_transformed data.
one_label_list - finds the list of column names with the original column tag.
one_label_df - creates a new data frame that only stores the binarized data for one column.
single_label - binarized data that gets inverse_transformed into one column.

in this code data is the dataframe that I pass to the function.

lb_dict = defaultdict(LabelBinarizer)
# create a place holder dataframe to join new binarized data to
binarize_df = pd.DataFrame(['x'] * len(data.index), columns=['place_holder'])

# loop through each column and create a binarizer and fit/transform the data
# add new data to the binarize_df dataframe
for column in data.columns.values.tolist():

    lb_dict[column].fit(data[column])
    binarized_label = lb_dict[column].transform(data[column])

    header = [column + str(i) for i in range(0, len(binarized_label[0]))]

    binarize_df = binarize_df.join(pd.DataFrame(binarized_label, columns=header))

# drop the place holder value
binarize_df.drop(labels=['place_holder'], axis=1, inplace=True)

Here is the inverse_transform function that I wrote:

inverse_df = pd.DataFrame(['x'] * len(output.index), columns=['place_holder'])

# use a for loop to run through the different output columns that need to be inverse_transformed
for column in output_cols:
    # create a list of the different headers based on if the name contains the original output column name
    one_label_list = [x for x in output.columns.values.tolist() if column in x]
    one_label_df = output[one_label_list]

    # inverse transform the data frame for one label
    single_label = label_binarizer[column].inverse_transform(one_label_df.values)

    # join the output of the single label df to the entire output df
    inverse_df = inverse_df.join(pd.DataFrame(single_label, columns=[column]))

inverse_df.drop(labels=['place_holder'], axis=1, inplace=True)
Ethan Kulla
  • 333
  • 2
  • 9
  • 2
    Why is your code inside a picture and not text in your question? I certainly prefer copy pasting your code instead of typing it out, and I'm sure everyone else feels the same way. You need to make it easier for people to help. – cs95 Aug 18 '18 at 01:22

1 Answers1

5

The issue comes from the data (and in this case a bad use of the model). If you create a Dataframe of your MultiLabelBinarizer you will have :

enter image description here

You can see that all columns are sorted in ascending order. When you ask to reconstruct, the model will reconstruct it by "scanning" values by row.

So if you take the line one, you have :

1000 - California - January

Now if you take the second one, you have :

750 - February - New York

And so on...

So your month is swapped because of sorting order. If you replace the month by "ZFebrury", it's gonna be OK but still only by "luck"

What you should do is train 1 model per categorical feature and stack every matrix to have your final matrix. To revert it, you should extract both "sub_matrix" and do the inverse_transform.

To create 1 model per feature, you can refer to the answer of Napitupulu Jon in this SO question

EDIT 1:

I tried the code from the SO question and it doesn't work as the number of columns changed. This is what I have now (but you still have to save somewhere the column for every features)

import pandas as pd
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from collections import defaultdict

data = {
    "State" : ["California", "New York", "Alaska", "Arizona", "Alaska", "Arizona"],
    "Month" : ["January", "February", "May", "February", "January", "February" ],
    "Number" : ["1000", "750", "500", "25000", "2000", "1"]
}
df = pd.DataFrame(data)

d = defaultdict(MultiLabelBinarizer)  # dict of Features => model

list_encoded = []  # store single matrices
for column in df:
    d[column].fit(df[column])
    list_encoded.append(d[column].transform(df[column]))
merged = np.hstack(list_encoded) # matrix of 6 x 32

I hope it helps and the explaination is clear enough,

Nicolas

Nicolas M.
  • 1,472
  • 1
  • 13
  • 26