Aside from your problem: prefer joblib over pickle, because it is much more efficient at storing models that carry large numpy arrays internally, such as a Random Forest. Now, for your problem itself, there are a few things to consider:
Pickling or not, the outcome of your processing is the same. Pickling is just a way to store your model: once your random forest is unpickled, it has the same properties and characteristics as before. It may rather be the case that you misconceive your input format or that you do not know how to apply the prediction method. Let's take an example: a DataFrame with 3 categorical features and a class column that depends on those 3 features.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
# 'example.csv' contains three categorical feature columns and the target column 'class'
df = pd.read_csv('example.csv', usecols=['col1', 'col2', 'col3', 'class'])
Now apply one-hot encoding and fit a Random Forest on the "class" column:
# Turning the categorical columns into dummies
dummies = pd.get_dummies(df[['col1', 'col2', 'col3']])
# Random forest ('class' is a reserved keyword, so use df['class'], not df.class)
clf = RandomForestClassifier()
clf.fit(dummies, df['class'])
Dumping and loading the model with joblib:
from sklearn.externals import joblib  # in recent scikit-learn versions, simply "import joblib"
# Dumping
joblib.dump(clf, 'filename.pkl')
# Loading
clf = joblib.load('filename.pkl')
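To check that nothing is lost in the round trip, you can compare the predictions of the model in memory with those of a freshly reloaded copy (a minimal sanity check reusing the clf, dummies and 'filename.pkl' defined above):
import numpy as np
reloaded = joblib.load('filename.pkl')
# Pickling does not change the model: both give exactly the same predictions
print(np.array_equal(clf.predict(dummies), reloaded.predict(dummies)))  # True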
Or with pickle if you want to stick to it:
import cPickle
# Dumping
with open('path/to/file', 'wb') as f:
    cPickle.dump(clf, f)
# Loading
with open('path/to/file', 'rb') as f:
    clf = cPickle.load(f)
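Note that cPickle only exists on Python 2; on Python 3 the standard pickle module already uses the C implementation under the hood, so the equivalent is simply:
import pickle
# Dumping
with open('path/to/file', 'wb') as f:
    pickle.dump(clf, f)
# Loading
with open('path/to/file', 'rb') as f:
    clf = pickle.load(f)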
Now that you have reloaded your model, the proper way to obtain a result is to use the predict method to get the class for new values. Picture a second DataFrame with a similar format, except that the class column is missing. You would do it the following way:
df_test = pd.read_csv('test.csv', usecols=['col1', 'col2', 'col3'])
# Creating dummies
dummies_test = pd.get_dummies(df_test)
# Getting the predictions
df_test['predicted'] = clf.predict(dummies_test)
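One caveat with calling get_dummies on a separate test set: if a category seen during training is missing from test.csv (or a new one appears), the dummy columns will not line up with the ones the forest was trained on and predict will complain about the number of features. A defensive sketch, reusing the training dummies from above, is to realign the test columns before predicting:
# Force the test dummies to have exactly the training columns, in the same
# order; categories absent from the test set become all-zero columns
dummies_test = dummies_test.reindex(columns=dummies.columns, fill_value=0)
df_test['predicted'] = clf.predict(dummies_test)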