I'm a big fan of mlxtend's plot_decision_regions function (http://rasbt.github.io/mlxtend/#examples, https://stackoverflow.com/a/43298736/1870832).

It accepts X (just two columns at a time), y, and a fitted classifier object clf, and then provides a pretty awesome visualization of the relationship between model predictions, true y-values, and a pair of independent variables.

A couple of restrictions: X and y have to be numpy arrays, and clf needs to have a predict() method. Fair enough. My problem is that in my case, the classifier clf I would like to visualize has already been fitted on a Pandas DataFrame...

import numpy as np
import pandas as pd
import xgboost as xgb

import matplotlib
matplotlib.use('Agg')
from mlxtend.plotting import plot_decision_regions
import matplotlib.pyplot as plt


# Create arbitrary dataset for example
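# np.random.random_integers is deprecated; randint's upper bound is exclusive,
# so high=3 yields the same classes 0, 1, 2
df = pd.DataFrame({'Planned_End': np.random.uniform(low=-5, high=5, size=50),
                   'Actual_End':  np.random.uniform(low=-1, high=1, size=50),
                   'Late':        np.random.randint(low=0, high=3, size=50)}
)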

# Fit a Classifier to the data
# This classifier is fit on the data as a Pandas DataFrame
X = df[['Planned_End', 'Actual_End']]
y = df['Late']

clf = xgb.XGBClassifier()
clf.fit(X, y)

So now when I try to use plot_decision_regions passing X/y as numpy arrays...

# Plot Decision Region using mlxtend's awesome plotting function
plot_decision_regions(X=X.values,
                      y=y.values,
                      clf=clf,
                      legend=2)

I (understandably) get an error saying the model can't find the column names of the dataset it was trained on:

ValueError: feature_names mismatch: ['Planned_End', 'Actual_End'] ['f0', 'f1']
expected Planned_End, Actual_End in input data
training data did not have the following fields: f1, f0

In my actual case, it would be a big deal to avoid training our model on Pandas DataFrames. Is there a way to still produce decision_regions plots for a classifier trained on a Pandas DataFrame?

Max Power

1 Answer

Try converting X and y to numpy arrays before fitting:

X = df[['Planned_End', 'Actual_End']].values
y = df['Late'].values

and then fit and plot as before:

clf = xgb.XGBClassifier()
clf.fit(X, y)

plot_decision_regions(X=X,
                      y=y,
                      clf=clf,
                      legend=2)

Alternatively, keep X and y as they are and pass X.values and y.values to both fit() and plot_decision_regions().
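If the model has to stay fitted on a DataFrame, one possible workaround is a thin wrapper whose predict() turns the numpy array back into a DataFrame with the training column names before delegating to the real model. The sketch below is only an illustration: DataFrameAdapter and NameCheckingModel are hypothetical names, and NameCheckingModel is a stand-in that mimics the column-name check rather than a real XGBClassifier, so this has not been verified against xgboost itself.

```python
import numpy as np
import pandas as pd


class DataFrameAdapter:
    """Hypothetical helper: lets a model fitted on a DataFrame accept numpy arrays."""

    def __init__(self, model, feature_names):
        self.model = model
        self.feature_names = list(feature_names)

    def predict(self, X):
        # Rebuild a DataFrame with the training column names, then delegate.
        X_df = pd.DataFrame(np.asarray(X), columns=self.feature_names)
        return np.asarray(self.model.predict(X_df))


class NameCheckingModel:
    """Stand-in classifier (illustration only) that rejects inputs without
    the column names it was fitted on."""

    def fit(self, X, y):
        self.columns_ = list(X.columns)
        return self

    def predict(self, X):
        if list(getattr(X, "columns", [])) != self.columns_:
            raise ValueError("feature_names mismatch")
        # Toy decision rule: class 1 where the first feature is positive.
        return (X[self.columns_[0]] > 0).astype(int).to_numpy()


rng = np.random.default_rng(0)
df = pd.DataFrame({"Planned_End": rng.uniform(-5, 5, size=50),
                   "Actual_End": rng.uniform(-1, 1, size=50)})
y = pd.Series(rng.integers(0, 3, size=50), name="Late")

model = NameCheckingModel().fit(df, y)         # fitted on a DataFrame
adapter = DataFrameAdapter(model, df.columns)  # accepts numpy arrays
preds = adapter.predict(df.to_numpy())
```

With the question's setup, the same idea would mean passing clf=DataFrameAdapter(clf, X.columns) to plot_decision_regions along with X.values and y.values, since the plotting function only needs an object with a predict() method.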

8-Bit Borges
  • Thanks for the response, Sangeetha. However, this solution doesn't work for me because you are explicitly training the model on numpy arrays, not a Pandas DataFrame, in order to get `plot_decision_regions` to work. My question concluded "In my actual case, it would be a big deal to avoid training our model on Pandas DataFrames. Is there a way to still produce decision_regions plots for a classifier trained on a Pandas DataFrame?" – Max Power Jul 09 '18 at 16:51
  • Please refer to https://stackoverflow.com/questions/42338972/valueerror-feature-names-mismatch-in-xgboost-in-the-predict-function . The only way it works is by converting the pandas DataFrame to a numpy array. I tried the above solution as well, fitting the classifier with X.values and y.values and then plotting decision regions. This isn't an issue with mlxtend.plotting, but with XGBClassifier. – Sangeetha James Jul 09 '18 at 22:49