Categorical Data with tpot

Question

I'm trying to use tpot with my inputs in pandas dataframes. I keep getting the error:

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

I believe this error is from isnan not being able to handle my data structure, but I'm unsure how to format it differently. I have a combination of categorical and continuous inputs and continuous outputs. Here's an example of code with similar data:

train_x=[[1,2,3],['test1','test2','test3'],[56.2,4.5,3.4]]
train_y=[[3,6,7]]
from tpot import TPOTRegressor

tpot=TPOTRegressor()

Do I have to convert my categorical data somehow? dataframe.values and dataframe.as_matrix give me objects that also give me an error.

score 6 · Answer 1 · answered Apr 13 '18 at 18:35

That's right - you need to convert your categorical values. TPOT assumes that all data will come in a scikit-learn compatible format, which entails that all of the data is numeric. We only recently added support for missing values, though most scikit-learn algorithms do not accept data with missing values either.

I reworked your example below to show how pandas can be used to convert the categorical values to numerical values.

import pandas as pd
from tpot import TPOTRegressor

train_x = pd.DataFrame()
train_x['a'] = [1,2,3,4]
train_x['b'] = ['test1','test2','test3','test4']
train_x['c'] = [56.2,4.5,3.4,6.7]

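# This line one-hot encodes the categorical variables.
# The cast to float keeps the array numeric even when pandas returns boolean dummy columns.
train_x = pd.get_dummies(train_x).values.astype(float)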
# Print train_x out to understand what one-hot encoding entails
print(train_x)

train_y = [3,6,7,9]

my_tpot = TPOTRegressor(cv=2)
my_tpot.fit(train_x, train_y)
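
As a rough illustration of why the original error appears, the sketch below shows that a mixed-type frame becomes an object array that np.isnan rejects, while the encoded frame becomes a plain float array. The helper name to_numeric_matrix is hypothetical (it is not part of TPOT or pandas), and the cast to float is an assumption that guards against pandas versions that return boolean dummy columns.

```python
import numpy as np
import pandas as pd


def to_numeric_matrix(df):
    """One-hot encode non-numeric columns and return a float NumPy array.

    Hypothetical helper for illustration; not part of TPOT or pandas.
    """
    encoded = pd.get_dummies(df)
    # Casting to float avoids an object-dtype array when boolean dummy
    # columns are mixed with integer and float columns.
    return encoded.to_numpy(dtype=float)


raw = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": ["test1", "test2", "test3", "test4"],
    "c": [56.2, 4.5, 3.4, 6.7],
})

# The raw values form an object array, which np.isnan cannot process.
try:
    np.isnan(raw.to_numpy())
    isnan_failed = False
except TypeError:
    isnan_failed = True

features = to_numeric_matrix(raw)
print(isnan_failed, features.dtype, features.shape)
```

Checking features.dtype before calling fit is a cheap way to catch an object array early; anything other than a numeric dtype is likely to trigger the same TypeError.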

Randy Olson

Thank you so much, Randy! That makes sense! – Deborah Paul Apr 16 '18 at 03:05
I ran tpot for a couple hours, and then stopped it early, and I got a similar warning for testing_features instead of features this time. Any idea what's going on? Here's the full warning: line 832, in score: if np.any(np.isnan(testing_features)): TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe'' – Deborah Paul Apr 17 '18 at 13:27
It seems scikit-learn can handle boolean features, but TPOT can't. – asmaier Aug 05 '18 at 20:34
@Randy - I see one of the transforms is onehotencoder. Doesn't that one-hot encode the data already, or is it still necessary to encode it myself? – J Spen Jun 18 '20 at 03:05
@JSpen TPOT assumes that you have already encoded the data in the appropriate manner. TPOT does use one-hot encoder in some configurations, but not always. It's better to encode the data yourself into a numerical format so you don't run into any issues. – Randy Olson Jun 19 '20 at 16:15
@RandyOlson Yeah, I was playing around a bit and it's as you said. It might one-hot encode those variables, but that isn't guaranteed. It's just one of the transformations that is available. Thanks for the info! – J Spen Jun 20 '20 at 17:03
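
Regarding the follow-up comment about testing_features: one possible cause (an assumption, since the scoring code was not posted) is that the test data were passed to score without the same encoding as the training data. A minimal sketch of encoding both sets to the same numeric columns with pandas follows; the frames and column names are made up for illustration.

```python
import pandas as pd

# Made-up training and test frames; "test9" does not occur in training.
train = pd.DataFrame({"a": [1, 2, 3], "b": ["test1", "test2", "test3"]})
test = pd.DataFrame({"a": [4, 5], "b": ["test2", "test9"]})

# Encode the training data and cast to float so every column is numeric.
train_enc = pd.get_dummies(train).astype(float)

# Encode the test data, then align it to the training columns:
# categories missing from the test set are filled with 0, and
# categories unseen during training are dropped.
test_enc = (
    pd.get_dummies(test)
    .reindex(columns=train_enc.columns, fill_value=0)
    .astype(float)
)

train_x = train_enc.to_numpy()
test_x = test_enc.to_numpy()
print(train_x.shape, test_x.shape)
```

Both arrays then have the same column layout and a float dtype, so they can be passed to fit and score respectively.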
