
This question was helpful in realizing that I can split training and validation data. Here is the code I use to load my train and test data:

import numpy as np
import pandas as pd

def load_data(datafile):
    training_data = pd.read_csv(datafile, header=0, low_memory=False)
    training_y = training_data[['job_performance']]
    training_x = training_data.drop(['job_performance'], axis=1)

    # Replace infinities with NaN, impute with the column mean,
    # then fall back to 0 for columns that are entirely NaN.
    training_x.replace([np.inf, -np.inf], np.nan, inplace=True)
    training_x.fillna(training_x.mean(), inplace=True)
    training_x.fillna(0, inplace=True)

    # One-hot encode every categorical/object column.
    categorical_data = training_x.select_dtypes(
        include=['category', object]).columns
    training_x = pd.get_dummies(training_x, columns=categorical_data)
    return training_x, training_y

Here, datafile is my training file. I have another file, test.csv, that has the same columns as the training file, except that it may be missing some categories. How can I apply get_dummies to the test file and ensure its categories are encoded the same way as in the training file?

Additionally, my test data is missing the job_performance column; how can I handle this in the function?

asked by Shamoon, edited by cs95
  • Why do you want to use `job_performance` in training, if it is not in testing? – harvpan Jun 24 '19 at 19:18
  • I don’t want to use it in training. But I need the get dummies to align. – Shamoon Jun 24 '19 at 19:41
  • Your problem is about the dummies aligning, which is fine. It seems job_performance is an unrelated concern? You can just handle that with if statements, right? – cs95 Jun 27 '19 at 05:10
  • I have rewritten your problem to highlight the more important issue of consistency in encoding. – cs95 Jun 27 '19 at 05:16

2 Answers


When dealing with multiple categorical columns, it is best to use sklearn.preprocessing.OneHotEncoder. It keeps track of your categories for you and handles unknown categories gracefully. For reference, these are the module versions used below:

sys.version
# '3.6.0 (v3.6.0:41df79263a11, Dec 22 2016, 17:23:13) \n[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]'
sklearn.__version__
# '0.20.0'
np.__version__
# '1.15.0'
pd.__version__
# '0.24.2'

import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'data': [1, 2, 3],
    'cat1': ['a', 'b', 'c'],
    'cat2': ['dog', 'cat', 'bird']
})

# handle_unknown='ignore' means categories not seen during fit are encoded
# as all zeros at transform time instead of raising an error.
ohe = OneHotEncoder(handle_unknown='ignore')
categorical_columns = df.select_dtypes(['category', object]).columns
dummies = pd.DataFrame(ohe.fit_transform(df[categorical_columns]).toarray(), 
                       index=df.index, 
                       dtype=int)

df_ohe = pd.concat([df.drop(categorical_columns, axis=1), dummies], axis=1)
df_ohe

   data  0  1  2  3  4  5
0     1  1  0  0  0  0  1
1     2  0  1  0  0  1  0
2     3  0  0  1  1  0  0

You can see the categories and their ordering:

ohe.categories_
# [array(['a', 'b', 'c'], dtype=object),
#  array(['bird', 'cat', 'dog'], dtype=object)]
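
To see what "handles unknown categories gracefully" means in practice, here is a minimal sketch (the values 'z' and 'fish' are made up and were never seen during fit). Because handle_unknown='ignore' was passed, unknown values are simply encoded as all zeros instead of raising an error:

# 'z' and 'fish' are unknown to the encoder; their rows get all-zero dummies.
new = pd.DataFrame({'cat1': ['a', 'z'], 'cat2': ['fish', 'dog']})
pd.DataFrame(ohe.transform(new).toarray(), index=new.index, dtype=int)

   0  1  2  3  4  5
0  1  0  0  0  0  0
1  0  0  0  0  0  1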

Now, to apply the same encoding to new data, we just need the categories from before. No need to pickle or unpickle any models here.

df2 = pd.DataFrame({
    'data': [1, 2, 1],
    'cat1': ['b', 'a', 'b'],
    'cat2': ['cat', 'dog', 'cat']
})

ohe2 = OneHotEncoder(categories=ohe.categories_)
dummies = pd.DataFrame(ohe2.fit_transform(df2[categorical_columns]).toarray(), 
                       index=df2.index, 
                       dtype=int)
pd.concat([df2.drop(categorical_columns, axis=1), dummies], axis=1)

   data  0  1  2  3  4  5
0     1  0  1  0  0  1  0
1     2  1  0  0  0  0  1
2     1  0  1  0  0  1  0
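
As an aside, if you would rather have readable column names than 0 through 5, the fitted encoder can produce them (get_feature_names is the spelling in the 0.20-era API shown here; newer scikit-learn releases rename it to get_feature_names_out), and the result can be assigned to dummies.columns before the concat:

ohe.get_feature_names(categorical_columns)
# array(['cat1_a', 'cat1_b', 'cat1_c', 'cat2_bird', 'cat2_cat', 'cat2_dog'],
#       dtype=object)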

So what does this mean for you? You'll want to change your function to work for both train and test data. Add an extra `categories` parameter to your function.

def load_data(datafile, categories=None):
    data = pd.read_csv(datafile, header=0, low_memory=False)

    # test.csv has no job_performance column, so only split off the target
    # when it is present.
    if 'job_performance' in data.columns:
        data_y = data[['job_performance']]
        data_x = data.drop(['job_performance'], axis=1)
    else:
        data_x = data
        data_y = None

    data_x.replace([np.inf, -np.inf], np.nan, inplace=True)
    data_x.fillna(data_x.mean(), inplace=True)
    data_x.fillna(0, inplace=True)

    # For training, let the encoder learn the categories ('auto'); for test,
    # pin them to the categories learned on the training data.
    ohe = OneHotEncoder(handle_unknown='ignore', 
                        categories=categories if categories else 'auto')

    # Cast to str so mixed int/str columns (zip codes, etc.) do not trip up
    # the encoder.
    categorical_data = data_x.select_dtypes(object)
    dummies = pd.DataFrame(
                ohe.fit_transform(categorical_data.astype(str)).toarray(), 
                index=data_x.index,
                dtype=int)

    data_x = pd.concat([
        data_x.drop(categorical_data.columns, axis=1), dummies], axis=1)

    # Only return the learned categories when they were not passed in.
    return (data_x, data_y) + ((ohe.categories_, ) if not categories else ())

Your function can then be called as follows:

# Load training data.
X_train, y_train, categories = load_data('train.csv')
...
# Load test data.
X_test, y_test = load_data('test.csv', categories=categories)

And the code should work fine.
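
As an optional sanity check (assuming, as stated in the question, that test.csv shares the training file's columns apart from job_performance), you can confirm that the two frames line up exactly:

# The same non-categorical columns plus the same pinned categories
# should produce identical column layouts for train and test.
assert list(X_train.columns) == list(X_test.columns)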

answered by cs95
  • ValueError: could not convert string to float: 'bird' – Pyd Jun 27 '19 at 12:48
  • @pyd: It works for me, sorry. I have updated with my module versions. Let me know what is different. – cs95 Jun 27 '19 at 13:23
  • @pyd You might find [this](https://stackoverflow.com/questions/43588679/issue-with-onehotencoder-for-categorical-features) useful. You may be running an older version of sklearn where OHE only accepts integer labels. I assume this was changed in later versions. Please upgrade, it's easier. – cs95 Jun 27 '19 at 13:27
  • I'm also having an issue with the `OHE`: `TypeError: argument must be a string or number`. I can't upgrade `sklearn` for some reason, as it stays at version `0.0`. I'm using `pipenv` if that matters – Shamoon Jun 27 '19 at 14:15
  • @Shamoon Check out [this](https://stackoverflow.com/a/50996283/4909087) link for upgrading modules inside pipenv. You are probably having the same issue that pyd was! – cs95 Jun 27 '19 at 14:19
  • I managed to upgrade to 0.20, but have the same error `scikit-learn==0.20.0` – Shamoon Jun 27 '19 at 14:20
  • @Shamoon I think your object column might have mixed strings and integers, is it possible? OHE can only take either ONLY int or only strings per column. Let me know. Another option is to try upgrading to `scikit-learn==0.21.2`. If it nothing works, could you provide some sample data to debug? – cs95 Jun 27 '19 at 14:23
  • It's possible it may be mixed. Some of the categorical information are things like zip code, etc, which are numeric. Any way to cast them to string first? – Shamoon Jun 27 '19 at 14:37
  • @Shamoon I've edited. Basically, use select_dtypes to get object columns, then convert those to str using `.astype(str)`. – cs95 Jun 27 '19 at 14:40
  • So close: `ValueError: Shape mismatch: if categories is an array, it has to be of shape (n_features,).` – Shamoon Jun 27 '19 at 16:16
  • @Shamoon Which line of code throws that error, and is it during training or testing? :D – cs95 Jun 27 '19 at 16:17
  • When I load `test.csv`, and specifically: ` ohe.fit_transform(data_x[categorical_columns]).toarray()` – Shamoon Jun 27 '19 at 16:26
  • @Shamoon I figured out the problem is because some of the categorical columns from your train data are either missing in your test data, or you have extra columns in test. Let me know how you want to handle it. You mentioned only job_performance is missing so my code does not handle any other missing columns. – cs95 Jun 27 '19 at 16:39
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/195647/discussion-between-shamoon-and-cs95). – Shamoon Jun 27 '19 at 17:23

If you want to use pandas get_dummies, you will need to manually add columns for category values that appear in train but not in test, and ignore columns that appear in test but not in train.

You could use the dummy column names ('origcolumn_value' by default) to do that, and use separate functions for train and test.

Something along these lines (haven't tested it):

def load_and_clean(datafile_path, labeled=False):
    data = pd.read_csv(datafile_path, header=0, low_memory=False)

    # Set the label aside so it is not touched by the imputation below.
    if labeled:
        job_performance = data['job_performance']
        data = data.drop(['job_performance'], axis=1)

    data.replace([np.inf, -np.inf], np.nan, inplace=True)
    data.fillna(data.mean(), inplace=True)
    data.fillna(0, inplace=True)

    if labeled:
        data['job_performance'] = job_performance

    return data

def dummies_train(training_data):
    training_y = training_data[['job_performance']]
    training_x = training_data.drop(['job_performance'], axis=1)
    categorical_data = training_x.select_dtypes(
        include=['category', object]).columns
    training_x = pd.get_dummies(training_x, columns=categorical_data)
    return training_x, training_y, training_x.columns

def dummies_test(test_data, model_columns):
    categorical_data = test_data.select_dtypes(
        include=['category', object]).columns
    test_data = pd.get_dummies(test_data, columns=categorical_data)
    # Add any dummy column seen in training but missing from test...
    for c in model_columns:
        if c not in test_data.columns:
            test_data[c] = 0
    # ...then reorder and drop columns so test matches the training layout.
    return test_data[model_columns]

training_x, training_y, model_columns = dummies_train(
    load_and_clean(<train_data_path>, labeled=True))
test_x = dummies_test(load_and_clean(<test_data_path>), model_columns)
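
The dummies_test function above could also be written more compactly with DataFrame.reindex, which creates the train-only dummy columns, fills them with 0, and drops any test-only columns in one step (equally untested, same caveats as above):

def dummies_test(test_data, model_columns):
    categorical_data = test_data.select_dtypes(
        include=['category', object]).columns
    dummies = pd.get_dummies(test_data, columns=categorical_data)
    # reindex keeps exactly the training columns, in the training order,
    # creating any missing ones filled with 0.
    return dummies.reindex(columns=model_columns, fill_value=0)
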
answered by Ezer K
  • This still doesn't guarantee alignment because it doesn't account for unseen categories in test. You will also want to ensure the test column order is the same. – cs95 Jun 27 '19 at 13:37
  • I think the `test_data[model_columns]` in the dummies_test func should take care of that – Ezer K Jun 27 '19 at 14:20