66

I have a set of dataframes where one of the columns contains a categorical variable. I'd like to convert it to several dummy variables, in which case I'd normally use get_dummies.

What happens is that get_dummies looks at the data available in each dataframe to find out how many categories there are, and thus creates the appropriate number of dummy variables. However, in the problem I'm working on right now, I actually know in advance what the possible categories are. But when looking at each dataframe individually, not all categories necessarily appear.

My question is: is there a way to pass to get_dummies (or an equivalent function) the names of the categories, so that, for the categories that don't appear in a given dataframe, it'd just create a column of 0s?

Something that would make this:

categories = ['a', 'b', 'c']

   cat
1   a
2   b
3   a

Become this:

   cat_a  cat_b  cat_c
1      1      0      0
2      0      1      0
3      1      0      0
ayhan
Berne
  • 1
    you are looking for the `sklearn.OneHotEncoder`. Look here: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html – ssm May 25 '16 at 01:12
  • 2
    @ssm: `get_dummies` implements the same functionality as `OneHotEncoder`, with the added benefit that the output is an easy to understand pandas dataframe with labeled columns, instead of a plain `ndarray`. – T.C. Proctor May 26 '16 at 20:59
  • I had misunderstood the question. Thanks! – ssm May 27 '16 at 00:54
  • I think for model training, it's not necessary to perform this step. If one category is missing in the training data, whether or not you provide a column with all zeros, your model will not learn anything helpful to predict for test instances that contain that variable. – Quickbeam2k1 Jul 28 '17 at 05:51
  • @Quickbeam2k1 The number of cases where this isn't necessary is pretty small. For initial prototyping, it may not be, but for any production code you'd want to ensure that all model inputs have the same columns. – T.C. Proctor Sep 28 '17 at 17:50
  • This depends on the data, say you have a model, but for one single category a new value appears rather suddenly. In that case your model will still be able to give predictions for such values. However, if you encode the categories manually, your model will produce errors. The question is, what is desired. All I'm saying is: maybe you don't know all input values beforehand. Additionally, when retraining the model, the new values for the categories are naturally treated in the model. – Quickbeam2k1 Sep 28 '17 at 18:41
  • @Quickbeam2k1 Just about every model I know about requires consistent dimensionality of the input data. If you don't encode the categories as below, you'll have a change in dimensionality if "a new value appears rather suddenly". – T.C. Proctor Oct 04 '17 at 21:28
  • @Quickbeam2k1 At least with [this method](https://stackoverflow.com/a/37451867/3358599), if a value appears that was unknown before, it will not create a new column for it - that row will be all zeroes. This guarantees consistent dimensionality. As an aside, it is probably a good idea to only include categories that appear in a training set, as the treatment of novel categories may be unpredictable in many models. – T.C. Proctor Oct 04 '17 at 21:37
  • @T.C.Proctor exactly, additionally e.g. piRSquared's solution will have the same benefit. There is no need in passing the potential category levels to get_dummies. However, if dataframes need to be combined and get_dummies needs to be called for whatever reason before the combination, I admit it might be necessary to know the category levels beforehand. If the get_dummies call happens late in a pipeline it will in general not be necessary to pass the category levels due to the behavior you described above – Quickbeam2k1 Oct 05 '17 at 05:13
  • @Quickbeam2k1 The behavior I described above is exactly a reason to pass the categories - if you don't, an extra column will be created for novel categories, which is likely to throw an error unless you deliberately drop it, at which point you might as well have passed the categories explicitly. – T.C. Proctor Nov 02 '17 at 14:44
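
To make the point about consistent dimensionality concrete, here is a minimal sketch. It assumes a pandas version that has pd.CategoricalDtype and the dtype argument of get_dummies; the variable names and the unseen value "z" are made up for illustration.

```python
import pandas as pd

# Categories fixed in advance (for example, taken from the training data).
known = pd.CategoricalDtype(categories=["a", "b", "c"])

# "z" was never declared as a category, so astype() turns it into NaN.
new_data = pd.Series(["a", "z", "b"]).astype(known)

# NaN rows get all zeros; no extra column is created for "z".
dummies = pd.get_dummies(new_data, dtype=int)
print(dummies)
```

The output always has exactly the declared columns, whatever values show up in the data.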

11 Answers

60

TL;DR:

pd.get_dummies(cat.astype(pd.CategoricalDtype(categories=categories)))
  • Older pandas: pd.get_dummies(cat.astype('category', categories=categories))

is there a way to pass to get_dummies (or an equivalent function) the names of the categories, so that, for the categories that don't appear in a given dataframe, it'd just create a column of 0s?

Yes, there is! Pandas has a special type of Series just for categorical data. One of the attributes of this series is the possible categories, which get_dummies takes into account. Here's an example:

In [1]: import pandas as pd

In [2]: possible_categories = list('abc')

In [3]: dtype = pd.CategoricalDtype(categories=possible_categories)

In [4]: cat = pd.Series(list('aba'), dtype=dtype)
In [5]: cat
Out[5]: 
0    a
1    b
2    a
dtype: category
Categories (3, object): [a, b, c]

Then, get_dummies will do exactly what you want!

In [6]: pd.get_dummies(cat)
Out[6]: 
   a  b  c
0  1  0  0
1  0  1  0
2  1  0  0

There are a bunch of other ways to create a categorical Series or DataFrame; this is just the one I find most convenient. You can read about all of them in the pandas documentation.
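
For the DataFrame layout in the question (column names prefixed with cat_), the same idea might look like the sketch below. It assumes a pandas version with pd.CategoricalDtype and the dtype argument of get_dummies; the frame and its 1-based index are just a reconstruction of the question's example.

```python
import pandas as pd

possible_categories = ["a", "b", "c"]
df = pd.DataFrame({"cat": ["a", "b", "a"]}, index=[1, 2, 3])

# Declare all known categories on the column, including the absent "c".
df["cat"] = df["cat"].astype(pd.CategoricalDtype(categories=possible_categories))

# get_dummies prefixes each dummy with the source column name by default.
dummies = pd.get_dummies(df, columns=["cat"], dtype=int)
print(dummies)
```

This should give the cat_a, cat_b, cat_c columns asked for, with cat_c all zeros.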

EDIT:

I haven't followed the exact versioning, but there was a bug in how pandas treats sparse matrices, at least until version 0.17.0. It was corrected by version 0.18.1 (released May 2016).

For version 0.17.0, if you try to do this with the sparse=True option with a DataFrame, the column of zeros for the missing dummy variable will be a column of NaN, and it will be converted to dense.

It looks like pandas 0.21.0 added a CategoricalDtype, and creating categoricals that explicitly include the categories, as in the original answer, was deprecated; I'm not quite sure when.

T.C. Proctor
  • Nice, I didn't know about that data type in Pandas, thanks! – Berne May 26 '16 at 23:28
  • Uh, well, I chose piRSquared's answer because it was clear, concise, and adapted to the code I already had. Plus, it was the one I ended up using in what I was doing, so in a way it was the one that solved my problem. Yours was more *informative* as a whole, granted, but it's not the one I ultimately used, that's why I didn't change it to yours, sorry... I'd give it bonus points if I could though. – Berne May 27 '16 at 01:39
  • 1
    Seems like astype does not accept categories anymore and you need to pass CategoricalDtype(categories=[...]) instead (https://stackoverflow.com/questions/37952128/pandas-astype-categories-not-working) – Nicoowr Apr 17 '20 at 12:27
  • @NicoLi I'm wondering if this is actually a deprecation. The [1.0.1 docs](https://pandas.pydata.org/pandas-docs/version/1.0.1/user_guide/categorical.html) still show this method in the categorical guide – T.C. Proctor Apr 17 '20 at 16:36
  • @NicoLi Also, the question you linked to is three years old. I'm pretty sure I've been using categoricals this way for a while. Though I don't have a super up-to-date install, I'm pretty sure it's only a year or two old. – T.C. Proctor Apr 17 '20 at 16:40
  • 1
    I just tried with pandas (1.0.1), pd.Series.astype('category') does work, but additional argument categories=[...] is not recognized. – Nicoowr Apr 17 '20 at 17:23
  • 1
    @NicoLi, Ah, that seems to follow the docs. I updated the answer accordingly. – T.C. Proctor Apr 17 '20 at 17:30
37

Using transpose and reindex

import pandas as pd

cats = ['a', 'b', 'c']
df = pd.DataFrame({'cat': ['a', 'b', 'a']})

dummies = pd.get_dummies(df, prefix='', prefix_sep='')
dummies = dummies.T.reindex(cats).T.fillna(0)

print(dummies)

    a    b    c
0  1.0  0.0  0.0
1  0.0  1.0  0.0
2  1.0  0.0  0.0
piRSquared
  • 2
    by using `reindex`'s `columns` keyword (i.e. `dummies.reindex(columns=cats)`), you don't need to do the double transpose. – T.C. Proctor May 27 '16 at 04:30
  • 9
    Also `reindex` has a `fill_value` parameter, which does what you've done with the `fillna`. Thus, the line before you print the result can be done with: `dummies = dummies.reindex(columns=cats, fill_value=0)`. – T.C. Proctor Jun 05 '16 at 15:14
  • What should we do if we don't have an idea of the number of columns in 'cats' ? – datascana Apr 27 '17 at 10:06
  • @datascana `df['cat'].unique()` will give you a list of all the values that are actually present in the data. – T.C. Proctor May 04 '17 at 23:27
  • A question recently marked as a dupe allowed me to discover this gem of an answer. – cs95 Sep 06 '17 at 10:32
5

I did ask this on the pandas github. Turns out it is really easy to get around it when you define the column as a Categorical where you define all the possible categories.

df['col'] = pd.Categorical(df['col'], categories=['a', 'b', 'c', 'd'])

get_dummies() will do the rest then as expected.
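
Since the question mentions a set of dataframes, here is a rough sketch of wrapping this in a helper and applying it to each frame. The helper name, the sample frames, and the use of dtype=int are my own assumptions, not part of the pandas API or the GitHub discussion.

```python
import pandas as pd

def dummies_with_fixed_categories(df, col, categories):
    """Return a copy of df with `col` one-hot encoded against a fixed category list."""
    out = df.copy()
    out[col] = pd.Categorical(out[col], categories=categories)
    return pd.get_dummies(out, columns=[col], dtype=int)

# Two hypothetical frames; neither contains every category.
frames = [
    pd.DataFrame({"col": ["a", "b", "a"]}),
    pd.DataFrame({"col": ["d", "c"]}),
]
encoded = [dummies_with_fixed_categories(f, "col", ["a", "b", "c", "d"]) for f in frames]

# Every frame ends up with the same four dummy columns, in the same order.
print([list(e.columns) for e in encoded])
```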

andre
4

Try this:

In[1]: import pandas as pd
       cats = ["a", "b", "c"]

In[2]: df = pd.DataFrame({"cat": ["a", "b", "a"]})

In[3]: pd.concat((pd.get_dummies(df.cat, columns=cats), pd.DataFrame(columns=cats))).fillna(0)
Out[3]: 
     a    b    c
0  1.0  0.0  0
1  0.0  1.0  0
2  1.0  0.0  0
Kapil Sharma
  • The `columns=cats` in the `get_dummies` here doesn't actually do anything. The `columns` option is for selecting a subset of the original data frame that you want encoded with dummy variables. It seems to ignore it if the requested columns don't appear in the data frame. It seems like it ought to produce an error, but it doesn't – T.C. Proctor May 27 '16 at 05:37
3

I don't think get_dummies provides this out of the box; it only allows for creating an extra column that highlights NaN values.

To add the missing columns yourself, you could use pd.concat along axis=0 to vertically 'stack' the DataFrames (the dummy columns plus a DataFrame id), which automatically creates any missing columns. Then use fillna(0) to replace the missing values, and use .groupby('id') to separate the individual DataFrames again.
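
A minimal sketch of that stack, fill and split idea follows. The frame names ("first", "second"), the sample data, and the dtype=int argument are assumptions made for illustration.

```python
import pandas as pd

# Two hypothetical frames; neither contains every category.
frames = {
    "first": pd.DataFrame({"cat": ["a", "b", "a"]}),
    "second": pd.DataFrame({"cat": ["c", "a"]}),
}

# Encode each frame separately and tag its rows with a frame id.
parts = [pd.get_dummies(f, dtype=int).assign(id=name) for name, f in frames.items()]

# Stacking aligns on column names; dummies absent from a frame become NaN,
# which fillna(0) then replaces.
stacked = pd.concat(parts, axis=0).fillna(0)

# Split back into one frame per id and restore integer dummies.
result = {name: g.drop(columns="id").astype(int) for name, g in stacked.groupby("id")}
print(result["first"])
print(result["second"])
```

Note that this only produces columns for categories that appear in at least one of the frames, so a category absent from all of them is still missing.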

Stefan
  • Yeah, that's kind of the alternative I thought about, but I was hoping there might be something already implemented that would be simpler to use (not necessarily with `get_dummies`, but the only other alternative I found was `sklearn`'s `OneHotEncoder` which doesn't seem to help much either...) – Berne May 25 '16 at 00:58
  • You may just as well skip `get_dummies` and just create all `0`-`1` columns yourself based off the category column itself. I guess depends a bit on the size of your problem. – Stefan May 25 '16 at 01:30
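
The "create the 0-1 columns yourself" suggestion could be sketched roughly like this, assuming the category list from the question; the cat_ prefix and variable names are just for illustration.

```python
import pandas as pd

categories = ["a", "b", "c"]
df = pd.DataFrame({"cat": ["a", "b", "a"]}, index=[1, 2, 3])

# One comparison per known category; categories absent from df give all zeros.
manual = pd.DataFrame(
    {f"cat_{c}": (df["cat"] == c).astype(int) for c in categories},
    index=df.index,
)
print(manual)
```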
3

Adding the missing category in the test set:

# Get the columns present in the training set but missing from the test set
missing_cols = set(train.columns) - set(test.columns)
# Add each missing column to the test set with a default value of 0
for c in missing_cols:
    test[c] = 0
# Ensure the columns in the test set are in the same order as in the training set
test = test[train.columns]

Note that this code also removes any column that results from a category present in the test dataset but not in the training dataset.
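
The same alignment can probably be written as a single reindex call, as suggested in comments on another answer. The train and test frames below are made-up examples, and dtype=int assumes a reasonably recent pandas.

```python
import pandas as pd

train = pd.get_dummies(pd.DataFrame({"cat": ["a", "b", "c"]}), dtype=int)
test = pd.get_dummies(pd.DataFrame({"cat": ["a", "d"]}), dtype=int)

# Add train-only columns as zeros, drop test-only columns, match train's order.
test = test.reindex(columns=train.columns, fill_value=0)
print(test)
```

Here cat_b and cat_c are added as zeros and cat_d is dropped.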

Thibault Clement
2

As suggested by others, converting your categorical features to the 'category' data type should resolve the unseen-label issue when using get_dummies.

# Your data frame (df)
import pandas as pd
from sklearn.model_selection import train_test_split

X = df.loc[:, df.columns != 'label']
Y = df.loc[:, df.columns == 'label']

# Split the data into 70% training and 30% test
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)
X_train, X_test = X_train.copy(), X_test.copy()  # avoid SettingWithCopyWarning

# Convert the categorical columns to type 'category', using every value seen in df
# (astype('category', categories=...) no longer works in recent pandas)
cat_cols = X.select_dtypes(include=['object']).columns
for col in cat_cols:
    dtype = pd.CategoricalDtype(categories=X[col].dropna().unique())
    X_train[col] = X_train[col].astype(dtype)
    X_test[col] = X_test[col].astype(dtype)

# Now get_dummies on the training and test data yields the same set of columns
X_train = pd.get_dummies(X_train, columns=cat_cols)
X_test = pd.get_dummies(X_test, columns=cat_cols)
Rudr
2

The shorter the better:

import pandas as pd

cats = pd.Index(['a', 'b', 'c'])
df = pd.DataFrame({'cat': ['a', 'b', 'a']})

pd.get_dummies(df, prefix='', prefix_sep='').reindex(columns = cats, fill_value=0)

Result:

    a   b   c
0   1   0   0
1   0   1   0
2   1   0   0

Notes:

  • cats can be a pandas Index; a plain list also works with reindex(columns=...)
  • prefix='' and prefix_sep='' need to be set in order to use the category names as you defined them in the first place. Otherwise, get_dummies produces cat_a, cat_b and cat_c. To me this is better because it is explicit.
  • use fill_value=0 to convert the NaN in column c. Alternatively, you can use fillna(0) at the end of the statement (I don't know which is faster).

Here's a shorter-shorter version (changed the Index values):

import pandas as pd

cats = pd.Index(['cat_a', 'cat_b', 'cat_c'])
df = pd.DataFrame({'cat': ['a', 'b', 'a']})

pd.get_dummies(df).reindex(columns = cats, fill_value=0)

Result:

   cat_a  cat_b  cat_c
0      1      0      0
1      0      1      0
2      1      0      0

Bonus track!

I imagine you have the categories because you previously ran a dummy/one-hot encoding on training data. You can save the original encoding (.columns) and then apply it at production time:

cats = pd.Index(['cat_a', 'cat_b', 'cat_c']) # it might come from the original onehot encoding (df_ohe.columns)

import pickle

with open('cats.pickle', 'wb') as handle:
    pickle.dump(cats, handle, protocol=pickle.HIGHEST_PROTOCOL)


with open('cats.pickle', 'rb') as handle:
    saved_cats = pickle.load(handle)



df = pd.DataFrame({'cat': ['a', 'b', 'a']})

pd.get_dummies(df).reindex(columns = saved_cats, fill_value=0)

Result:

   cat_a  cat_b  cat_c
0      1      0      0
1      0      1      0
2      1      0      0
Pablo Casas
1

If you know your categories you can first apply pd.get_dummies() as you suggested and add the missing category columns afterwards.

This will create your example with the missing cat_c:

import pandas as pd

categories = ['a', 'b', 'c']
df = pd.DataFrame(list('aba'), columns=['cat'])
df = pd.get_dummies(df)

print(df)

   cat_a  cat_b
0      1      0
1      0      1
2      1      0

Now simply add the missing category columns with a union operation (as suggested here).

possible_categories = ['cat_' + cat for cat in categories]

df = df.reindex(df.columns.union(possible_categories, sort=False), axis=1, fill_value=0)

print(df)

   cat_a  cat_b  cat_c
0      1      0      0
1      0      1      0
2      1      0      0

mhellmeier
0

I was recently looking to solve this same issue, but working with a multi-column dataframe and with two datasets (a train set and test set for a machine learning task). The test dataframe had the same categorical columns as the train dataframe, but some of these columns had missing categories that were present in the train dataframe.

I did not want to manually define all the possible categories for every column. Instead, I combined the train and test dataframes into one, called get_dummies, and then split that back into two.

# train_cat, test_cat are dataframes instantiated elsewhere
import pandas as pd

n_train = train_cat.shape[0]

# Stack train and test, encode once, then split back by row position
train_test_cat = pd.concat([train_cat, test_cat], axis=0)
train_test_cat = pd.get_dummies(train_test_cat)

train_cat = train_test_cat.iloc[:n_train, :]
test_cat = train_test_cat.iloc[n_train:, :]
pleanbean
  • 1
    You should not be mixing train and test. It might work for your example, but we should treat test as data we have never seen before. If we do like you said, we might add a label that is not on the training set to the training set. – Leonardo Neves Sep 26 '20 at 20:49
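
One way to follow that advice might be to learn the category levels from the training frame only and then apply them to both frames. This is only a sketch: the sample data, the `levels` dict and the `encode` helper are hypothetical names, not an established API.

```python
import pandas as pd

train_cat = pd.DataFrame({"color": ["red", "blue", "red"]})
test_cat = pd.DataFrame({"color": ["blue", "green"]})

# Learn the category levels from the training data only.
levels = {col: train_cat[col].unique() for col in train_cat.columns}

def encode(df):
    """One-hot encode df using only the levels seen in training."""
    out = df.copy()
    for col, cats in levels.items():
        out[col] = pd.Categorical(out[col], categories=cats)
    return pd.get_dummies(out, dtype=int)

train_enc = encode(train_cat)
test_enc = encode(test_cat)  # "green" is unseen in training, so its row is all zeros
print(test_enc)
```

Both encoded frames share the same columns, and nothing from the test set leaks into the column layout.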
0

Change the column to categorical and everything will work. For example:

df = pd.DataFrame({'a': ['one', 'two', 'three', 'three'], 
                   'b':['hello', 'hello', 'hello', 'hello']})

# outputs this
       a      b
0    one  hello
1    two  hello
2  three  hello
3  three  hello

Now turn them into categorical variables like this:

a_categories = ['one', 'two', 'three', 'four']
b_categories = ['hello', 'world']
df['a'] = pd.Categorical(df['a'], categories=a_categories, ordered=False)
df['b'] = pd.Categorical(df['b'], categories=b_categories, ordered=False)

Get the dummies and it will look like this:

pd.get_dummies(df)

# outputs:
   a_one  a_two  a_three  a_four  b_hello  b_world
0   True  False    False   False     True    False
1  False   True    False   False     True    False
2  False  False     True   False     True    False
3  False  False     True   False     True    False
DrGeneral