pandas.get_dummies emits a dummy variable per categorical value. Is there an easy, automated way to ask it to create only N-1 dummy variables (that is, arbitrarily drop one "baseline" variable)?

This is needed to avoid collinearity in our dataset.

Josh D.
ihadanny

2 Answers

Pandas version 0.18.0 implemented exactly what you're looking for: the drop_first option. Here's an example:

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: u'0.18.1'

In [3]: s = pd.Series(list('abcbacb'))

In [4]: pd.get_dummies(s, drop_first=True)
Out[4]: 
     b    c
0  0.0  0.0
1  1.0  0.0
2  0.0  1.0
3  1.0  0.0
4  0.0  0.0
5  0.0  1.0
6  1.0  0.0
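
The same option also works on a whole DataFrame. The following is a minimal sketch, assuming a made-up frame (the column names `color`, `size`, and `price` are hypothetical) and a pandas version that has `drop_first`:

```python
import pandas as pd

# Hypothetical example frame; the column names are invented for illustration.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size": ["S", "M", "S", "L"],
    "price": [1.0, 2.5, 3.0, 2.0],
})

# Encode only the listed columns. For each one, the first level
# (in sorted order) is dropped and serves as the baseline.
encoded = pd.get_dummies(df, columns=["color", "size"], drop_first=True)
print(encoded.columns.tolist())
```

Note that the dropped level is not random: for plain string columns it is the first value in sorted order ("blue" and "L" here), so each encoded column ends up with N-1 dummies.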
T.C. Proctor

There are a number of ways of doing so.

Possibly the simplest is to replace one of the values with None before calling get_dummies. Say you have:

import pandas as pd
import numpy as np
s = pd.Series(list('babca'))
>> s
0    b
1    a
2    b
3    c
4    a

Then use:

>> pd.get_dummies(np.where(s == s.unique()[0], None, s))
    a   c
0   0   0
1   1   0
2   0   0
3   0   1
4   1   0

to drop b.

(Of course, you need to check that your category column doesn't already contain None, since those rows would become indistinguishable from the dropped value.)


Another way is to use the prefix argument to get_dummies:

pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False)

prefix: string, list of strings, or dict of strings, default None - String to append DataFrame column names. Pass a list with length equal to the number of columns when calling get_dummies on a DataFrame. Alternatively, prefix can be a dictionary mapping column names to prefixes.

This will add a prefix to the names of all the resulting columns, and you can then drop one of the columns carrying this prefix (just make the prefix unique so the columns are easy to identify).
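
A minimal sketch of this approach, assuming the same kind of Series as above (the prefix `cat` and the choice of `b` as baseline are arbitrary, picked only for illustration):

```python
import pandas as pd

s = pd.Series(list("babca"))

# Prefix every dummy column so the generated names are easy to recognize:
# this yields the columns cat_a, cat_b, cat_c.
dummies = pd.get_dummies(s, prefix="cat")

# Drop the chosen baseline column explicitly; 'b' is picked arbitrarily here.
reduced = dummies.drop(columns="cat_b")
print(reduced.columns.tolist())
```

Unlike the None trick, this lets you choose the baseline by name and leaves genuinely missing values untouched.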

Ami Tavory
  • will try these! but don't you agree it's weird that such a common requirement isn't implemented as some parameter of get_dummies? – ihadanny Jul 19 '15 at 11:49
  • @ihadanny Not sure I personally encountered a learner that needed this representation. Do you have some example? – Ami Tavory Jul 19 '15 at 16:25
  • Any regression with a constant term will have a problem (though most stats programs are smart enough to delete collinear variables automatically). Stata, for example, will automatically use n-1 dummies in a regression to avoid this issue. I'm not sure if statsmodels will deal with this automatically or not. – JohnE Jul 19 '15 at 18:18
  • @AmiTavory, won't good old scikit.LinearSVC get confused by colinear dependent variables? – ihadanny Jul 19 '15 at 19:21
  • @JohnE Interesting point. I usually use QR decomposition to filter out (massively wide) matrices anyway, so I might have missed it. Thanks. – Ami Tavory Jul 19 '15 at 21:51
  • @ihadanny Thanks for this point. I'll actually look into this when I have more time, but it's very possible that you're right. – Ami Tavory Jul 19 '15 at 21:52
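
The collinearity concern raised in these comments can be checked numerically. The sketch below is only an illustration, not taken from any of the posts: it builds a design matrix with an intercept column and compares its rank to its column count, with and without `drop_first`. The helper name `design_rank` is hypothetical.

```python
import numpy as np
import pandas as pd

s = pd.Series(list("abcbacb"))

def design_rank(dummies):
    """Return (number of columns, matrix rank) after adding an intercept."""
    # Prepend a column of ones, as a regression with a constant term would.
    X = np.column_stack([np.ones(len(dummies)), dummies.to_numpy(dtype=float)])
    return X.shape[1], np.linalg.matrix_rank(X)

# All N dummies plus an intercept: the dummy columns sum to the intercept
# column, so the rank is lower than the number of columns.
full_cols, full_rank = design_rank(pd.get_dummies(s))

# N-1 dummies plus an intercept: the matrix has full column rank.
red_cols, red_rank = design_rank(pd.get_dummies(s, drop_first=True))

print(full_cols, full_rank)
print(red_cols, red_rank)
```

With all three dummies the matrix has four columns but rank three, which is the redundancy that dropping one baseline variable removes.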