47

I'm trying to create a series of dummy variables from a categorical variable using pandas in python. I've come across the get_dummies function, but whenever I try to call it I receive an error that the name is not defined.

Any thoughts or other ways to create the dummy variables would be appreciated.

EDIT: Since others seem to be coming across this, the get_dummies function in pandas now works perfectly fine. This means the following should work:

import pandas as pd

dummies = pd.get_dummies(df['Category'])

See http://blog.yhathq.com/posts/logistic-regression-and-python.html for further information.
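
For anyone who wants to try this without a dataset at hand, here is a minimal sketch. The toy DataFrame is made up for illustration, and `dtype=int` is an optional argument added here only so the output is 0/1 integers regardless of pandas version. A "name is not defined" error can also appear if `get_dummies` is called without the `pd.` prefix.

```python
import pandas as pd

# Hypothetical toy data; 'Category' matches the column name used above.
df = pd.DataFrame({'Category': ['a', 'b', 'a', 'c']})

# dtype=int requests 0/1 integers; the default dtype differs between pandas versions.
dummies = pd.get_dummies(df['Category'], dtype=int)
print(dummies)
```

Each row has exactly one 1, in the column named after that row's category.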

piRSquared
user1074057
  • https://stackoverflow.com/questions/75551546/i-want-to-create-a-dummy-variable-for-a-range I have a problem creating dummy columns can someone help? – priyal shah Feb 24 '23 at 00:36

12 Answers

38

When I think of dummy variables I think of using them in the context of OLS regression, and I would do something like this:

import numpy as np
import pandas as pd
import statsmodels.api as sm

my_data = np.array([[5, 'a', 1],
                    [3, 'b', 3],
                    [1, 'b', 2],
                    [3, 'a', 1],
                    [4, 'b', 2],
                    [7, 'c', 1],
                    [7, 'c', 1]])                


df = pd.DataFrame(data=my_data, columns=['y', 'dummy', 'x'])
just_dummies = pd.get_dummies(df['dummy'])

step_1 = pd.concat([df, just_dummies], axis=1)      
step_1.drop(['dummy', 'c'], inplace=True, axis=1)
# to run the regression we want to get rid of the strings 'a', 'b', 'c' (obviously)
# and we want to get rid of one dummy variable to avoid the dummy variable trap
# arbitrarily chose "c", coefficients on "a" and "b" would show effect of "a" and "b"
# relative to "c"
step_1 = step_1.astype(int)  # np.int was removed in NumPy 1.24; astype(int) converts every column

result = sm.OLS(step_1['y'], sm.add_constant(step_1[['x', 'a', 'b']])).fit()
print(result.summary())
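
As a rough sketch of a shortcut, recent pandas versions can drop one dummy column directly through the `drop_first` parameter of `get_dummies`. Note that it drops the first level ('a' here), so the reference category differs from the 'c' chosen above. The Series below just repeats the 'dummy' values from the example, and `dtype=int` is added only to get 0/1 integers.

```python
import pandas as pd

# Same values as the 'dummy' column in the example above.
s = pd.Series(['a', 'b', 'b', 'a', 'b', 'c', 'c'], name='dummy')

# drop_first=True drops the first level ('a'), which becomes the reference category.
dummies = pd.get_dummies(s, drop_first=True, dtype=int)
print(dummies.columns.tolist())
```

Coefficients on the remaining columns would then be read relative to 'a' instead of 'c'.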
Akavall
  • 3
    Consideration of the dummy trap! Very good. Dropping a dummy variable column is easy enough, but you'd think get_dummies would have this as an option. – conner.xyz Aug 21 '15 at 13:48
  • I think this should be the best answer. it only lacks one thing `import statsmodels.api as sm`.. so that anyone can try it on her/his Ipython notebook – stackunderflow Sep 30 '15 at 14:40
  • 1
    @stackunderflow, Fixed. Thank You. – Akavall Sep 30 '15 at 14:49
  • 9
    Update: pandas version 0.18.0, `get_dummies` now has a `drop_first` parameter that, if set to `True` will drop the first dummy variable. Ex: `pd.get_dummies(df['dummy'], drop_first=True)` – Jarad May 25 '16 at 20:58
  • get_dummies has a drop_first option. – benji Jan 16 '18 at 02:20
24

Based on the official documentation:

dummies = pd.get_dummies(df['Category']).rename(columns=lambda x: 'Category_' + str(x))
df = pd.concat([df, dummies], axis=1)
df = df.drop(['Category'], axis=1)

There is also a nice post in the FastML blog.

beyondfloatingpoint
  • 7
    Since you do inplace=True in the last line, you return none and end up with an empty dataframe. I'd update the last line: df = df.drop(['Category'], axis=1) – ori-k Jul 20 '16 at 17:21
23

It's hard to infer what you're looking for from the question, but my best guess is as follows.

If we assume you have a DataFrame where some column is 'Category' and contains integers (or otherwise unique identifiers) for categories, then we can do the following.

Call the DataFrame dfrm, and assume that for each row, dfrm['Category'] is some value in the set of integers from 1 to N. Then,

for elem in dfrm['Category'].unique():
    dfrm[str(elem)] = dfrm['Category'] == elem

Now there will be a new indicator column for each category that is True/False depending on whether the data in that row are in that category.

If you want to control the category names, you could make a dictionary, such as

cat_names = {1:'Some_Treatment', 2:'Full_Treatment', 3:'Control'}
for elem in dfrm['Category'].unique():
    dfrm[cat_names[elem]] = dfrm['Category'] == elem

to result in having columns with specified names, rather than just string conversion of the category values. In fact, for some types, str() may not produce anything useful for you.
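
A self-contained sketch of the loop above, run on made-up data (the values in `dfrm` are an assumption for illustration). It also converts the True/False indicators to 1/0 with `astype(int)`, which is optional but often what a regression routine expects.

```python
import pandas as pd

# Hypothetical data: category codes 1..3, as assumed in the text above.
dfrm = pd.DataFrame({'Category': [1, 2, 3, 1, 2]})
cat_names = {1: 'Some_Treatment', 2: 'Full_Treatment', 3: 'Control'}

for elem in dfrm['Category'].unique():
    # The comparison yields a boolean Series; astype(int) turns True/False into 1/0.
    dfrm[cat_names[elem]] = (dfrm['Category'] == elem).astype(int)

print(dfrm)
```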

ely
9

The following code returns a DataFrame with the 'Category' column replaced by dummy columns:

df_with_dummies = pd.get_dummies(df, prefix='Category_', columns=['Category'])

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html
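
A small sketch of what the resulting column names look like, on a made-up DataFrame (the `value` column is hypothetical). The prefix is joined to each category with `prefix_sep`, which defaults to '_', so `prefix='Category'` gives `Category_a`, while `prefix='Category_'` would give `Category__a` with a double underscore.

```python
import pandas as pd

# Hypothetical data: one categorical column and one numeric column.
df = pd.DataFrame({'Category': ['a', 'b', 'a'], 'value': [10, 20, 30]})

# Columns not listed in `columns` are kept as-is; the dummy columns are appended after them.
df_with_dummies = pd.get_dummies(df, columns=['Category'], prefix='Category', dtype=int)
print(df_with_dummies.columns.tolist())
```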

Spas
2

In my case, dmatrices from patsy solved the problem. This function is designed to generate the dependent and independent variables from a given DataFrame using an R-style formula string, but it can also be used to generate dummy features from categorical features. All you need to do is drop the 'Intercept' column, which dmatrices adds automatically regardless of your original DataFrame.

import pandas as pd
from patsy import dmatrices

df_original = pd.DataFrame({
   'A': ['red', 'green', 'red', 'green'],
   'B': ['car', 'car', 'truck', 'truck'],
   'C': [10,11,12,13],
   'D': ['alice', 'bob', 'charlie', 'alice']},
   index=[0, 1, 2, 3])

_, df_dummyfied = dmatrices('A ~ A + B + C + D', data=df_original, return_type='dataframe')
df_dummyfied = df_dummyfied.drop('Intercept', axis=1)

df_dummyfied.columns    
Index([u'A[T.red]', u'B[T.truck]', u'D[T.bob]', u'D[T.charlie]', u'C'], dtype='object')

df_dummyfied
   A[T.red]  B[T.truck]  D[T.bob]  D[T.charlie]     C
0       1.0         0.0       0.0           0.0  10.0
1       0.0         0.0       1.0           0.0  11.0
2       1.0         1.0       0.0           1.0  12.0
3       0.0         1.0       0.0           0.0  13.0
precise
2

You can create dummy variables to handle the categorical data

# Creating dummy variables for categorical datatypes
trainDfDummies = pd.get_dummies(trainDf, columns=['Col1', 'Col2', 'Col3', 'Col4'])

The returned trainDfDummies dataframe omits the original columns (trainDf itself is not modified) and has the dummy-variable columns appended at the end.

It automatically creates the column names by appending the values at the end of the original column name.

rzskhr
1

A very simple approach that uses NumPy and pandas instead of get_dummies, suitable when the variable has only a few categories.

Suppose we have a column named "State" with three categories, 'New York', 'California' and 'Florida', and we want a 0/1 column for each of them.

We can do it with the following simple code.

import numpy as np
import pandas as pd

dataset['NewYork_State'] = np.where(dataset['State']=='New York', 1, 0)
dataset['California_State'] = np.where(dataset['State']=='California', 1, 0)
dataset['Florida_State'] = np.where(dataset['State']=='Florida', 1, 0)
 

Above we create three new columns, "NewYork_State", "California_State" and "Florida_State", to store the values.

Drop the original column

dataset.drop(columns=['State'], inplace=True)
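
The steps above can be sketched end to end on made-up data as follows. The contents of `dataset` are an assumption for illustration, and the loop is just a compact way of writing the three `np.where` lines.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the 'dataset' DataFrame above.
dataset = pd.DataFrame({'State': ['New York', 'California', 'Florida', 'New York']})

for state in ['New York', 'California', 'Florida']:
    # Build names such as 'NewYork_State' by removing the space from the state name.
    col = state.replace(' ', '') + '_State'
    dataset[col] = np.where(dataset['State'] == state, 1, 0)

# Drop the original column once the 0/1 columns exist.
dataset = dataset.drop(columns=['State'])
print(dataset)
```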
0

So I was actually needing an answer to this question today (7/25/2013), so I wrote this earlier. I've tested it with some toy examples, hopefully you get some mileage out of it

def categorize_dict(x, y=0):
    # x Requires string or numerical input
    # y is a boolean that specifies whether to return category names along with the dict.
    # default is no
    cats = list(set(x))
    n = len(cats)
    m = len(x)
    outs = {}
    for i in cats:
        outs[i] = [0]*m
    for i in range(len(x)):
        outs[x[i]][i] = 1
    if y:
        return outs,cats
    return outs
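
A possible way to use it, sketched with a made-up list of colors: since the function returns a dict of 0/1 lists, it can be passed straight to `pd.DataFrame`. The function is repeated here only so the snippet runs on its own. Column order follows the set ordering and is not guaranteed, so columns are selected by name.

```python
import pandas as pd

def categorize_dict(x, y=0):
    # Same logic as above: one 0/1 list per distinct value in x.
    cats = list(set(x))
    outs = {c: [0] * len(x) for c in cats}
    for i in range(len(x)):
        outs[x[i]][i] = 1
    if y:
        return outs, cats
    return outs

# Hypothetical input values.
colors = ['red', 'green', 'red', 'blue']
dummies = pd.DataFrame(categorize_dict(colors))
print(dummies[['red', 'green', 'blue']])
```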
  • 1
    I edited the original question to reflect the newest version of pandas. The `get_dummies` function works just fine now. – user1074057 Aug 08 '13 at 17:16
0

I created a dummy variable for every state using this code.

def create_dummy_column(series, f):
    return series.apply(f)

for el in df.area_title.unique():
    col_name = el.split()[0] + "_dummy"
    f = lambda x: int(x==el)
    df[col_name] = create_dummy_column(df.area_title, f)
df.head()

More generally, I would just use .apply and pass it an anonymous function with the condition that defines your category.

(Thank you to @prpl.mnky.dshwshr for the .unique() insight)

userFog
0

Handling categorical features

scikit-learn expects all features to be numeric. So how do we include a categorical feature in our model?

Ordered categories: transform them to sensible numeric values (example: small=1, medium=2, large=3)
Unordered categories: use dummy encoding (0/1)

What are the categorical features in our dataset?

Ordered categories: weather (already encoded with sensible numeric values)
Unordered categories: season (needs dummy encoding), holiday (already dummy encoded), workingday (already dummy encoded)

For season, we can't simply leave the encoding as 1 = spring, 2 = summer, 3 = fall, and 4 = winter, because that would imply an ordered relationship. Instead, we create multiple dummy variables:

# A utility function to create dummy variables
def create_dummies(df, colname):
    col_dummies = pd.get_dummies(df[colname], prefix=colname)
    # drop the first dummy column to avoid the dummy variable trap
    col_dummies.drop(col_dummies.columns[0], axis=1, inplace=True)
    df = pd.concat([df, col_dummies], axis=1)
    df.drop(colname, axis=1, inplace=True)
    return df
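
To put the two cases side by side, here is a rough sketch on made-up data. The `size` column and its mapping are hypothetical, taken from the small/medium/large example in the text: the ordered category is mapped to numbers, while the unordered `season` column is dummy encoded with the first level dropped.

```python
import pandas as pd

# Hypothetical data: 'size' is ordered, 'season' is unordered (coded 1..4).
df = pd.DataFrame({'size': ['small', 'large', 'medium'], 'season': [1, 4, 2]})

# Ordered category: map to numeric values that preserve the order.
size_order = {'small': 1, 'medium': 2, 'large': 3}
df['size'] = df['size'].map(size_order)

# Unordered category: dummy encode, then drop the first dummy column.
season_dummies = pd.get_dummies(df['season'], prefix='season', dtype=int).iloc[:, 1:]
df = pd.concat([df.drop(columns='season'), season_dummies], axis=1)
print(df)
```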
0

A simple and robust way to create dummies based on a column with your category values:

for category in list(df['category_column'].unique()):
    df[category] = list(map(lambda x: 1 if x==category else 0, df['category_column']))

But watch out when running an OLS regression: you will need to exclude one of the categories so you don't fall into the dummy variable trap.

Ramon
0

If you want to replace a list of variables with dummy features:

# create an empty list to store the dataframes
dataframes = []

# iterate over the list of categorical features
for feature in categoricalFeatures:

    # create a dataframe with dummy variables for the current feature
    df_feature = pd.get_dummies(df_raw[feature])

    # add the dataframe to the list
    dataframes.append(df_feature)

# concatenate the dataframes to create a single dataframe
df_dummies = pd.concat(dataframes, axis=1)
df_final = pd.concat([df_raw, df_dummies], axis=1).drop(columns=categoricalFeatures)
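
One caveat with this loop: if two features share a value (say 'blue' appears in both), the dummy columns end up with the same name. A possible variation, sketched on made-up data, passes `prefix=feature` so every dummy column carries its source column name. The contents of `df_raw` and `categoricalFeatures` below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: 'blue' occurs in both categorical columns.
df_raw = pd.DataFrame({'color': ['red', 'blue'], 'shade': ['blue', 'dark'], 'n': [1, 2]})
categoricalFeatures = ['color', 'shade']

# Prefix each dummy column with its feature name to avoid duplicate column names.
dataframes = [pd.get_dummies(df_raw[f], prefix=f, dtype=int) for f in categoricalFeatures]

# Keep the non-categorical columns and append all dummy columns.
df_final = pd.concat([df_raw.drop(columns=categoricalFeatures)] + dataframes, axis=1)
print(df_final.columns.tolist())
```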