
I have several data sets with 75,000 observations and a type variable that can take on a value 0-4. I want to add five new dummy variables to each data set, one for each type. The best way I could come up with to do this is as follows:

# For the 'binom' data set create dummy variables for all types in all data sets
binom.dummy.list<-list()
for(i in 0:4){
    binom.dummy.list[[i+1]]<-sapply(binom$type,function(t) ifelse(t==i,1,0))
}

# Add and merge data
binom.dummy.df<-as.data.frame(do.call("cbind",binom.dummy.list))
binom.dummy.df<-transform(binom.dummy.df,id=1:nrow(binom))
binom<-merge(binom,binom.dummy.df,by="id")

While this works, it is incredibly slow (the merge function has even crashed a few times). Is there a more efficient way to do this? Perhaps this functionality is part of a package that I am not familiar with?

DrewConway
  • `ifelse` is vectorized, so if I understand your code correctly, you don't need that `sapply`. And I wouldn't use merge - I would use SQLite or PostgreSQL. Some sample data would help too :-) – Vince Aug 02 '10 at 02:08
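
A minimal sketch of the point in the comment above, assuming a data frame `binom` with an `id` column and an integer `type` column as in the question (the sample data here is made up for illustration). Because `==` is vectorized, each dummy column can be built with a single comparison over the whole column and assigned directly, so neither `sapply()` nor `merge()` is needed.

```r
# Made-up sample data standing in for the question's 'binom' data set
set.seed(1)
binom <- data.frame(id = 1:75000, type = sample(0:4, 75000, TRUE))

# One vectorized comparison per type, assigned straight into the data frame
for (i in 0:4) {
  binom[[paste0("type", i)]] <- as.integer(binom$type == i)
}

head(binom)
```

Assigning by column keeps the rows in their original order, which is what makes the `id`-based merge unnecessary.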

8 Answers


R has a "sub-language" for translating formulas into a design matrix, and in the spirit of the language you can take advantage of it. It's fast and concise. Example: you have a cardinal (numeric) predictor x, a categorical predictor catVar, and a response y.

> binom <- data.frame(y=runif(1e5), x=runif(1e5), catVar=as.factor(sample(0:4,1e5,TRUE)))
> head(binom)
          y          x catVar
1 0.5051653 0.34888390      2
2 0.4868774 0.85005067      2
3 0.3324482 0.58467798      2
4 0.2966733 0.05510749      3
5 0.5695851 0.96237936      1
6 0.8358417 0.06367418      2

You just do

> A <- model.matrix(y ~ x + catVar,binom) 
> head(A)
  (Intercept)          x catVar1 catVar2 catVar3 catVar4
1           1 0.34888390       0       1       0       0
2           1 0.85005067       0       1       0       0
3           1 0.58467798       0       1       0       0
4           1 0.05510749       0       0       1       0
5           1 0.96237936       1       0       0       0
6           1 0.06367418       0       1       0       0

Done.

gappy
  • Any easy way of going the opposite direction, i.e. you have the dummy variables but want to collapse them into one variable? – Misha Nov 24 '11 at 22:35
  • Note that if you change the type of contrasts used you will get different results. Also, you will get different answers for ordered and unordered factors. The default contrasts set in R is `options(contrasts = c("contr.treatment", "contr.poly"))`. See `?contrasts` to add to your confusion. – geneorama Nov 19 '13 at 19:59
  • Also note that the example here has 5 categories, because the index starts at 0 `sample(0:4, 1e5 , TRUE)`. I don't think it is possible in base R to automatically generate all the levels of dummy variables. This particular example happens to omit any samples of 0, which would appear as a row of zeros in the model matrix. – geneorama Nov 19 '13 at 20:04
  • This method drops rows with NAs, which makes me prefer Joshua Ulrich's answer. And to clarify geneorama's point, for n levels of a variable you only need n-1 dummy variables to represent the information. (If for some reason you wanted to hack `model.matrix()` to explicitly represent all columns, you could add a reference level with no members as in `levels(binom$catVar) <- c("dummy", levels(binom$catVar)); A <- model.matrix(y ~ x + catVar,binom, contrasts = "contr.treatment")` but this redundancy seems risky if you're doing modeling.) – MattBagg Jun 23 '14 at 20:29
  • If you do not want the intercept, use `A <- model.matrix(y ~ x + catVar - 1, binom)` – Manoj Kumar Jun 19 '18 at 17:08
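
The comments above raise three practical points: the reference level is dropped, rows with NA are dropped, and one may want to go back from dummies to a single variable. The following is a sketch on made-up data using only base R; the small data frame `d` and the names `full`, `kept` and `recovered` are introduced here purely for illustration.

```r
# Small made-up example with one missing value
d <- data.frame(catVar = factor(c(0, 2, NA, 4, 1, 3, 0)))

# Dropping the intercept gives one indicator column per level
full <- model.matrix(~ catVar - 1, d)
nrow(full)   # one row fewer than d: the default na.action drops the NA row

# Keep every row by building the model frame with na.pass first
kept <- model.matrix(~ catVar - 1,
                     model.frame(~ catVar, d, na.action = na.pass))
nrow(kept)   # same number of rows as d

# Opposite direction: find the column holding the 1 in each row
recovered <- factor(levels(d$catVar)[max.col(full)],
                    levels = levels(d$catVar))
```

`max.col()` is safe here only because each row contains exactly one 1; with the n-1 coding (reference level omitted) the all-zero rows would need separate handling.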

Drew, this is much faster and shouldn't cause any crashes.

> binom <- data.frame(data=runif(1e5),type=sample(0:4,1e5,TRUE))
> for(t in unique(binom$type)) {
+   binom[paste("type",t,sep="")] <- ifelse(binom$type==t,1,0)
+ }
> head(binom)
        data type type2 type4 type1 type3 type0
1 0.11787309    2     1     0     0     0     0
2 0.11884046    4     0     1     0     0     0
3 0.92234950    4     0     1     0     0     0
4 0.44759259    1     0     0     1     0     0
5 0.01669651    2     1     0     0     0     0
6 0.33966184    3     0     0     0     1     0
Joshua Ulrich
  • Nice solution. May I suggest wrapping the `paste` call in `make.names`, in case the level names contain problematic characters? – agenis Jul 30 '14 at 16:24
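
A sketch of the `make.names` suggestion, assuming level labels that are not valid R names (the data frame `df` and its labels are made up for illustration). `make.names()` turns each pasted label into a syntactically valid column name.

```r
# Made-up data whose type labels contain spaces and punctuation
df <- data.frame(type = c("low risk", "high-risk", "n/a", "low risk"),
                 stringsAsFactors = FALSE)

for (lev in unique(df$type)) {
  # make.names() replaces characters that are invalid in names with "."
  col <- make.names(paste("type", lev, sep = "_"))
  df[[col]] <- as.integer(df$type == lev)
}

names(df)
```

Note that distinct labels can collapse to the same name (for example "a b" and "a-b"), so `make.names(..., unique = TRUE)` on the full vector of labels may be preferable when that is a risk.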

What about using model.matrix()?

> binom <- data.frame(data=runif(1e5),type=sample(0:4,1e5,TRUE))
> head(binom)
       data type
1 0.1412164    2
2 0.8764588    2
3 0.5559061    4
4 0.3890109    3
5 0.8725753    3
6 0.8358100    1
> inds <- model.matrix(~ factor(binom$type) - 1)
> head(inds)
  factor(binom$type)0 factor(binom$type)1 factor(binom$type)2 factor(binom$type)3 factor(binom$type)4
1                   0                   0                   1                   0                   0
2                   0                   0                   1                   0                   0
3                   0                   0                   0                   0                   1
4                   0                   0                   0                   1                   0
5                   0                   0                   0                   1                   0
6                   0                   1                   0                   0                   0
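
To get these columns back into the data frame without `merge()`, one option is to rename them and bind them on by position. This is a sketch that assumes the row order is unchanged and that `type` has no missing values (otherwise `model.matrix()` would return fewer rows and `cbind()` would fail).

```r
set.seed(42)
binom <- data.frame(data = runif(1e5), type = sample(0:4, 1e5, TRUE))

inds <- model.matrix(~ factor(binom$type) - 1)
# Replace the long default names with type0 ... type4
colnames(inds) <- paste0("type", levels(factor(binom$type)))

# Rows are in the same order, so a positional cbind() is enough
binom <- cbind(binom, inds)
head(binom)
```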
griverorz

If you're open to using the data.table package, the mltools package has a one_hot() function.

library(data.table)
library(mltools)

binom <- data.table(y=runif(1e5), x=runif(1e5), catVar=as.factor(sample(0:4,1e5,TRUE)))
one_hot(binom)

                 y          x catVar_0 catVar_1 catVar_2 catVar_3 catVar_4
     1: 0.90511891 0.83045050        0        0        1        0        0
     2: 0.91375984 0.73273830        0        0        0        1        0
     3: 0.01926608 0.10301409        0        0        1        0        0
     4: 0.48691138 0.24428157        0        1        0        0        0
     5: 0.60660396 0.09132816        0        0        1        0        0
    ---                                                                   
 99996: 0.12908356 0.26157731        0        1        0        0        0
 99997: 0.96397273 0.98959000        0        1        0        0        0
 99998: 0.16818414 0.37460941        1        0        0        0        0
 99999: 0.72610508 0.72055867        1        0        0        0        0
100000: 0.89710998 0.24155507        0        0        0        0        1

Usage

one_hot(dt, cols = "auto", sparsifyNAs = FALSE, 
        naCols = FALSE, dropCols = TRUE,
        dropUnusedLevels = FALSE)

Which column(s) should be one-hot-encoded? cols = "auto" encodes all unordered factor columns. Therefore, the command below is equivalent. This is only important when the data.table contains factors that should not be encoded.

one_hot(binom, cols="catVar")
Climbs_lika_Spyder
Ben

The recipes package is also a powerful way to do this. The example below is verbose for a single step, but the approach stays clean as you add more preprocessing steps.

library(recipes)

binom <- data.frame(y = runif(1e5), 
                    x = runif(1e5),
                    catVar = as.factor(sample(0:4, 1e5, TRUE))) # use the example from gappy
head(binom)

new_data <- recipe(y ~ ., data = binom) %>% 
  step_dummy(catVar) %>% # add dummy variable
  prep(training = binom) %>% # apply the preprocessing steps (could be more than just adding dummy variables)
  bake(new_data = binom) # apply the recipe to new data (the argument is new_data in current recipes versions)
head(new_data)

Other step examples are step_scale, step_center, step_pca, etc.

takje

I did not have good luck with the model.matrix() function, as it was omitting some factor levels for whatever reason. However, I have had good luck with this simple function from the fastDummies package:

Columns that are converted into binary dummy variables have to be categorical.

fastDummies::dummy_cols(fastDummies_example, select_columns = "numbers", remove_selected_columns = TRUE)

coding_is_fun

The nnet package for single-layer neural networks (which don't understand factors) has a conversion function: class.ind().
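
A minimal usage sketch, assuming the nnet package is installed (it is normally bundled with R as a recommended package); the vector `type` is made up for illustration.

```r
library(nnet)

type <- factor(c(0, 2, 4, 1, 3, 2))

# class.ind() returns a 0/1 matrix with one column per level,
# with columns named after the factor levels
ind <- class.ind(type)
ind
```

The result is a plain matrix, so it can be bound onto a data frame with `cbind()` if the rows are in the same order.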

Jim Bang

You can use the dummies package.

binom <- data.frame(y=runif(1e5), x=runif(1e5), catVar=as.factor(sample(0:4,1e5,TRUE)))
head(binom)

          y          x catVar
1 0.4143348 0.09721401      1
2 0.3140782 0.54340539      3
3 0.1262037 0.51820499      2
4 0.7159850 0.13167720      3
5 0.8203528 0.94116026      3
6 0.2169781 0.82020216      1

Solution:

library(dummies)
binom<-dummy.data.frame(binom)
head(binom)

          y          x catVar0 catVar1 catVar2 catVar3 catVar4
1 0.4143348 0.09721401       0       1       0       0       0
2 0.3140782 0.54340539       0       0       0       1       0
3 0.1262037 0.51820499       0       0       1       0       0
4 0.7159850 0.13167720       0       0       0       1       0
5 0.8203528 0.94116026       0       0       0       1       0
6 0.2169781 0.82020216       0       1       0       0       0
George Pipis