117

I have an R data frame containing a factor that I want to "expand" so that for each factor level, there is an associated column in a new data frame, which contains a 1/0 indicator. E.g., suppose I have:

df.original <-data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c(1,2,3,4))

I want:

df.desired  <- data.frame(foo = c(1,1,0,0), bar=c(0,0,1,1), ham=c(1,2,3,4))

Because certain analyses (e.g., principal component analysis) require a completely numeric data frame, I thought this feature might be built in. Writing a function to do this shouldn't be too hard, but I can foresee some challenges relating to column names, and if something exists already, I'd rather use that.

JulienD
John Horton

10 Answers

138

Use the model.matrix function:

model.matrix( ~ Species - 1, data=iris )
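To see what the "- 1" in the formula changes, here is a minimal sketch on the data frame from the question. The names with_intercept and no_intercept are just illustrative; only base R is assumed.

```r
# Data from the question.
df.original <- data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c(1, 2, 3, 4))

# With an intercept: one level ("bar") is absorbed into the intercept column.
with_intercept <- model.matrix(~ eggs, data = df.original)

# Without an intercept: one indicator column per level.
no_intercept <- model.matrix(~ eggs - 1, data = df.original)

colnames(with_intercept)  # "(Intercept)" "eggsfoo"
colnames(no_intercept)    # "eggsbar" "eggsfoo"
```

Each row of no_intercept contains exactly one 1, which is the indicator layout the question asks for.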
Jaap
Greg Snow
  • 1
    Can I just add that this method was so much faster than using `cast` for me. – Matt Weller Dec 08 '13 at 15:03
  • @RyanChase, in the 14 hours between you writing your comment and me noticing it to respond you could have looked at the help page `?formula` and found the answer in the 2nd paragraph of the Details section. Or you could have tried the code with and without the "-1" and compared the output to see the effects. But I guess you are more patient than I am. The "-1" specifies to not fit an intercept (there are other ways as well) and therefore to create an indicator variable for each level rather than differences based on contrasts. – Greg Snow Sep 26 '15 at 19:52
  • 4
    @GregSnow I reviewed the 2nd paragraph of `?formula` as well as `?model.matrix`, but it was unclear (could just be my lack of depth of knowledge in matrix algebra and model formulation). After digging more, I've been able to gather that the -1 is just specifying not to include the "intercept" column. If you leave out the -1, you'll see an intercept column of 1's in the output with one binary column left out. You can tell which rows of the omitted column are 1's from the rows where all the other columns are 0's. The documentation seems cryptic; is there another good resource? – Ryan Chase Oct 05 '15 at 22:25
  • 1
    @RyanChase, there are many online tutorials and books about R/S (several that have brief descriptions on the r-project.org webpage). My own learning of S and R has been rather eclectic (and long), so I am not the best to give an opinion on how current books/tutorials appeal to beginners. I am, however, a fan of experimentation. Trying something out in a fresh R session can be very enlightening and not dangerous (the worst that has happened to me is crashing R, and that rarely, which led to improvements in R). Stack Overflow is then a good resource for understanding what happened. – Greg Snow Oct 06 '15 at 16:15
  • 8
    And if you want to convert all factor columns, you can use: `model.matrix(~., data=iris)[,-1]` – user890739 Jan 05 '16 at 00:32
  • @DestaHaileselassieHagos, the return value is a matrix, you can change the names however you want using functions like `dimnames` or `colnames`. Functions like `sub` or `gsub` may be of help as well. – Greg Snow Mar 30 '16 at 16:56
  • This answer is incomplete. How do you merge the output of model.matrix back with the columns in the original data frame, as the OP asked? – stackoverflowuser2010 May 21 '16 at 00:48
  • @stackoverflowuser2010, the original question asked for the indicators to be in a "new" data frame, not merged with the original (note the reference to techniques requiring only numeric data). But if you really want it combined with the original you can use `cbind`. – Greg Snow May 23 '16 at 17:45
  • This method does not work for numeric or date types without first coercing the column type to factor. – John Haberstroh Nov 28 '18 at 17:40
  • resurrecting this thread- any way to have this pass NA values when present? I realize the intention of the `model.matrix()` function is to create something without NA values for analysis, but if I am trying to use it for something else, this breaks down. Could write a custom function to do this, but would be cool if there was something embedded here that I am missing... – colin Dec 18 '18 at 01:12
  • 1
    @colin, Not fully automatic, but you can use `naresid` to put the missing values back in after using `na.exclude`. A quick example: `tmp <- data.frame(x=factor(c('a','b','c',NA,'a'))); tmp2 <- na.exclude(tmp); tmp3 <- model.matrix( ~x-1, tmp2); tmp4 <- naresid(attr(tmp2,'na.action'), tmp3)` – Greg Snow Dec 18 '18 at 18:57
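Two of the comments above (coercing a numeric column to a factor first, and renaming the columns afterwards) can be sketched together. This is only an illustration on the built-in airquality data; the "month_" prefix is an arbitrary choice, not something model.matrix produces.

```r
data(airquality)

# Month is numeric, so wrap it in factor() to get one indicator per month.
mm <- model.matrix(~ factor(Month) - 1, data = airquality)

# The default column names are "factor(Month)5", "factor(Month)6", ...
# fixed = TRUE treats the parentheses literally instead of as a regex group.
colnames(mm) <- sub("factor(Month)", "month_", colnames(mm), fixed = TRUE)

colnames(mm)
colSums(mm)  # number of rows per month
```

Without the factor() wrapper, Month would be treated as a single numeric predictor and no indicator columns would be created.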
18

If your data frame is only made of factors (or you are working on a subset of variables which are all factors), you can also use the acm.disjonctif function from the ade4 package:

R> library(ade4)
R> df <-data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c("red","blue","green","red"))
R> acm.disjonctif(df)
  eggs.bar eggs.foo ham.blue ham.green ham.red
1        0        1        0         0       1
2        0        1        1         0       0
3        1        0        0         1       0
4        1        0        0         0       1

Not exactly the case you are describing, but it can be useful too...

juba
  • Thanks, this helped me a lot as it uses less memory than model.matrix! – Serhiy May 11 '15 at 15:21
  • I like the way the variables get named; I dislike that they are returned as storage-hungry numeric when they *should* (IMHO) just be logicals. – dsz Aug 26 '16 at 01:08
8

A quick way using the reshape2 package:

require(reshape2)

> dcast(df.original, ham ~ eggs, length)

Using ham as value column: use value_var to override.
  ham bar foo
1   1   0   1
2   2   0   1
3   3   1   0
4   4   1   0

Note that this produces precisely the column names you want.

Prasad Chalasani
  • Good. But be careful with duplicates in ham: e.g., d <- data.frame(eggs = c("foo", "bar", "foo"), ham = c(1,2,1)); dcast(d, ham ~ eggs, length) makes foo = 2. – kohske Feb 19 '11 at 22:58
  • 1
    @Kohske, true, but I was assuming `ham` is a unique row id. If `ham` is not a unique id then one must use some other unique-id (or create a dummy one) and use that in place of `ham`. Converting a categorical label to a binary indicator would only make sense for unique ids. – Prasad Chalasani Feb 19 '11 at 23:42
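The unique-id workaround from the comments can be sketched as follows. Note that this swaps in base R's table() instead of dcast, so it is an alternative illustration rather than the reshape2 call itself; the column name row_id is made up for the example.

```r
# ham contains a duplicate (1 appears twice), so it cannot serve as the row id.
d <- data.frame(eggs = c("foo", "bar", "foo"), ham = c(1, 2, 1))

# Create a dummy unique id, one per row.
d$row_id <- seq_len(nrow(d))

# Cross-tabulate id against level: each row gets a single 1.
ind <- as.data.frame.matrix(table(d$row_id, d$eggs))

out <- cbind(ind, ham = d$ham)
out
```

Because every row_id occurs exactly once, no cell can exceed 1, which avoids the foo = 2 problem.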
7

A dummy variable is probably what you want. In that case, model.matrix is useful:

> with(df.original, data.frame(model.matrix(~eggs+0), ham))
  eggsbar eggsfoo ham
1       0       1   1
2       0       1   2
3       1       0   3
4       1       0   4
kohske
6

A late entry: class.ind from the nnet package.

library(nnet)
with(df.original, data.frame(class.ind(eggs), ham))
  bar foo ham
1   0   1   1
2   0   1   2
3   1   0   3
4   1   0   4
mnel
4

Just came across this old thread and thought I'd add a function that utilizes ade4 to take a dataframe consisting of factors and/or numeric data and returns a dataframe with factors as dummy codes.

dummy <- function(df) {

    # drop = FALSE keeps a data frame even when a single column is selected,
    # so column names survive and acm.disjonctif() always gets a data frame.
    NUM <- function(dataframe) dataframe[, sapply(dataframe, is.numeric), drop = FALSE]
    FAC <- function(dataframe) dataframe[, sapply(dataframe, is.factor), drop = FALSE]

    require(ade4)
    DF <- data.frame(NUM(df), acm.disjonctif(FAC(df)))
    return(DF)
}

Let's try it.

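# stringsAsFactors = TRUE is needed in R >= 4.0, where strings are no longer
# converted to factors by default.
df <-data.frame(eggs = c("foo", "foo", "bar", "bar"), 
            ham = c("red","blue","green","red"), x=rnorm(4),
            stringsAsFactors = TRUE)
dummy(df)

df2 <-data.frame(eggs = c("foo", "foo", "bar", "bar"), 
            ham = c("red","blue","green","red"),
            stringsAsFactors = TRUE)
dummy(df2)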
Tyler Rinker
3

Here is a clearer way to do it. I use model.matrix to create the dummy boolean variables and then combine them with the original dataframe.

df.original <-data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c(1,2,3,4))
df.original
#   eggs ham
# 1  foo   1
# 2  foo   2
# 3  bar   3
# 4  bar   4

# Create the dummy boolean variables using the model.matrix() function.
mm <- model.matrix(~eggs-1, df.original)
mm
#   eggsbar eggsfoo
# 1       0       1
# 2       0       1
# 3       1       0
# 4       1       0
# attr(,"assign")
# [1] 1 1
# attr(,"contrasts")
# attr(,"contrasts")$eggs
# [1] "contr.treatment"

# Remove the "eggs" prefix from the column names as the OP desired.
colnames(mm) <- gsub("eggs","",colnames(mm))
mm
#   bar foo
# 1   0   1
# 2   0   1
# 3   1   0
# 4   1   0
# attr(,"assign")
# [1] 1 1
# attr(,"contrasts")
# attr(,"contrasts")$eggs
# [1] "contr.treatment"

# Combine the matrix back with the original dataframe.
result <- cbind(df.original, mm)
result
#   eggs ham bar foo
# 1  foo   1   0   1
# 2  foo   2   0   1
# 3  bar   3   1   0
# 4  bar   4   1   0

# At this point, you can select out the columns that you want.
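The final selection step might look like the sketch below, which repeats the steps above so that it runs on its own. The name final is illustrative, and the column order is taken from df.desired in the question.

```r
df.original <- data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c(1, 2, 3, 4))

# Indicator columns, with the "eggs" prefix stripped from the start of each name.
mm <- model.matrix(~ eggs - 1, df.original)
colnames(mm) <- sub("^eggs", "", colnames(mm))

result <- cbind(df.original, mm)

# Keep only the indicator columns and ham, in the order the question asked for.
final <- result[, c("foo", "bar", "ham")]
final
```

Anchoring the pattern with "^" avoids removing "eggs" from the middle of a level name, which a plain gsub("eggs", "", ...) would do.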
stackoverflowuser2010
0

I needed a function to 'explode' factors that is a bit more flexible, and made one based on the acm.disjonctif function from the ade4 package. This allows you to choose the exploded values, which are 0 and 1 in acm.disjonctif. It only explodes factors that have 'few' levels. Numeric columns are preserved.

# Function to explode factors that are considered to be categorical,
# i.e., they do not have too many levels.
# - data: The data.frame in which categorical variables will be exploded.
# - values: The exploded values for the value being unequal and equal to a level.
# - max_factor_level_fraction: Maximum number of levels as a fraction of column length. Set to 1 to explode all factors.
# Inspired by the acm.disjonctif function in the ade4 package.
explode_factors <- function(data, values = c(-0.8, 0.8), max_factor_level_fraction = 0.2) {
  exploders <- colnames(data)[sapply(data, function(col){
      is.factor(col) && nlevels(col) <= max_factor_level_fraction * length(col)
    })]
  if (length(exploders) > 0) {
    exploded <- lapply(exploders, function(exp){
        col <- data[, exp]
        n <- length(col)
        dummies <- matrix(values[1], n, length(levels(col)))
        dummies[(1:n) + n * (unclass(col) - 1)] <- values[2]
        colnames(dummies) <- paste(exp, levels(col), sep = '_')
        dummies
      })
    # Only keep numeric data.
    data <- data[sapply(data, is.numeric)]
    # Add exploded values.
    data <- cbind(data, exploded)
  }
  return(data)
}
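The least obvious line in the function is the index arithmetic that fills the matrix. Here is a small standalone sketch of just that step on a four-element factor, with the same -0.8/0.8 values as the defaults above; it relies on R storing matrices column by column.

```r
col <- factor(c("foo", "foo", "bar", "bar"))  # levels are sorted: "bar", "foo"
n <- length(col)

# Start with every cell set to the "not this level" value.
dummies <- matrix(-0.8, n, nlevels(col), dimnames = list(NULL, levels(col)))

# Cell (row r, column k) has linear index r + n * (k - 1).
# Using the level code as k picks, for each row, the column of its own level.
idx <- seq_len(n) + n * (as.integer(col) - 1)
dummies[idx] <- 0.8

idx      # 5 6 3 4
dummies
```

Rows 1 and 2 get 0.8 in the foo column and rows 3 and 4 get 0.8 in the bar column; everything else stays at -0.8.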
rakensi
0

(The question is 10 years old, but for the sake of completeness...)

The function i() from the fixest package does exactly that.

Beyond creating a design matrix from a factor-like variable, you can also very easily do two extra things on the fly:

  • binning values (with the argument 'bin'),
  • excluding some factor values (with the argument ref).

And since it is made for this task, if your variable happens to be numeric you don't need to wrap it with factor(x_num) (as opposed to the model.matrix solution).

Here's an example:

library(fixest)
data(airquality)
table(airquality$Month)
#>  5  6  7  8  9 
#> 31 30 31 31 30

head(i(airquality$Month))
#>      5 6 7 8 9
#> [1,] 1 0 0 0 0
#> [2,] 1 0 0 0 0
#> [3,] 1 0 0 0 0
#> [4,] 1 0 0 0 0
#> [5,] 1 0 0 0 0
#> [6,] 1 0 0 0 0

#
# Binning (check out the help, there are many many ways to bin)
#

colSums(i(airquality$Month, bin = 5:6))
#>  5  7  8  9 
#> 61 31 31 30 

#
# References
#

head(i(airquality$Month, ref = c(6, 9)), 3)
#>      5 7 8
#> [1,] 1 0 0
#> [2,] 1 0 0
#> [3,] 1 0 0

And here's a little wrapper expanding all non-numeric variables (by default):

library(fixest)

# data: data.frame
# var: vector of variable names // if missing, all non numeric variables
# no argument checking
expand_factor = function(data, var){
    
    if(missing(var)){
        var = names(data)[!sapply(data, is.numeric)]
        if(length(var) == 0) return(data)
    }
    
    data_list = unclass(data)
    new = lapply(var, \(x) i(data_list[[x]]))
    data_list[names(data_list) %in% var] = new
    
    do.call("cbind", data_list)
}

my_data = data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c(1,2,3,4))

expand_factor(my_data)
#>      bar foo ham
#> [1,]   0   1   1
#> [2,]   0   1   2
#> [3,]   1   0   3
#> [4,]   1   0   4

Finally, for those wondering, the timing is equivalent to the model.matrix solution.

library(microbenchmark)
my_data = data.frame(x = as.factor(sample(100, 1e6, TRUE)))

microbenchmark(mm = model.matrix(~x, my_data),
               i = i(my_data$x), times = 5)
#> Unit: milliseconds
#>  expr      min       lq     mean   median       uq      max neval
#>    mm 155.1904 156.7751 209.2629 182.4964 197.9084 353.9443     5
#>     i 154.1697 154.7893 159.5202 155.4166 163.9706 169.2550     5

Laurent Bergé
0

Using sapply to apply == over the unique values of eggs generates the dummy vectors:

x <- with(df.original, data.frame(+sapply(unique(eggs), `==`, eggs), ham))
x
#  foo bar ham
#1   1   0   1
#2   1   0   2
#3   0   1   3
#4   0   1   4

all.equal(x, df.desired)
#[1] TRUE

A possibly faster variant - Result best used as list or data.frame:

. <- unique(df.original$eggs)
with(df.original, 
     data.frame(+do.call(cbind, lapply(setNames(., .), `==`, eggs)), ham))

Indexing in a matrix - Result best used as matrix:

. <- unique(df.original$eggs)
i <- match(df.original$eggs, .)
nc <- length(.)
nr <- length(i)
cbind(matrix(`[<-`(integer(nc * nr), 1:nr + nr * (i - 1), 1), nr, nc,
                 dimnames=list(NULL, .)), df.original["ham"])

Using outer - Result best used as matrix:

. <- unique(df.original$eggs)
cbind(+outer(df.original$eggs, setNames(., .), `==`), df.original["ham"])

Using rep - Result best used as matrix:

. <- unique(df.original$eggs)
n <- nrow(df.original)
cbind(+matrix(df.original$eggs == rep(., each=n), n, dimnames=list(NULL, .)),
 df.original["ham"])
GKi