Creating indicator variable columns in dplyr chain

Question

Updated: With apologies to those who replied, in my original example I overlooked the fact that data.frame() created var as a factor rather than as a character vector, as I had intended. I have corrected the example, and this will break at least one of the answers.

--original--

I have a data frame that I'm performing a series of dplyr and tidyr manipulations on, and I would like to add columns for indicator variables that would be encoded as 0 or 1, and do this within the dplyr chain. Each level of a factor (presently stored as character vectors) should be encoded in a separate column, and the column names are a concatenation of a fixed prefix with the variable level. For example, if var has level a, the new column var_a will be 1 in the rows where var is a and 0 in all other rows.

The following minimal example using base R produces exactly the results that I want (thanks to this blog post), but I'd like to roll it all into the dplyr chain, and can't quite figure out how to do it.

library(dplyr)
df <- data.frame(var = sample(x = letters[1:4], size = 10, replace = TRUE), stringsAsFactors = FALSE)
for(level in unique(df$var)){
  df[paste("var", level, sep = "_")] <- ifelse(df$var == level, 1, 0)
}

Note that the real data set contains multiple columns, none of which should be altered or dropped when creating the indicator variables, with the exception of the column var, which could be converted to type factor.

I am not sure how many groups you have in the real dataset. You can create the function `dummy <- function(x, f) (x == f)*1` and use `mutate` to create the dummy variable. However, you have to type the column name manually, but it works if you don't have too many categories. — JasonWang, Mar 11 '16 at 17:19
Good point, @Hong Ooi; nevertheless the question is a nice one, and the desperate answers demonstrate that some things do not fit into the `dplyr` scheme. @Tom's original solution looks like the most elegant version. There are cases where do-loops are more readable than xxapply. — Dieter Menne, Mar 11 '16 at 18:56
Hong, I'm not sure I understand your question. model.matrix() is fine as a solution (compared to the for() loop that I posted)--I wasn't familiar with it--but other than wrapping it in a function, as you do below, it won't work within the dplyr chain, and that's what I was shooting for. — Tom, Mar 12 '16 at 16:00
Creating matrices of indicator variables is _exactly the task_ that `model.matrix` is meant to do. It also handles complications like aliasing, interactions, non-factor variables and so on. It outputs a matrix because by the time you're creating a model matrix, you've generally done with data transformation and symbolic manipulation. If you want to continue doing transformations and manipulations you'll have to write code -- but that's why R is a programming language. — Hong Ooi, Mar 12 '16 at 16:06
Fyi, you should use `set.seed` before making a random example data set. — Frank, Mar 12 '16 at 18:07
I agree with @Hong that you should use the task best suited to the job and all the edge cases you are not yet thinking about (in this case, that's `model.matrix`). If you really insist on using hadleyverse chainables, this works: `library(tidyr); df %>% spread(var, var) %>% mutate_each(funs(. %>% is.na %>% \`!\` %>% \`+\`))` — Frank, Mar 12 '16 at 18:15
Frank, thanks for the reminder on set.seed, and for the tidyr solution. Hong, thank you for your help. You've convinced me that `model.matrix()` is the right solution for this general problem. I had thought there would be a simpler way to do this within a dplyr chain, and am somewhat surprised that there's not, but I'll be happy to drop out of the chain to use `model.matrix()` when needed. — Tom, Mar 12 '16 at 22:44
You can use model.matrix within the dplyr chain like this: `df %>% data.frame(model.matrix(~var+0,.)) %>% [other dplyr commands]` — andyyy, Feb 27 '18 at 13:47
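
The one-liner in the last comment can be tried on a small fixed data set. What follows is only a sketch: the sample values are made up for illustration, and the renaming step is an addition that assumes dplyr 1.0 or later (for `rename_with`). It relies on the magrittr rule that a `.` nested inside another call does not stop the piped data frame from also being passed as the first argument.

```r
library(dplyr)

# Illustrative data, not from the question
df <- data.frame(var = c("a", "b", "c", "a", "d"), stringsAsFactors = FALSE)

res <- df %>%
  # evaluated as data.frame(df, model.matrix(~ var + 0, df));
  # the new columns are named vara, varb, varc, vard
  data.frame(model.matrix(~ var + 0, .)) %>%
  # assumed extra step: turn "vara" into "var_a" to match the requested naming
  rename_with(~ sub("^var", "var_", .x), .cols = -var)

res
```

Note that `model.matrix` returns doubles, so the indicator columns are numeric rather than integer.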

score 5 · Accepted Answer · answered Mar 11 '16 at 16:46

It's not pretty, but this function should work

dummy <- function(data, col) {
    for(c in col) {
        # locate the column to expand; it must be a factor
        idx <- which(names(data)==c)
        v <- data[[idx]]
        stopifnot(is.factor(v))
        # one 0/1 column per level, filled by (row, level index) matrix indexing
        m <- matrix(0, nrow=nrow(data), ncol=nlevels(v))
        m[cbind(seq_along(v), as.integer(v))]<-1
        colnames(m) <- paste(c, levels(v), sep="_")
        r <- data.frame(m)
        # put the indicator columns where the original column was
        if ( idx>1 ) {
            r <- cbind(data[1:(idx-1)],r)
        }
        if ( idx<ncol(data) ) {
            r <- cbind(r, data[(idx+1):ncol(data)])
        }
        data <- r
    }
    data
}

Here's a sample data.frame

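# stringsAsFactors=TRUE is needed in R >= 4.0, where data.frame() no longer
# converts strings to factors by default
dd <- data.frame(a=runif(30),
    b=sample(letters[1:3],30,replace=TRUE),
    c=rnorm(30),
    d=sample(letters[10:13],30,replace=TRUE),
    stringsAsFactors=TRUE
)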

and you specify the columns you want to expand as a character vector. You can do

dd %>% dummy("b")

or

dd %>% dummy(c("b","d"))

Hong Ooi · Answer 2 · score 3 · 2016-03-13T11:42:11.597

The only requirements for a function to be part of a dplyr pipeline are that it takes a data frame as input, and returns a data frame as output. So, leveraging model.matrix:

make_inds <- function(df, cols=names(df))
{
    # do each variable separately to get around model.matrix dropping aliased columns
    # list(df) keeps df as a single argument, so cbind() returns a data frame
    do.call(cbind, c(list(df), lapply(cols, function(n) {
        x <- df[[n]]
        mm <- model.matrix(~ x - 1)
        colnames(mm) <- gsub("^x", paste(n, "_", sep=""), colnames(mm))
        mm
    })))
}

# insert into pipeline
data %>% ... %>% make_inds %>% ...

edited Mar 13 '16 at 11:42

answered Mar 11 '16 at 17:10


Hong, actually this isn't working. I find that I have to fix the closing parentheses, and even when the make_inds() function passes R's checks, I'm receiving an error when I run it within the dplyr chain. – Tom Mar 12 '16 at 15:53
My code fixed. Can't fix your code until you post it. – Hong Ooi Mar 12 '16 at 15:55
I think that I've tracked down the problem: the closing parenthesis at `lapply(cols)` needs to be moved after the function definition: `})))`. – Tom Mar 12 '16 at 16:41
this is great. I modified by adding mm <- mm[,-which.max(colSums(mm))] in order to drop one variable to serve as reference group (to prevent perfect collinearity). In general, I will drop the largest group. – justin cress Feb 05 '18 at 19:05

alistaire · Answer 3 · 2016-03-12T20:19:34.247

It's possible without creating a function, although it does require lapply. If var is a factor, you can work with its levels; we can bind its columns to an lapply which loops over the levels of var and creates the values, names them with setNames, and converts them into a tbl_df.

df %>% bind_cols(as_data_frame(setNames(lapply(levels(df$var), 
                                               function(x){as.integer(df$var == x)}), 
                                        paste0('var2_', levels(df$var)))))

returns

Source: local data frame [10 x 5]

      var var_d var_c var2_c var2_d
   (fctr) (dbl) (dbl)  (int)  (int)
1       d     1     0      0      1
2       c     0     1      1      0
3       c     0     1      1      0
4       c     0     1      1      0
5       d     1     0      0      1
6       d     1     0      0      1
7       c     0     1      1      0
8       c     0     1      1      0
9       d     1     0      0      1
10      c     0     1      1      0

If var is a character vector, not a factor, you can do the same thing, but using unique instead of levels:

df %>% bind_cols(as_data_frame(setNames(lapply(unique(df$var), 
                                               function(x){as.integer(df$var == x)}), 
                                        paste0('var2_', unique(df$var)))))

Two notes:

This approach will work regardless of the data type, but will be slower. If your data is big enough that it matters, it likely makes sense to store the data as a factor anyway, as it contains a lot of repeated levels.
Both versions pull data from df$var as it lives in the calling environment, not as it may exist in a larger chain, and assume var is unchanged in whatever it is passed. To reference the dynamic value of var aside from dplyr's normal NSE is rather a pain, insofar as I've seen.
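
One way around the second note is to wrap the logic in a small function, so that the values come from whatever data frame the chain passes in rather than from `df` in the calling environment. The sketch below is only an illustration: `add_indicators`, its arguments, and the sample data are hypothetical names introduced here, not part of the answer above.

```r
library(dplyr)

# Hypothetical helper: add one 0/1 column per value of `col`,
# computed from the data frame that is actually piped in.
add_indicators <- function(data, col, prefix = col) {
  values <- data[[col]]
  # use factor levels if available, otherwise the distinct values present
  lvls <- if (is.factor(values)) levels(values) else sort(unique(values))
  for (lvl in lvls) {
    data[[paste(prefix, lvl, sep = "_")]] <- as.integer(values == lvl)
  }
  data
}

# Illustrative data, not from the question
df <- data.frame(var = c("a", "b", "c", "a", "d"),
                 other = 1:5,
                 stringsAsFactors = FALSE)

res <- df %>%
  filter(var != "a") %>%     # an earlier step that changes var
  add_indicators("var")      # sees the filtered rows, so no var_a column

res
```

Because the helper reads the column from its own argument, earlier steps in the chain (here the `filter`) are reflected in the indicator columns. All other columns are left untouched.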

One more alternative that's a little simpler and factor-agnostic, using reshape2::dcast:

library(reshape2)
df %>% cbind(1 * !is.na(dcast(df, seq_along(var) ~ var, value.var = 'var')[,-1]))

It still pulls the version of df from the calling environment, so the chain really only determines what you're joining to. Because it uses cbind instead of bind_cols, the result will be a data.frame, too, not tbl_df, so if you want to keep it all tbl_df (smart if the data is big), you'll need to replace the cbind with bind_cols(as_data_frame( ... )); bind_cols doesn't seem to want to do the conversion for you.

Note, however, that while this version is simpler, it is comparatively slower, both on factor data:

Unit: microseconds
   expr      min        lq      mean    median       uq      max neval
 factor  358.889  384.0010  479.5746  427.9685  501.580 3995.951   100
 unique  547.249  585.4205  696.4709  633.4215  696.402 4528.099   100
  dcast 2265.517 2490.5955 2721.1118 2628.0730 2824.949 3928.796   100

and string data:

Unit: microseconds
   expr      min       lq      mean    median        uq      max neval
 unique  307.190  336.422  414.1031  362.6485  419.3625 3693.340   100
  dcast 2117.807 2249.077 2517.0417 2402.4285 2615.7290 3793.178   100

For small data it won't matter, but for bigger data, it may be worth putting up with the complication.

It seems that my minimal example was too minimal; my apologies. data.frame() automatically generates `var` as a factor, but if we use dplyr::data_frame(), we get a character vector (which is what my "real" data set has). Then `levels()` returns NULL, forcing us to break the dplyr chain: `df <- data_frame(var = sample(x = letters[1:4], size = 10, replace = TRUE))` `df <- df %>% mutate(var = as.factor())` `df <- bind_cols(...)` — Tom, Mar 12 '16 at 17:01
Sorry, I must have been tired; the `mutate` term wasn't doing anything. Edited and clarified, plus added a `reshape2` version which is nice for interactive work. — alistaire, Mar 12 '16 at 20:21

score 2 · Answer 4 · answered Dec 10 '18 at 21:44

I landed on this Q&A first because I really wanted to put model.matrix in a magrittr pipe workflow or produce the equivalent output with just tidyverse functions (sorry, baseRs).

Later, I found this solution, which uses those functions as elegantly as I had thought was possible (but which I wasn't coming up with on my own):

library(dplyr)
library(tidyr)

# tibble() replaces the deprecated data_frame()
df <- tibble(var = sample(x = letters[1:4], size = 10, replace = TRUE))

df %>% 
  mutate(unique_row_id = 1:n()) %>% #The rows need to be unique for `spread` to work.
  mutate(dummy = 1) %>% 
  spread(var, dummy, fill = 0)

So, I'm adding an updated/modified version of the linked solution so that people who land here first don't have to keep looking (like I did).
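
In current tidyr, `spread` is superseded by `pivot_wider`, so the same idea can be sketched as follows. This is an assumption-laden variant rather than the linked solution: it assumes tidyr 1.0 or later, uses made-up sample data, and introduces the helper columns `row_id`, `level`, and `value` purely for illustration. Copying `var` into `level` before pivoting keeps the original column, and `names_prefix` gives the `var_` naming asked for in the question.

```r
library(dplyr)
library(tidyr)

# Illustrative data, not from the question
df <- tibble(var = c("a", "b", "c", "a", "d"))

res <- df %>%
  mutate(row_id = row_number(),   # rows must be uniquely identified
         level  = var,            # copy, so that var itself survives the pivot
         value  = 1L) %>%
  pivot_wider(names_from   = level,
              values_from  = value,
              names_prefix = "var_",
              values_fill  = list(value = 0L))

res
```

The `row_id` column can be dropped afterwards with `select(-row_id)` if it is not needed.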
