R: create dummy variables based on a categorical variable of lists

Question

I have a data frame with a categorical variable holding lists of strings of variable length (this matters, because otherwise this question would be a duplicate of this or this), e.g.:

df <- data.frame(x = 1:5)
df$y <- list("A", c("A", "B"), "C", c("B", "D", "C"), "E")
df

  x       y
1 1       A
2 2    A, B
3 3       C
4 4 B, D, C
5 5       E

And the desired form is a dummy variable for each unique string seen anywhere in df$y, i.e.:

data.frame(x = 1:5, A = c(1,1,0,0,0), B = c(0,1,0,1,0), C = c(0,0,1,1,0), D = c(0,0,0,1,0), E = c(0,0,0,0,1))

  x A B C D E
1 1 1 0 0 0 0
2 2 1 1 0 0 0
3 3 0 0 1 0 0
4 4 0 1 1 1 0
5 5 0 0 0 0 1

This naive approach works:

uniqueStrings <- unique(unlist(df$y))
n <- ncol(df)
for (i in seq_along(uniqueStrings)) {
  # 1 if the i-th string occurs in the row's list, 0 otherwise
  df[, n + i] <- sapply(df$y, function(x) as.numeric(uniqueStrings[i] %in% x))
  colnames(df)[n + i] <- uniqueStrings[i]
}

However it is very ugly, lazy and slow with big data frames.

Any suggestions? Something fancy from the tidyverse?

UPDATE: I got 3 different approaches below. I tested them using system.time on my laptop (Windows 7, 32GB RAM) on a real dataset comprising 1M rows, each row containing a list of 1 to 4 strings (out of ~350 unique string values), about 200MB on disk overall. So the expected result is a data frame with dimensions 1M x 350. The tidyverse (@Sotos) and base (@joel.wilson) approaches took so long I had to restart R. The qdapTools (@akrun) approach, however, worked fantastically:

> system.time(res1 <- mtabulate(varsLists))
   user  system elapsed 
  47.05   10.27  116.82

So this is the approach I'll mark accepted.
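
For anyone who wants to repeat this kind of comparison without the real data, here is a minimal sketch on synthetic data. The sizes, the made-up string values and the helper names `per_string` and `cross_tab` are assumptions for illustration only; `varsLists` just mirrors the name used in the timing above. It compares a per-string loop with a single base `table()` cross-tabulation and reports no timings, since those depend on the machine.

```r
set.seed(1)
n_rows  <- 10000                       # assumption: much smaller than the 1M rows above
strings <- sprintf("s%03d", 1:50)      # assumption: 50 made-up values instead of ~350

# Each element holds 1 to 4 distinct strings, like the dataset described above.
varsLists <- replicate(n_rows, sample(strings, sample(1:4, 1)), simplify = FALSE)

# Loop-style approach: one pass over all rows per unique string.
per_string <- function(y) {
  lev <- sort(unique(unlist(y)))
  sapply(lev, function(s) as.integer(vapply(y, function(el) s %in% el, logical(1))))
}

# Single cross-tabulation of (row index, string) pairs.
cross_tab <- function(y) {
  unclass(table(rep(seq_along(y), lengths(y)), unlist(y)))
}

t_loop <- system.time(res_loop <- per_string(varsLists))
t_tab  <- system.time(res_tab  <- cross_tab(varsLists))
```

Printing `t_loop` and `t_tab` shows the timings on your own machine; both functions should return the same 0/1 matrix for this data.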

or `data.frame(x = df$x, t(sapply(df$y, function(l){table(factor(l, levels = LETTERS[1:5]))})))` — alistaire, Jan 16 '17 at 09:31
@alistaire maybe `levels = unique(unlist(df$y))` instead of `LETTERS[1:5]` ? — Sotos, Jan 16 '17 at 09:46
@Sotos I had that, but figured this is less computation. The best route is to store that as a separate variable, but that would require a second line... — alistaire, Jan 16 '17 at 09:49
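
Spelled out over several lines, the one-liner from the comments above could look like the sketch below, with the levels stored in a separate variable as alistaire suggests. It assumes the example `df` from the question; the names `lev`, `counts` and `dummies` are illustrative only.

```r
df <- data.frame(x = 1:5)
df$y <- list("A", c("A", "B"), "C", c("B", "D", "C"), "E")

# Collect every string that occurs anywhere in df$y (computed once).
lev <- unique(unlist(df$y))

# table() on a factor with fixed levels returns one count per level, zeros included.
counts <- t(sapply(df$y, function(l) table(factor(l, levels = lev))))

dummies <- data.frame(x = df$x, counts)
```

Note that these are counts rather than strict indicators: a string repeated within one list element would give a value above 1.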

Sotos · Answer 1 · 2017-01-17T07:44:30.880

7

Another idea,

library(dplyr)
library(tidyr)

df %>% 
 unnest(y) %>% 
 mutate(new = 1) %>% 
 spread(y, new, fill = 0) 

#  x A B C D E
#1 1 1 0 0 0 0
#2 2 1 1 0 0 0
#3 3 0 0 1 0 0
#4 4 0 1 1 1 0
#5 5 0 0 0 0 1

Further to the cases you mentioned in comments, we can use dcast from reshape2 as it is more flexible than spread,

df2 <- df %>% 
        unnest(y) %>% 
        group_by(x) %>% 
        filter(!duplicated(y)) %>% 
        ungroup()

reshape2::dcast(df2, x ~ y, value.var = 'y', length)

#  x A B C D E
#1 1 1 0 0 0 0
#2 2 1 1 0 0 0
#3 3 0 0 1 0 0
#4 4 0 1 1 1 0
#5 5 0 0 0 0 1

#or with df$x <- c(1, 1, 2, 2, 3)

#  x A B C D E
#1 1 1 1 0 0 0
#2 2 0 1 1 1 0
#3 3 0 0 0 0 1

#or with df$x <- rep(1,5)

#  x A B C D E
#1 1 1 1 1 1 1
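
If the goal is one output row per original row whatever `x` contains (as requested in the comments), a row index can serve as the key. A minimal sketch, assuming tidyr >= 1.1.0 (for `pivot_wider` with a scalar `values_fill`) and dplyr; the helper columns `row` and `new` are illustrative names, not part of the original answer.

```r
library(dplyr)
library(tidyr)

df <- data.frame(x = c(1, 1, 2, 2, 3))   # x deliberately contains repeated values
df$y <- list("A", c("A", "B"), "C", c("B", "D", "C"), "E")

res <- df %>%
  mutate(row = row_number()) %>%      # unique key per original row
  unnest(cols = y) %>%                # one line per (row, string) pair
  distinct(row, x, y) %>%             # guard against repeats within a list
  mutate(new = 1L) %>%
  pivot_wider(names_from = y, values_from = new, values_fill = 0L) %>%
  select(-row)
```

Because `row` identifies each original row, the result keeps all five rows even though `x` is not unique.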

edited Jan 17 '17 at 07:44

answered Jan 16 '17 at 09:21

Sotos

thanks, see what happens when df$x = rep(1, 5). "Error: Duplicate identifiers for rows (1, 2), (3, 5), (4, 7)" – Giora Simchoni Jan 16 '17 at 10:37
What would your expected result be in such case? something like `df %>% unnest(y) %>% group_by(x) %>% mutate(new = 1:n()) %>% spread(y, x, fill = 0)`? – Sotos Jan 16 '17 at 10:48
The same result keeping the original x column. This, on the original `df` gives "Error: Duplicate identifiers for rows (1, 2)". – Giora Simchoni Jan 16 '17 at 11:19
It works on the `df$x = rep(1, 5)` case. On the original `df$x = 1:5` case it gives "Error: Duplicate identifiers for rows (1, 2)". – Giora Simchoni Jan 16 '17 at 11:26
Try `mutate(new = 1:n())` before `group_by()` – Sotos Jan 16 '17 at 11:28
It doesn't work with `df$x = 1:5`, neither with `df$x = c(1,1,2,2,3)`. It shouldn't matter what `df$x` is. – Giora Simchoni Jan 17 '17 at 06:41
@GioraSimchoni can you check now? – Sotos Jan 17 '17 at 07:46

score 6 · Accepted Answer · answered Jan 16 '17 at 08:59

6

We can use mtabulate

library(qdapTools)
cbind(df[1], mtabulate(df$y))
#  x A B C D E
#1 1 1 0 0 0 0
#2 2 1 1 0 0 0
#3 3 0 0 1 0 0
#4 4 0 1 1 1 0
#5 5 0 0 0 0 1

answered Jan 16 '17 at 08:59

akrun

That's impressive and super fast (a few seconds for ~1M rows with ~350 unique values on my PC). Do you have an answer not requiring a whole new package? Thanks. – Giora Simchoni Jan 16 '17 at 09:19
@GioraSimchoni Looks like somebody else answered it without a package – akrun Jan 16 '17 at 11:00
@GioraSimchoni too; I guess a base alternative is `table(rep(df$x, lengths(df$y)), unlist(df$y))`? – alexis_laz Jan 16 '17 at 12:44
Doesn't work with `df$x = rep(1,5)` or `df$x = c(1,1,2,2,3)`. It shouldn't matter what `df$x` is. – Giora Simchoni Jan 17 '17 at 06:42
@GioraSimchoni I am not sure what you meant by doesn't work? It does give an output where the first column is just 1 (for `df$x = rep(1,5)`) – akrun Jan 17 '17 at 06:45
Sorry @akrun, I was referring to alexis_laz comment. – Giora Simchoni Jan 17 '17 at 07:23
@GioraSimchoni: (hadn't noticed the comment) I misunderstood what you wanted -- in that case, simply use `table(rep(seq_along(df$y), lengths(df$y)), unlist(df$y))` and `cbind` `df$x` as in akrun's answer. (@akrun sorry for the unnecessary notification) – alexis_laz Jan 22 '17 at 13:19
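
The base suggestion in the last comment can be written out as the following sketch. It assumes the example data with a non-unique `x`; the names `row_id`, `tab` and `res` are illustrative. Wrapping the row index in a factor with explicit levels is an addition here, so that a row with an empty list would still appear as an all-zero row.

```r
df <- data.frame(x = c(1, 1, 2, 2, 3))   # x deliberately contains repeated values
df$y <- list("A", c("A", "B"), "C", c("B", "D", "C"), "E")

# One row index per string, repeated by the length of each list element.
row_id <- factor(rep(seq_along(df$y), lengths(df$y)), levels = seq_along(df$y))

# Cross-tabulate row index against string: rows of counts, one column per string.
tab <- table(row_id, unlist(df$y))

# Bind the untouched x column to the tabulated columns.
res <- cbind(df["x"], as.data.frame.matrix(tab))
```

Since the key is the row position rather than `x`, the values in `x` do not affect the result.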

joel.wilson · Answer 3 · 2017-01-17T07:42:23.110

2

This involves no external packages:

# thanks to Sotos for suggesting to use `unique(unlist(df$y))` instead of `LETTERS[1:5]`
sapply(unique(unlist(df$y)), function(j) as.numeric(grepl(j, df$y)))
#     A B C D E
#[1,] 1 0 0 0 0
#[2,] 1 1 0 0 0
#[3,] 0 0 1 0 0
#[4,] 0 1 1 1 0
#[5,] 0 0 0 0 1
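
One caveat worth checking before using this on real data: `grepl()` treats each string as a regular expression and matches substrings of the deparsed list elements. The sketch below uses hypothetical data in which one string contains another, and contrasts it with an exact membership test via `%in%`; `by_grepl` and `by_in` are illustrative names.

```r
df <- data.frame(x = 1:3)
df$y <- list("A", c("AB", "C"), "B")   # hypothetical data: "AB" contains "A" and "B"

lev <- unique(unlist(df$y))

# Substring/regex matching: "A" and "B" are also flagged for the row holding "AB".
by_grepl <- sapply(lev, function(j) as.numeric(grepl(j, df$y)))

# Exact membership per list element avoids the false positives.
by_in <- sapply(lev, function(j) {
  as.numeric(vapply(df$y, function(el) j %in% el, logical(1)))
})
```

For the single-letter example in the question both versions agree; they differ only when strings overlap or contain regex metacharacters.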

edited Jan 17 '17 at 07:42

answered Jan 16 '17 at 09:29

joel.wilson

the `LETTERS` part is bad. You can do `unique(unlist(df$y))` instead – Sotos Jan 16 '17 at 09:41
Doesn't work with `df$x = rep(1,5)` or `df$x = c(1,1,2,2,3)`. It shouldn't matter what `df$x` is. – Giora Simchoni Jan 17 '17 at 06:42
@joel.wilson works great, I'll make some benchmarks to see how it compares with other "fancier" solutions, thanks. – Giora Simchoni Jan 17 '17 at 07:46
@GioraSimchoni how does it perform? – joel.wilson Jan 17 '17 at 17:27
