How to programmatically create binary columns based on a categorical variable in data.table?

Question

I have a big (12 million rows) data.table which looks like this:

library(data.table)
set.seed(123)
dt <- data.table(id = rep(1:3, each = 5), y = sample(letters[1:5], 15, replace = TRUE))
> dt
    id y
 1:  1 b
 2:  1 d
 3:  1 c
 4:  1 e
 5:  1 e
 6:  2 a
 7:  2 c
 8:  2 e
 9:  2 c
10:  2 c
11:  3 e
12:  3 c
13:  3 d
14:  3 c
15:  3 a

I want to create a new data.table containing my variable id (which will be the unique key of this new data.table) and 5 binary variables, one for each category of y, which take the value 1 if the id has that value for y and 0 otherwise.
The output data.table should look like this:

   id a b c d e
1:  1 0 1 1 1 1
2:  2 1 0 1 0 1
3:  3 1 0 1 1 1

I tried doing this in a loop but it's quite slow and also I don't know how to pass the binary variable names programmatically, as they depend on the variable I'm trying to "split".

EDIT: as @mtoto pointed out, a similar question has already been asked and answered here, but the solution is using the reshape2 package.
I was wondering if there's another (faster) way to do so by maybe using the := operator in data.table, as I have a massive dataset and I'm working quite a lot with this package.

EDIT2: benchmark of the functions in @Arun's post on my data (~12 million rows, ~3.5 million different ids and 490 different labels for the y variable, resulting in 490 dummy variables):

system.time(ans1 <- AnsFunction())   # 194s
system.time(ans2 <- dcastFunction()) # 55s
system.time(ans3 <- TableFunction()) # Takes forever and blocked my PC

I notice there are similar rows such as four and five, can you explain this data a little better? As I understand it `data[1][e]=1 if(2>0) else 0` but it just seems a little weird. — kpie, Jun 10 '16 at 07:25
Possible duplicate of [How to use cast or another function to create a binary table in R](http://stackoverflow.com/questions/11659128/how-to-use-cast-or-another-function-to-create-a-binary-table-in-r) — mtoto, Jun 10 '16 at 07:29
@kpie I edited the second `data.table`, it should be clearer now: `id` 1 has the distinct values `b,c,d,e` for `y`, but not `a`. This explains why its row in the second `data.table` has `1` everywhere except in the `a` column. @mtoto thanks for your answer, this would solve my problem, but with such massive data I was wondering if there was another way to do the same thing inside `data.table`, maybe with the `:=` operator. — hellter, Jun 10 '16 at 07:34
If you want to use `data.table`, you could go with `dcast()`: `dcast(dt, id ~ y,fun.aggregate = function(x) (length(x) > 0)+0)` — mtoto, Jun 10 '16 at 07:56
You might, also, consider having your 1/0 in a "matrix", probably sparse to have a chance of saving some memory -- `uy = unique(dt$y); m = matrix(0L, max(dt$id), length(uy), dimnames = list(NULL, uy)); m[cbind(dt$id, match(dt$y, uy))] = 1L` — alexis_laz, Jun 10 '16 at 13:49
@alexis_laz I will try your approach to see how it stacks up against the others as soon as I can gain access to the same machine I used for the other benchmarks. — hellter, Jun 10 '16 at 14:09

Arun · Accepted Answer · 2016-06-10T09:51:56.013

data.table has its own dcast implementation using data.table's internals and should be fast. Give this a try:

dcast(dt, id ~ y, fun.aggregate = function(x) 1L, fill=0L)
#    id a b c d e
# 1:  1 0 1 1 1 1
# 2:  2 1 0 1 0 1
# 3:  3 1 0 1 1 1
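
For anyone wondering why the aggregate is a constant, here is a small sketch, assuming a hand-typed copy of the question's 15 rows. It compares `length` (which counts repeats) with the constant `1L` used above. The object names `counts` and `binary` are illustrative and not from the original post.

```r
library(data.table)

# Hand-typed copy of the question's example, so the result does not depend on the RNG
dt <- data.table(
  id = rep(1:3, each = 5),
  y  = c("b", "d", "c", "e", "e",  "a", "c", "e", "c", "c",  "e", "c", "d", "c", "a")
)

# Aggregating with length counts how often each (id, y) pair occurs
counts <- dcast(dt, id ~ y, fun.aggregate = length, value.var = "y")

# A constant aggregate collapses any number of repeats to 1L;
# fill = 0L supplies the value for pairs that never occur
binary <- dcast(dt, id ~ y, fun.aggregate = function(x) 1L,
                fill = 0L, value.var = "y")

counts[id == 1, e]   # id 1 has "e" twice in the toy data
binary[id == 1, e]
```

The two tables should differ only in cells where a pair occurs more than once.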

Just thought of another way to handle this by preallocating and updating by reference (perhaps dcast's logic should be done like this to avoid intermediates).

ans = data.table(id = unique(dt$id))[, unique(dt$y) := 0L][]

All that's left is to fill existing combinations with 1L.

dt[, {set(ans, i=.GRP, j=unique(y), value=1L); NULL}, by=id]
ans
#    id b d c e a
# 1:  1 1 1 1 1 0
# 2:  2 0 0 1 1 1
# 3:  3 0 1 1 1 1
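
A minimal sketch of the two approaches side by side, assuming a made-up toy table whose ids are deliberately not 1, 2, 3. It is meant to show why the row index passed to `set()` is `.GRP` (the group counter) rather than the id itself. The names `wide_dcast` and `wide_set` are hypothetical.

```r
library(data.table)

# Toy data with non-contiguous ids (made up for illustration)
dt <- data.table(id = c(10L, 10L, 20L, 30L, 30L),
                 y  = c("b", "a", "b", "c", "c"))

# Approach 1: dcast with a constant aggregate
wide_dcast <- dcast(dt, id ~ y, fun.aggregate = function(x) 1L,
                    fill = 0L, value.var = "y")

# Approach 2: preallocate one row per id and one zero column per category ...
wide_set <- data.table(id = unique(dt$id))[, unique(dt$y) := 0L]
# ... then flip cells to 1L by reference; .GRP is the position of the current
# group, which lines up with the row order of unique(dt$id)
dt[, {set(wide_set, i = .GRP, j = unique(y), value = 1L); NULL}, by = id]

# Column order differs (order of appearance vs. sorted), so align before comparing
setcolorder(wide_set, names(wide_dcast))
setkey(wide_dcast, NULL)   # dcast keys its result; drop the key for the comparison
identical(wide_set, wide_dcast)
```

Here the ids happen to appear in increasing order. If they did not, `wide_set` would also need `setorder(wide_set, id)` before the comparison, as in the benchmark below.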

Okay, I've gone ahead and benchmarked on data close to OP's dimensions, with ~10 million rows and 10 columns.

require(data.table)
set.seed(45L)
y = apply(matrix(sample(letters, 10L*20L, TRUE), ncol=20L), 1L, paste, collapse="")
dt = data.table(id=sample(1e5,1e7,TRUE), y=sample(y,1e7,TRUE))

system.time(ans1 <- AnsFunction())   # 2.3s
system.time(ans2 <- dcastFunction()) # 2.2s
system.time(ans3 <- TableFunction()) # 6.2s

setcolorder(ans1, names(ans2))
setcolorder(ans3, names(ans2))
setorder(ans1, id)
setkey(ans2, NULL)
setorder(ans3, id)

identical(ans1, ans2) # TRUE
identical(ans1, ans3) # TRUE

where,

AnsFunction <- function() {
    ans = data.table(id = unique(dt$id))[, unique(dt$y) := 0L][]
    dt[, {set(ans, i=.GRP, j=unique(y), value=1L); NULL}, by=id]
    ans
    # reorder columns outside
}

dcastFunction <- function() {
    # no need to load reshape2. data.table has its own dcast as well
    # no need for setDT
    df <- dcast(dt, id ~ y, fun.aggregate = function(x) 1L, fill=0L,value.var = "y")
}

TableFunction <- function() {
    # need to return integer results for identical results
    # fixed 1 -> 1L; as.numeric -> as.integer
    df <- as.data.frame.matrix(table(dt$id, dt$y))
    df[df > 1L] <- 1L
    df <- cbind(id = as.integer(row.names(df)), df)
    setDT(df)
}

edited Jun 10 '16 at 09:51

answered Jun 10 '16 at 07:58


Your approach looks like exactly what I was looking for. I get the sense, but when I run the code of your second approach on `dt` it doesn't work and I get `Empty data.table (0 rows) of 1 col: id` – hellter Jun 10 '16 at 08:02
@hellter, could you edit your Q to show a benchmark of the run time between the two methods posted above on your original data? – Arun Jun 10 '16 at 08:05
@Tobias Dekker just provided a benchmark in his answer – hellter Jun 10 '16 at 08:57
@hellter, I was specifically interested in data of *your* dimensions. It's as simple as wrapping both the functions I've provided with `system.time()`. Not sure why that's an issue. Anyhow, I've gone ahead and added a benchmark on data close to your dimensions. – Arun Jun 10 '16 at 09:48
That's not an issue at all, I just couldn't do it before and I thought @Tobias' benchmark was enough. I just added the benchmark in the question. – hellter Jun 10 '16 at 10:34
Awesome, thanks. I plan to work on improving `dcast` for next release. Definitely helps in knowing how not to go about improving `dcast()`. – Arun Jun 10 '16 at 10:44
I think that the slowest part in `TableFunction` is `table(dt$id, dt$y)`. In fact working on this dataset I noticed that, in general, `table()` is **extremely** slow, maybe because I have so many `id`s. For this reason, in general I tend to use `data.table`'s `.N` operator in the `j` argument while subsetting `by=id`. Maybe changing that bit inside `TableFunction` would improve performance (?), but I don't see how to obtain the same output of the first line of `TableFunction` without `table()` – hellter Jun 10 '16 at 10:52
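
One possible way to avoid `table()` along the lines of the comment above, sketched on made-up toy data: count pairs with `.N`, reshape the counts, then cap them at 1. This is only an assumption about what such a replacement could look like. It was not part of the original benchmark, and the names `counts`, `wide` and `cols` are hypothetical.

```r
library(data.table)

# Toy data (made up); the pair (1, "b") is repeated on purpose
dt <- data.table(id = c(1L, 1L, 1L, 2L, 2L, 3L),
                 y  = c("b", "b", "d", "a", "c", "e"))

# One row per distinct (id, y) pair, with its frequency in N
counts <- dt[, .N, by = .(id, y)]

# Pairs are unique now, so dcast needs no aggregate; absent pairs become 0L
wide <- dcast(counts, id ~ y, value.var = "N", fill = 0L)

# Turn counts into 0/1 indicators by reference
cols <- setdiff(names(wide), "id")
wide[, (cols) := lapply(.SD, function(v) as.integer(v > 0L)), .SDcols = cols]
wide
```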
Mmh okay, I have to admit that dcast is a better option for large data sets. For some reason it scales better. I added the benchmark on the larger set to my post. Not sure if it is possible to improve the table function. – Tobias Dekker Jun 10 '16 at 11:08

Tobias Dekker · Answer 2 · 2016-06-10T11:15:12.057

For small data sets the table function seems to be more efficient, but on large datasets dcast seems to be the most efficient and convenient option.

TableFunction <- function(){
    df <- as.data.frame.matrix(table(dt$id, dt$y))
    df[df > 1] <- 1
    df <- cbind(id = as.numeric(row.names(df)), df)
    setDT(df)
}


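AnsFunction <- function(){
    ans = data.table(id = unique(dt$id))[, unique(dt$y) := 0L][]
    # .GRP matches the row order of unique(dt$id); i=id is only correct when ids are exactly 1..N in order
    dt[, {set(ans, i=.GRP, j=unique(y), value=1L); NULL}, by=id]
    ans  # return the filled table, not the empty result of the by= call
}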

dcastFunction <- function(){
    df <- dcast.data.table(dt, id ~ y, fun.aggregate = function(x) 1L, fill=0L, value.var = "y")
}

library(data.table)
library(microbenchmark)
set.seed(123)
N = 10000
dt <- data.table(id=rep(1:N, each=5), y=sample(letters[1:5], N*5, replace = TRUE))


microbenchmark(
    "dcast" = dcastFunction(),
    "Table" = TableFunction(),
    "Ans"   = AnsFunction()
    )


 Unit: milliseconds
  expr       min        lq      mean    median        uq       max neval cld
 dcast  42.48367  45.39793  47.56898  46.83755  49.33388  60.72327   100  b 
 Table  28.32704  28.74579  29.14043  29.00010  29.23320  35.16723   100 a  
   Ans 120.80609 123.95895 127.35880 126.85018 130.12491 156.53289   100   c

> all(test1 == test2)
[1] TRUE
> all(test1 == test3)
[1] TRUE

y = apply(matrix(sample(letters, 10L*20L, TRUE), ncol=20L), 1L, paste, collapse="")
dt = data.table(id=sample(1e5,1e7,TRUE), y=sample(y,1e7,TRUE))

microbenchmark(
    "dcast" = dcastFunction(),
    "Table" = TableFunction(),
    "Ans"   = AnsFunction()
)
Unit: seconds
  expr      min       lq     mean   median       uq      max neval cld
 dcast 1.985969 2.064964 2.189764 2.216138 2.266959 2.643231   100 a  
 Table 5.022388 5.403263 5.605012 5.580228 5.830414 6.318729   100   c
   Ans 2.234636 2.414224 2.586727 2.599156 2.645717 2.982311   100  b

I've added a benchmark on larger data to my post. I'm not sure if you're running data.table's dcast or reshape2's, since you use `setDT()`, which won't be necessary if you use data.table's. And reshape2::dcast is *slow*. — Arun, Jun 10 '16 at 09:47
Instead of `table` + `[<-.data.frame`, an alternative is `uy = unique(dt$y); m = matrix(0L, max(dt$id), length(uy), dimnames = list(NULL, uy)); m[cbind(dt$id, match(dt$y, uy))] = 1L` — alexis_laz, Jun 10 '16 at 13:51
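
The matrix idea from the comment above can be sketched in base R as follows, assuming ids are positive integers, since they are used directly as row indices. The toy vectors are made up. The sketch builds a dense matrix and leaves out the sparse variant the comment in the question thread hints at.

```r
# Toy data (made up); ids must be positive integers here
id <- c(1L, 1L, 2L, 3L, 3L, 3L)
y  <- c("b", "d", "a", "e", "c", "e")

uy <- sort(unique(y))                       # one column per category
m  <- matrix(0L, nrow = max(id), ncol = length(uy),
             dimnames = list(NULL, uy))

# A two-column index matrix addresses (row, column) cells directly;
# repeated pairs just write 1L to the same cell again
m[cbind(id, match(y, uy))] <- 1L
m
```

Note that the result is a matrix with one row per value from 1 to `max(id)`, so ids that never occur still get an all-zero row.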

score -1 · Answer 3 · answered Jun 10 '16 at 07:42

If you already know the range of the rows (as in, you know that there are no more than 3 rows in your example) and you know the columns, you can start with an array of zeros and use the apply function to update values in that secondary table.

My R is a little rusty, so I'm hesitant to write it up right now, but I think that's the way to do it. Additionally, the function you pass to apply could contain conditions to add rows and columns as needed.

If you are looking for something a little more plug and play, I found this little blurb:

There are two sets of methods that are explained below:

gather() and spread() from the tidyr package. This is a newer interface to the reshape2 package.

melt() and dcast() from the reshape2 package.

There are a number of other methods which aren’t covered here, since they are not as easy to use:

The reshape() function, which is confusingly not part of the reshape2 package; it is part of the base install of R.

stack() and unstack()

from here :: http://www.cookbook-r.com/Manipulating_data/Converting_data_between_wide_and_long_format/

If I were better versed in R I would tell you how those various methods handle collisions going from long lists to wide ones. I was googling "Make a table from flat data in R" to come up with this...

Also check out this. It's the same website as above, with my personal comment wrapper :p
