I have a big (12 million rows) data.table
which looks like this:
library(data.table)
set.seed(123)
dt <- data.table(id=rep(1:3, each=5),y=sample(letters[1:5],15,replace = T))
> dt
id y
1: 1 b
2: 1 d
3: 1 c
4: 1 e
5: 1 e
6: 2 a
7: 2 c
8: 2 e
9: 2 c
10: 2 c
11: 3 e
12: 3 c
13: 3 d
14: 3 c
15: 3 a
I want to create a new data.table
containing my variable id
(which will be the unique key of this new data.table
) and 5 other binary variables each one corresponding to each category of y
which take value 1
if the id has that value for y
, 0
otherwise.
The output data.table
should look like this:
id a b c d e
1: 1 0 1 1 1 1
2: 2 1 0 1 0 1
3: 3 1 0 1 1 1
I tried doing this in a loop but it's quite slow and also I don't know how to pass the binary variable names programmatically, as they depend on the variable I'm trying to "split".
EDIT: as @mtoto pointed out, a similar question has already been asked and answered here, but the solution is using the reshape2
package.
I was wondering if there's another (faster) way to do so by maybe using the :=
operator in data.table, as I have a massive dataset and I'm working quite a lot with this package.
EDIT2: benchmark of the functions in @Arun's post on my data (~12 million rows, ~3,5 million different ids and 490 different labels for the y
variable (resulting in 490 dummy variables)):
system.time(ans1 <- AnsFunction()) # 194s
system.time(ans2 <- dcastFunction()) # 55s
system.time(ans3 <- TableFunction()) # Takes forever and blocked my PC