Speed up this loop to create dummy columns with data.table and set in R

Question

I have a data table and I want to create a new column for each unique day, and then assign a 1 in each row where the day matches the column name

I have done this using a for loop but I was wondering if there was any way to optimise it using data.table and set?

Here is an example

dt <- data.table(Week_Day = c("Monday", "Tuesday", "Wednesday",
                          "Thursday", "Friday", "Saturday", "Sunday"))

Day <- unique(dt$Week_Day)
for (i in 1:length(Day)) {
    if (Day[i] != "Sunday") {
        dt[, Day[i] := ifelse(Week_Day == Day[i], 1, 0)]
    }
}

my table is 298k rows and although it doesn't take long to execute (below), its part of a long script and I have quite a few inefficient loops so I am trying to get the overall run time down.

Time to run:

user  system elapsed
0.99    0.06    1.05

Thanks in advance.

Do look at `model.matrix()` as well. Perhaps by usage you want something like that. — abhiieor, Jul 22 '16 at 13:26

talat · Accepted Answer · 2016-07-22T12:42:04.770

7

Here's a different approach that, performs better - on my machine - than the original approach in the question

1) Get unique days except Sunday

Day <- setdiff(dt$Week_Day, "Sunday")

2) Initialize new columns with 0:

dt[, (Day) := 0L]

3) Update with 1s by reference in a loop:

for(x in Day) {
  set(dt, i = which(dt[["Week_Day"]] == x), j = x, value = 1L)
}

Simple performance comparison:

dt1 <- data.table(Week_Day = sample(c("Monday", "Tuesday", "Wednesday",
                              "Thursday", "Friday", "Saturday", "Sunday"), 3e5, TRUE))

dt2 <- copy(dt1)


system.time({
  Day <- setdiff(unique(dt$Week_Day), "Sunday")
  dt1[, (Day) := 0L]
  for(x in Day) {
    set(dt1, i = which(dt1[["Week_Day"]] == x), j = x, value = 1L)
  }
})
#       User      System verstrichen 
#      0.029       0.003       0.032 

system.time({
  Day <- unique(dt$Week_Day)
  for (i in 1:length(Day)) {
    if (Day[i] != "Sunday") {
      dt2[, Day[i] := ifelse(Week_Day == Day[i], 1L, 0L)]
    }
  }
})

#       User      System verstrichen 
#      0.138       0.070       0.210 


all.equal(dt1, dt2)
#[1] TRUE

edited Jul 22 '16 at 12:42

answered Jul 22 '16 at 12:35

talat

68,970
21
126
157

1

Amazing!! On my data of 300k rows the execution time was: `user system elapsed` `0.02 0.00 0.01` This leads me onto posting another question about a nested loop that is taking 'too' long. – MidnightDataGeek Jul 22 '16 at 12:47
1

probably possible speed up could be from using index on `Week_Day` as it seems we subset on it multiple times. – jangorecki Jul 24 '16 at 10:58
@docendo I've used your solution to add seasonal variables too but I omitted the brackets around 'Day' when adding the new variable by mistake `dt1[, (Day) := 0L]` Doing this meant the 'Day' variable remained and I had NA's where I would have expected a zero. Do you know what adding the brackets does? Just so I can understand exactly what is happening. Thanks – MidnightDataGeek Jul 25 '16 at 08:55
@MidnightDataGeek By adding brackets, the expression inside is evaluated. Otherwise, the column is just called 'Day'. You can have a look at the vignettes for data.table on github for more information – talat Jul 25 '16 at 09:12

score 4 · Answer 2 · edited Jul 22 '16 at 12:41

4

Here is one attempt at a speed up:

Day <- unique(dt$Week_Day)
setkey(dt, Week_Day)

# create columns of 0s
dt[, (Day) := 0L]

for (i in seq_along(head(Day, -1))) {
     dt[Day[i], Day[i] := 1L]
}

This implements a couple of the data.table speed ups including binary search in the second chain and the elimination of ifelse with replacement by reference.

edited Jul 22 '16 at 12:41

Arun

116,683
26
284
387

answered Jul 22 '16 at 12:33

lmo

37,904
9
56
69

Thanks, @arun. I didn't know that it was possible to "vectorize" the creation of variables with `:=` in such a manner. Really cool technique. – lmo Jul 22 '16 at 12:42
1

Sure, RHS is just recycled first to the length of LHS, and then to number of rows.. (for convenience). – Arun Jul 22 '16 at 12:43
@Arun thanks! This was also blazingly fast on my machine. `user system elapsed` `0.03 0.00 0.03` I now know my script must be very inefficient because I have quite a few nested loops. I shall be posting another question at some point. Thanks for your help :) – MidnightDataGeek Jul 22 '16 at 12:55

Speed up this loop to create dummy columns with data.table and set in R

2 Answers2

Linked