5

I have a data table and I want to create a new column for each unique day, and then assign a 1 in each row where the day matches the column name

I have done this using a for loop but I was wondering if there was any way to optimise it using data.table and set?

Here is an example

dt <- data.table(Week_Day = c("Monday", "Tuesday", "Wednesday",
                          "Thursday", "Friday", "Saturday", "Sunday"))

Day <- unique(dt$Week_Day)
for (i in 1:length(Day)) {
    if (Day[i] != "Sunday") {
        dt[, Day[i] := ifelse(Week_Day == Day[i], 1, 0)]
    }
}

my table is 298k rows and although it doesn't take long to execute (below), its part of a long script and I have quite a few inefficient loops so I am trying to get the overall run time down.

Time to run:

user  system elapsed
0.99    0.06    1.05

Thanks in advance.

smci
  • 32,567
  • 20
  • 113
  • 146
MidnightDataGeek
  • 938
  • 12
  • 21

2 Answers2

7

Here's a different approach that, performs better - on my machine - than the original approach in the question

1) Get unique days except Sunday

Day <- setdiff(dt$Week_Day, "Sunday")

2) Initialize new columns with 0:

dt[, (Day) := 0L]

3) Update with 1s by reference in a loop:

for(x in Day) {
  set(dt, i = which(dt[["Week_Day"]] == x), j = x, value = 1L)
}

Simple performance comparison:

dt1 <- data.table(Week_Day = sample(c("Monday", "Tuesday", "Wednesday",
                              "Thursday", "Friday", "Saturday", "Sunday"), 3e5, TRUE))

dt2 <- copy(dt1)


system.time({
  Day <- setdiff(unique(dt$Week_Day), "Sunday")
  dt1[, (Day) := 0L]
  for(x in Day) {
    set(dt1, i = which(dt1[["Week_Day"]] == x), j = x, value = 1L)
  }
})
#       User      System verstrichen 
#      0.029       0.003       0.032 

system.time({
  Day <- unique(dt$Week_Day)
  for (i in 1:length(Day)) {
    if (Day[i] != "Sunday") {
      dt2[, Day[i] := ifelse(Week_Day == Day[i], 1L, 0L)]
    }
  }
})

#       User      System verstrichen 
#      0.138       0.070       0.210 


all.equal(dt1, dt2)
#[1] TRUE
talat
  • 68,970
  • 21
  • 126
  • 157
  • 1
    Amazing!! On my data of 300k rows the execution time was: `user system elapsed` `0.02 0.00 0.01` This leads me onto posting another question about a nested loop that is taking 'too' long. – MidnightDataGeek Jul 22 '16 at 12:47
  • 1
    probably possible speed up could be from using index on `Week_Day` as it seems we subset on it multiple times. – jangorecki Jul 24 '16 at 10:58
  • @docendo I've used your solution to add seasonal variables too but I omitted the brackets around 'Day' when adding the new variable by mistake `dt1[, (Day) := 0L]` Doing this meant the 'Day' variable remained and I had NA's where I would have expected a zero. Do you know what adding the brackets does? Just so I can understand exactly what is happening. Thanks – MidnightDataGeek Jul 25 '16 at 08:55
  • @MidnightDataGeek By adding brackets, the expression inside is evaluated. Otherwise, the column is just called 'Day'. You can have a look at the vignettes for data.table on github for more information – talat Jul 25 '16 at 09:12
4

Here is one attempt at a speed up:

Day <- unique(dt$Week_Day)
setkey(dt, Week_Day)

# create columns of 0s
dt[, (Day) := 0L]

for (i in seq_along(head(Day, -1))) {
     dt[Day[i], Day[i] := 1L]
}

This implements a couple of the data.table speed ups including binary search in the second chain and the elimination of ifelse with replacement by reference.

Arun
  • 116,683
  • 26
  • 284
  • 387
lmo
  • 37,904
  • 9
  • 56
  • 69
  • Thanks, @arun. I didn't know that it was possible to "vectorize" the creation of variables with `:=` in such a manner. Really cool technique. – lmo Jul 22 '16 at 12:42
  • 1
    Sure, RHS is just recycled first to the length of LHS, and then to number of rows.. (for convenience). – Arun Jul 22 '16 at 12:43
  • @Arun thanks! This was also blazingly fast on my machine. `user system elapsed` `0.03 0.00 0.03` I now know my script must be very inefficient because I have quite a few nested loops. I shall be posting another question at some point. Thanks for your help :) – MidnightDataGeek Jul 22 '16 at 12:55