
I have two tables. table1 looks like this:

  date       hour     data
2010-05-01     3        5
2010-05-02     7        7
2010-05-02     10       8
2010-07-03     18       3
2011-12-09     22       1
2012-05-01     3        0

This is stored as a data.table with the key set on date and hour. I have another table, my outages table, which looks like this:

 resource        date_out                date_back
   joey       2010-04-30 4:00:00      2010-05-02 8:30:00
   billy      2009-04-20 7:00:00      2009-02-02 5:30:00
   bob        2011-11-15 12:20:00     2010-12-09 23:00:00
   joey       2012-04-28 1:00:00      2012-05-02 17:00:00

I want to add a column to table1 for each resource in the outages table. The values in those columns should be 0 whenever there isn't an outage and 1 whenever there is.

The result for this example should be:

  date       hour     data     joey      billy      bob
2010-05-01     3        5       1          0         0        
2010-05-02     7        7       1          0         0 
2010-05-02     10       8       0          0         0 
2010-07-03     18       3       0          0         0 
2011-12-09     22       1       0          0         1
2012-05-01     3        0       1          0         0 

In actuality, my table1 has about 2,500 rows and my outages table has about 19,000. The only way I could think to do this is to loop through each row of the outages table and insert 1s into table1 in the correct places. My code relies on table1 being in order, so at least it doesn't have to scan the whole table for every row of outages. Even so, the code below takes over 4 hours on my data.

for (out in seq_len(nrow(outages))) {
  a <- as.character(outages[out]$resource)
  # if the resource column doesn't exist yet, create it filled with 0s
  if (!(a %in% colnames(table1))) {
    table1$new <- 0
    setnames(table1, "new", a)
  }
  # start searching from the middle of table1
  midpoint <- round(length(table1$date)/2, 0)
  if (table1$date[midpoint] + table1$hour[midpoint]*60*60 >= outages[out]$date_out &&
      table1$date[midpoint] + table1$hour[midpoint]*60*60 <= outages[out]$date_back) {
    # the midpoint falls inside the outage: walk outwards in both directions
    while (table1$date[midpoint] + table1$hour[midpoint]*60*60 >= outages[out]$date_out &&
           midpoint >= 1 && midpoint <= length(table1$date)) {
      table1[midpoint, (a) := 1]
      midpoint <- midpoint - 1
    }
    midpoint <- round(length(table1$date)/2, 0)
    while (table1$date[midpoint] + table1$hour[midpoint]*60*60 <= outages[out]$date_back &&
           midpoint >= 1 && midpoint <= length(table1$date)) {
      table1[midpoint, (a) := 1]
      midpoint <- midpoint + 1
    }
  } else {
    # the midpoint is after the outage: walk backwards until we reach it
    if (table1$date[midpoint] + table1$hour[midpoint]*60*60 > outages[out]$date_back) {
      while (table1$date[midpoint] + table1$hour[midpoint]*60*60 > outages[out]$date_back &&
             midpoint >= 1 && midpoint <= length(table1$date)) {
        midpoint <- midpoint - 1
      }
      while (table1$date[midpoint] + table1$hour[midpoint]*60*60 >= outages[out]$date_out &&
             midpoint >= 1 && midpoint <= length(table1$date)) {
        table1[midpoint, (a) := 1]
        midpoint <- midpoint - 1
      }
    }
    # the midpoint is before the outage: walk forwards until we reach it
    midpoint <- round(length(table1$date)/2, 0)
    if (table1$date[midpoint] + table1$hour[midpoint]*60*60 < outages[out]$date_out) {
      while (table1$date[midpoint] + table1$hour[midpoint]*60*60 < outages[out]$date_out &&
             midpoint >= 1 && midpoint <= length(table1$date)) {
        midpoint <- midpoint + 1
      }
      while (table1$date[midpoint] + table1$hour[midpoint]*60*60 <= outages[out]$date_back &&
             midpoint >= 1 && midpoint <= length(table1$date)) {
        table1[midpoint, (a) := 1]
        midpoint <- midpoint + 1
      }
    }
  }
  # drop the column again if this resource never overlaps table1
  if (sum(table1[[a]]) == 0) {
    table1[, (a) := NULL]
  }
}

To quote everybody's favorite infomercial line: "There's got to be a better way."


1 Answer


Here's a way of achieving what you want. It assumes table1's time precision is 1 hour. It can be modified to an arbitrary precision, but because it constructs the full sequence of possible times in each date_out-date_back range, it performs much better for larger time steps. Note that I used slightly different tables from the OP's to illustrate overlapping intervals and to correct some mistakes in the OP.

library(data.table)

table1 = data.table(date = c("2010-05-01", "2010-05-02", "2010-05-02",
                             "2010-07-03", "2011-12-09", "2012-05-01"),
                    hour = c(3, 7, 10, 18, 22, 3),
                    data = c(5, 7, 8, 3, 1, 0))
outages = data.table(resource  = c("joey", "bob", "billy", "bob", "joey"),
                     date_out  = c("2010-04-30 4:00:00", "2010-04-30 4:00:00",
                                   "2009-04-20 7:00:00", "2011-11-15 12:20:00",
                                   "2012-04-28 1:00:00"),
                     date_back = c("2010-05-02 8:30:00", "2010-05-02 8:30:00",
                                   "2009-06-02 5:30:00", "2011-12-09 23:00:00",
                                   "2012-05-02 17:00:00"))

# round up date_out and round down date_back
# and create a sequence in-between spaced by 1 hour
outages[, list(datetime = seq(as.POSIXct(round(as.POSIXct(date_out) + 30*60-1, "hours")),
                              as.POSIXct(round(as.POSIXct(date_back) - 30*60, "hours")),
                              60*60)),
          by = list(resource, date_out)] -> outages.expanded
setkey(outages.expanded, datetime)
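
As a small sanity check of my own (not part of the original answer), the rounding should make joey's first outage expand to hourly timestamps from 2010-04-30 04:00:00 through 2010-05-02 08:00:00 inclusive:

# inspect the expanded datetime range for joey's first outage
outages.expanded[resource == "joey" & date_out == "2010-04-30 4:00:00", range(datetime)]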

# merge with the original table, then run "table" to get the frequencies/occurrences
# and cbind back with the original table
cbind(table1, unclass(table(
                outages.expanded[table1[, list(datetime=as.POSIXct(paste0(date, " ", hour, ":00:00")))],
                                 resource])))

#         date hour data bob joey
#1: 2010-05-01    3    5   1    1
#2: 2010-05-02    7    7   1    1
#3: 2010-05-02   10    8   0    0
#4: 2010-07-03   18    3   0    0
#5: 2011-12-09   22    1   1    0
#6: 2012-05-01    3    0   0    1
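
A note of mine rather than the original answer's: the `cbind`/`table` step relies on the old "by-without-by" behaviour of `X[Y, j]`. On data.table 1.9.4 and later that implicit grouping is no longer the default, so I believe you would need an explicit `by = .EACHI` to keep the per-row grouping; an untested adjustment would be:

cbind(table1, unclass(table(
                outages.expanded[table1[, list(datetime = as.POSIXct(paste0(date, " ", hour, ":00:00")))],
                                 resource, by = .EACHI])))
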
  • Ok, this looks like it will meet my needs perfectly. I successfully reproduced this with just the made-up data. However, when I try to do this with my real data I get an error on the line that produces outages.expanded. The error is `Error in seq.POSIXt(as.POSIXct(round(as.POSIXct(date_out) + 30 * 60 - 1, : 'to' must be of length 1`. I'm not sure what I should be looking for to fix this. Any pointers on that error? – Dean MacGregor May 07 '13 at 20:55
  • @DeanMacGregor you have duplicate (`resource`,`date_out`) pairs. You can either remove the duplicates, or if you want to keep them, do the 'by' by row names instead `by=row.names(outages)` – eddi May 07 '13 at 21:02
  • @DeanMacGregor after some thought - you'd need to modify it a bit to get it to work with row names, I recommend getting rid of duplicates instead, as they shouldn't be there anyway from your data description – eddi May 07 '13 at 21:05
  • the modification mentioned above is to add `resource` to the list in `outages`: `outages[, list(resource=resource,datetime=seq...), by = row.names(outages)]` – eddi May 07 '13 at 21:22
  • I just did `setkey(table1, resource, date_out)` `table1<-unique(table1)` which fixed that. Now that I've got your method instead of mine I can do this in less than 1 minute instead of hours, so that's awesome...Thanks. – Dean MacGregor May 07 '13 at 21:56
  • Hi. This is better done with `roll=TRUE`, iiuc. Set a key on outages$date_out. roll join from table1 to it. Keep the join results where the table1.date < outages.date_back, too, and done. It should be sub 1 second for this on this size. For further examples of roll joins try searching for '[data.table] "roll"'. – Matt Dowle Jun 10 '13 at 08:17
  • +1 @MatthewDowle - I was too lazy to replace this answer, but here's a very similar question where I used the `roll` argument instead of the above: http://stackoverflow.com/questions/16649885/replacing-nested-loop-in-r/16652700#16652700 – eddi Jun 10 '13 at 15:29
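
To flesh out the rolling-join approach Matt Dowle describes above, here is a minimal sketch of my own (it is not part of the original answer). It reuses the `table1` and `outages` objects defined in the answer; keying on `(resource, date_out)` and looping over the resources is my addition, so that overlapping outages from different resources each get their own 0/1 column.

library(data.table)

# build a proper datetime on table1 and convert the outage bounds to POSIXct
table1[, datetime := as.POSIXct(paste0(date, " ", hour, ":00:00"))]
outages[, `:=`(date_out = as.POSIXct(date_out), date_back = as.POSIXct(date_back))]

# keyed on (resource, date_out), roll = TRUE means "the most recent outage
# start at or before this datetime" for the given resource
setkey(outages, resource, date_out)

for (r in unique(outages$resource)) {
  # rolling join: one row per table1 datetime, carrying the date_back of the
  # last outage of resource r that started at or before it (NA if none)
  res <- outages[J(r, table1$datetime), roll = TRUE]
  # flag the rows where that outage hasn't ended yet
  table1[, (r) := as.integer(!is.na(res$date_back) & datetime <= res$date_back)]
}

Unlike the `table()` approach above, this also keeps an all-zero column for resources such as billy whose outages never overlap table1, which is what the question's expected output shows.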