12

I came across the following problem today and I am wondering if there is a better way to accomplish what I am trying to do.

Let's suppose I have the following data.table (just an hourly timestamp):

library(data.table)
tdt <- data.table(Timestamp = seq(as.POSIXct("1980-01-01 00:00:00"), as.POSIXct("2015-01-01 00:00:00"), '1 hour'))

> tdt
                  Timestamp
     1: 1980-01-01 00:00:00
     2: 1980-01-01 01:00:00
     3: 1980-01-01 02:00:00
     4: 1980-01-01 03:00:00
     5: 1980-01-01 04:00:00
    ---                    
306813: 2014-12-31 20:00:00
306814: 2014-12-31 21:00:00
306815: 2014-12-31 22:00:00
306816: 2014-12-31 23:00:00
306817: 2015-01-01 00:00:00

My goal is to change the minutes of the timestamp to, say, 10 minutes.

I know I can use:

library(lubridate)
minute(tdt$Timestamp) <- 10

but this does not utilize the super fast speed of data table (which I need). On my laptop this took:

> system.time(minute(tdt$Timestamp) <- 10)
   user  system elapsed 
  11.29    0.16   11.45 

So, my question is: Can we somehow use a replacement function in the data table syntax so that it will do what I want using data.table 's speed? If the answer is no, any other data.table solution to do this fast, would be acceptable.

If you wonder one of the things I tried is:

tdt[, Timestamp2 := minute(Timestamp) <- 10]

which does not work.

Expected Output (but with data table syntax):

> tdt
                  Timestamp
     1: 1980-01-01 00:10:00
     2: 1980-01-01 01:10:00
     3: 1980-01-01 02:10:00
     4: 1980-01-01 03:10:00
     5: 1980-01-01 04:10:00
    ---                    
306813: 2014-12-31 20:10:00
306814: 2014-12-31 21:10:00
306815: 2014-12-31 22:10:00
306816: 2014-12-31 23:10:00
306817: 2015-01-01 00:10:00
Community
  • 1
  • 1
LyzandeR
  • 37,047
  • 12
  • 77
  • 87
  • 1
    I *think* this may point you in the right direction: http://stackoverflow.com/questions/10449366/levels-what-sorcery-is-this . But at least as of last year `data.table` didn't support date/time formats well. May have improved by now. – Ari B. Friedman Jul 13 '15 at 19:54
  • 6
    How about just `tdt[, Timestamp := Timestamp + 600L]`? – David Arenburg Jul 13 '15 at 19:58
  • @AriB.Friedman Thanks for this link. I have seen this question and I am aware of how a replacement function works, but I was still not able to use it like that somehow... – LyzandeR Jul 13 '15 at 20:04
  • @DavidArenburg This is great Dave thanks so much. I will use this solution to be honest tomorrow. Just out of curiosity I want to see whether there is a way to use a generic replacement function in conjunction with the `data.table` syntax. Feel free to post it as an answer. I will accept if no one else provides a solution for a replecement function + data.table syntax. – LyzandeR Jul 13 '15 at 20:06
  • Have you benchmark-ed on your real data or you will be able only tomorrow? – David Arenburg Jul 13 '15 at 20:09
  • @DavidArenburg I am using a much more complex function on my real data set (which has about 3m rows). The whole function takes ages (more than 2-3 minutes). Your suggestion takes half a sec on my sample data so I assume it will be much faster. – LyzandeR Jul 13 '15 at 20:13

3 Answers3

11

A POSIXct object is just a double with some attributes

storage.mode(as.POSIXct("1980-01-01 00:00:00"))
## [1] "double"

So in order to manipulate it efficiently you can just treat it as one, for instance

tdt[, Timestamp := Timestamp + 600L]

Will add 600 seconds (10 minutes) to each row by reference


Some benchmarks

tdt <- data.table(Timestamp = seq(as.POSIXct("1600-01-01 00:00:00"), 
                                  as.POSIXct("2015-01-01 00:00:00"), 
                                  '1 hour'))
system.time(minute(tdt$Timestamp) <- 10)
# user  system elapsed 
# 124.86    1.95  127.68 
system.time(set(tdt, j = 1L, value = `minute<-`(tdt$Timestamp, 10)))
# user  system elapsed 
# 124.99    1.83  128.25 
system.time(tdt[, Timestamp := Timestamp + dminutes(10)])
# user  system elapsed 
# 0.39    0.04    0.42 
system.time(tdt[, Timestamp := Timestamp + 600L])
# user  system elapsed 
# 0.01    0.00    0.01 
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
  • 1
    Thanks a lot for this answer Dave. I just modified the core of my function (according to this answer) and applied it on this sample time stamp and it works in less than a second. It is extremely fast and accurate. Thank you so much you saved me a lot of time of waiting for things to run. I wish I could accept both answers. One showed me a way to use a replacement function (which although doesn't seem to work great, it answers directly my question), the other one helped me solve my actual problem. I 'm confused :S . Stack overflow needs to do sth for such occasions! – LyzandeR Jul 13 '15 at 21:08
  • 1
    It's fine, Ari provided a nice answer and deserves the tick. – David Arenburg Jul 13 '15 at 21:10
  • 1
    Thanks a lot Dave, you made me feel better :). Thanks for all your help and the actual solution. – LyzandeR Jul 13 '15 at 21:22
  • 1
    @LyzandeR also perhaps keep in mind `IDateTime` (or similar `IDate`/`ITime`) classes which work like `Date`/`POSIXct` but are stored as integers for efficiency. – MichaelChirico Feb 28 '16 at 17:52
7

Replacement functions are run in two steps:

  1. A function that creates the desired output,
  2. That output is then assigned to the result.

You can run step 1 without running step 2. That result can then be used to set the data.table column (set used here but you could use := as well).

library(lubridate)
library(data.table)
tdt <- data.table(Timestamp = seq(as.POSIXct("1980-01-01 00:00:00"), as.POSIXct("2015-01-01 00:00:00"), '1 hour'))
minute(tdt$Timestamp) <- 20
print( `minute<-`(tdt$Timestamp,11) )
set( tdt, j=1L,value=`minute<-`(tdt$Timestamp,11)  )

Edit: Small data.table vs. big data.table benchmarking

library(lubridate)
library(data.table)
library(microbenchmark)

# Config
tms <- 5L

# Sample data, 1 column
tdt <- data.table(Timestamp = seq(as.POSIXct("1980-01-01 00:00:00"), as.POSIXct("2015-01-01 00:00:00"), '1 hour'))
minute(tdt$Timestamp) <- 20

tdf <- as.data.frame( tdt )


# Sample data, lots of columns
bdf <- cbind( tdf, as.data.frame( replicate( 100, runif(nrow(tdt)) ) ) )
bdt <- as.data.table( bdf )

# Benchmark
microbenchmark(
  `minute<-`(tdt$Timestamp,10), # How long does the operation to generate the new vector itself take?
  set( tdt, j=1L,value=`minute<-`(tdt$Timestamp,11)  ), # One column: How long does it take to generate the new vector and replace the contents in the data.table?
  minute( tdf$Timestamp ) <- 12, # One column: How long does it take to do it with a data.frame?
  set( tdt, j=1L,value=`minute<-`(bdt$Timestamp,13)  ), # Many columns: How long does it take to generate the new vector and replace the contents in the data.table?
  minute( bdf$Timestamp ) <- 14, #  Many columns: How long does it take to do it with a data.frame?
  times = tms
)

Unit: seconds
                                                    expr      min       lq     mean   median       uq      max neval
                           `minute<-`(tdt$Timestamp, 10) 1.304388 1.385883 1.417616 1.389316 1.459166 1.549327     5
 set(tdt, j = 1L, value = `minute<-`(tdt$Timestamp, 11)) 1.314495 1.344277 1.376241 1.352124 1.389083 1.481225     5
                             minute(tdf$Timestamp) <- 12 1.342104 1.349231 1.488639 1.378840 1.380659 1.992358     5
 set(tdt, j = 1L, value = `minute<-`(bdt$Timestamp, 13)) 1.337944 1.383429 1.402802 1.418211 1.418922 1.455503     5
                             minute(bdf$Timestamp) <- 14 1.332482 1.333713 1.355331 1.335728 1.342607 1.432127     5

Looks like it is no faster, which belies my understanding of what is going on. Strange.

Community
  • 1
  • 1
Ari B. Friedman
  • 71,271
  • 35
  • 175
  • 235
  • That's nice, I wonder how will it scale though. From my benchmark it was exactly the same timing – David Arenburg Jul 13 '15 at 20:33
  • This is fantastic! David's answer (I have upvoted Dave :P) is much faster to be honest for the particular example I have set but in all fairness this is the generic answer I was looking for. Thank you so much Ari. – LyzandeR Jul 13 '15 at 20:36
  • Since your example consists of a single column, I would expect that replacing the column vs. replacing the entire data.frame would be similar in speed. Try benchmarking it with a data.table with a few hundred columns? – Ari B. Friedman Jul 13 '15 at 20:36
  • My pleasure. I, too, upvoted David's lovely answer, but wanted to provide this solution in case your problem required more generic mechanisms. – Ari B. Friedman Jul 13 '15 at 20:39
  • It looks like this: `minute(tdt$Timestamp) <- 10` and this: `set( tdt, j='timestamp2',value=`minute<-`(df$timestamp, 11) )` take exactly the same time to run. `user system elapsed 10.72 0.10 10.81` on my laptop. It makes me think that there is actually no way to use a replacement function effectively on a data.table. – LyzandeR Jul 13 '15 at 20:45
  • I was wrong. No faster when big data.tables are used. @MatthewDowle any clues as to why it's all the same speed? Is this just a case where data.frame is being particularly efficient? Regardless, @LyzandeR, the key here is that it's `minute` that takes awhile. Optimize your efforts there, rather than on avoiding a whole-data-table copy (which is where data.table excels). – Ari B. Friedman Jul 13 '15 at 21:02
  • 1
    Thanks a lot Ari. This is a great piece of analysis, I appreciate you taking the time and effort to help. I will be using @DavidArenburg 's idea to avoid using `minute` which I already tested and works great. I am still curious to see why there doesn't seem to be a difference between data.frame and data.table when using a replacement function. Maybe Matthew can shed some light here. – LyzandeR Jul 13 '15 at 21:21
3

I guess this should do the trick for you:

library(data.table)
library(lubridate)

tdt <- data.table(
  Timestamp = seq(as.POSIXct("1980-01-01 00:00:00")
  , as.POSIXct("2015-01-01 00:00:00")
  , '1 hour'))
tdt[, Timestamp := Timestamp + dminutes(10)]
Alexander Radev
  • 652
  • 5
  • 11
  • Thanks for the answer. This is practically equivalent to Dave's answer and it will be slightly slower since it is using an external function. – LyzandeR Jul 13 '15 at 20:47
  • Yes, indeed it is equivalent. But I did not see it before I posted :) Yet I like mine a bit better due to the clearer understanding of what is going on. – Alexander Radev Jul 13 '15 at 20:54