efficiently generate a random sample of times and dates between two dates

Question

I have written a (fairly naive) function to randomly select a date/time between two specified days:

# set start and end dates to sample between
day.start <- "2012/01/01"
day.end <- "2012/12/31"

# define a random date/time selection function
rand.day.time <- function(day.start,day.end,size) {
  dayseq <- seq.Date(as.Date(day.start),as.Date(day.end),by="day")
  dayselect <- sample(dayseq,size,replace=TRUE)
  hourselect <- sample(0:23,size,replace=TRUE)  # valid clock hours run from 0 to 23
  minselect <- sample(0:59,size,replace=TRUE)
  as.POSIXlt(paste(dayselect, hourselect,":",minselect,sep="") )
}

Which results in:

> rand.day.time(day.start,day.end,size=3)
[1] "2012-02-07 21:42:00" "2012-09-02 07:27:00" "2012-06-15 01:13:00"

But this seems to be slowing down considerably as the sample size ramps up.

# some benchmarking
> system.time(rand.day.time(day.start,day.end,size=100000))
   user  system elapsed 
   4.68    0.03    4.70 
> system.time(rand.day.time(day.start,day.end,size=200000))
   user  system elapsed 
   9.42    0.06    9.49

Is anyone able to suggest how to do something like this in a more efficient manner?

Dirk Eddelbuettel · Accepted Answer · 2013-02-06T12:50:51.007

48

Ahh, another date/time problem we can reduce to working in floats :)

Try this function

R> latemail <- function(N, st="2012/01/01", et="2012/12/31") {
+     st <- as.POSIXct(as.Date(st))
+     et <- as.POSIXct(as.Date(et))
+     dt <- as.numeric(difftime(et,st,units="secs"))
+     ev <- sort(runif(N, 0, dt))
+     rt <- st + ev
+ }
R>

We compute the difftime in seconds, and then "merely" draw uniforms over it, sorting the result. Add that to the start and you're done:

R> set.seed(42); print(latemail(5))     ## round to date, or hour, or ...
[1] "2012-04-14 05:34:56.369022 CDT" "2012-08-22 00:41:26.683809 CDT" 
[3] "2012-10-29 21:43:16.335659 CDT" "2012-11-29 15:42:03.387701 CST"
[5] "2012-12-07 18:46:50.233761 CST"
R> system.time(latemail(100000))
   user  system elapsed 
  0.024   0.000   0.021 
R> system.time(latemail(200000))
   user  system elapsed 
  0.044   0.000   0.045 
R> system.time(latemail(10000000))   ## a few more than in your example :)
   user  system elapsed 
  3.240   0.172   3.428 
R>
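
The "round to date, or hour, or ..." remark above can be illustrated with a minimal sketch. It assumes base R only; the fixed `tz = "UTC"` and the variable names are illustrative choices, not part of the answer.

```r
set.seed(42)

# Five uniform draws within 365 days of the start, built the same way
# as in latemail(): a POSIXct start plus a numeric offset in seconds.
st <- as.POSIXct("2012-01-01", tz = "UTC")
x  <- st + sort(runif(5, 0, 365 * 86400))

# Truncate to the hour (drops minutes and fractional seconds).
by_hour <- as.POSIXct(trunc(x, units = "hours"))

# Keep only the calendar day.
by_day <- as.Date(x, tz = "UTC")
```

`trunc()` returns a `POSIXlt`, so the result is converted back to `POSIXct` to keep the compact numeric representation.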

edited Feb 06 '13 at 12:50

answered Feb 06 '13 at 03:44


10

First rule of working with dates and times: *always* remember that `POSIXct` is really just a numeric with fractional seconds since the epoch. Ditto for `Date` and fractional days. A lot of problems become a *lot* easier that way. – Dirk Eddelbuettel Feb 06 '13 at 03:57
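
A small sketch of what that comment means in practice, assuming base R; the example values are made up for illustration.

```r
# POSIXct: seconds since 1970-01-01 00:00:00 UTC, stored as a double.
t0 <- as.POSIXct("1970-01-02 00:00:30", tz = "UTC")
as.numeric(t0)   # 86430 (one day plus 30 seconds)

# Date: days since 1970-01-01.
d0 <- as.Date("1970-01-11")
as.numeric(d0)   # 10

# Arithmetic therefore works directly on both classes.
t1 <- t0 + 90    # 90 seconds later
d1 <- d0 + 0.5   # half a day later; still prints as the same calendar date
```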
4

The genius of this answer is the `st + ev` trick -- it's the roundtrip to a `POSIXct` that is painful, since you need to explicitly specify the origin. Otherwise `runif(N, as.POSIXct(st), as.POSIXct(et))` gets you 90% of this; but then you need to `as.POSIXct(..., origin="1970-01-01")` – user295691 Aug 06 '15 at 14:51
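
The two routes described in that comment can be compared side by side. This is only a sketch under the assumption of a fixed `tz = "UTC"`; the variable names are illustrative.

```r
st <- as.POSIXct("2012-01-01", tz = "UTC")
et <- as.POSIXct("2012-12-31", tz = "UTC")

# Route 1: draw plain numbers between the two instants, then convert
# back, which requires naming the origin explicitly.
set.seed(1)
secs <- runif(3, as.numeric(st), as.numeric(et))
a <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")

# Route 2: draw offsets and add them to a POSIXct, which keeps the class.
set.seed(1)
b <- st + runif(3, 0, as.numeric(et) - as.numeric(st))
```

With the same seed both routes land on the same instants (up to floating-point rounding); the second one simply avoids the conversion step.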
Why the `sort` command when generating a sequence of random values? – Sam Firke Jan 08 '16 at 15:20
I find it preferable to have dates ordered. You can obviously make that optional via a function parameter. – Dirk Eddelbuettel Jan 08 '16 at 15:43
How to keep all the timezones the same? – d8aninja Apr 25 '16 at 16:10
@d8aninja - the timezones are the same. There is no such thing as CST when daylight savings is in operation so it switches to CDT. You could manually force a timezone that is just +/- hours from GMT (e.g. "Etc/GMT-6") if you don't want daylight savings to have any effect on the time at that location. – thelatemail Oct 15 '19 at 22:07
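
Both suggestions from these comments (optional sorting, a fixed time zone) could be folded into one function. The sketch below is an assumption about how that might look, not the answer's own code; the name `rand_times_tz` and the `tz` and `sorted` parameters are hypothetical. Note that in tz database names such as `"Etc/GMT+6"` the sign is inverted relative to the usual notation, so `"Etc/GMT+6"` is six hours behind UTC.

```r
# Hypothetical variant of latemail() with an explicit time zone and
# optional sorting.
rand_times_tz <- function(n, st = "2012-01-01", et = "2012-12-31",
                          tz = "UTC", sorted = TRUE) {
  st <- as.POSIXct(st, tz = tz)
  et <- as.POSIXct(et, tz = tz)
  ev <- runif(n, 0, as.numeric(difftime(et, st, units = "secs")))
  if (sorted) ev <- sort(ev)
  st + ev
}

set.seed(42)
# A fixed offset from UTC, so no daylight saving switch appears in the output.
x <- rand_times_tz(5, tz = "Etc/GMT+6", sorted = FALSE)
format(x, "%Y-%m-%d %H:%M:%S %Z")
```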

score 2 · Answer 2 · answered Feb 27 '14 at 17:11

Something like this will work too. Sorry for the random data frame, I just threw that in there so you could see a plot.

data=as.data.frame(list(ID=1:10,
                   variable=rnorm(10,50,10)))

#This function will generate a uniform sample of dates from 
#within a designated start and end date:

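rand.date=function(start.day,end.day,data){
  size=dim(data)[1]
  days=seq.Date(as.Date(start.day),as.Date(end.day),by="day")
  #sample() picks whole days with equal probability, including the last one
  #(indexing with runif() values truncates them, so end.day is never drawn)
  sample(days,size,replace=TRUE)
}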

#This will create a new column within your data frame called date:

data$date=rand.date("2014-01-01","2014-02-28",data)

#and this will order your data frame by date:

data=data[order(data$date),]

#Finally, you can see how the data looks

plot(data$date,data$variable,type="b")
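
If only whole days are needed, the day sequence does not have to be built at all, because a `Date` is just a count of days. A minimal sketch, assuming base R; the helper name `rand_dates` is hypothetical.

```r
# Hypothetical helper: draw n random calendar days between two dates,
# both ends included, by adding integer offsets to the start date.
rand_dates <- function(n, start.day, end.day) {
  start  <- as.Date(start.day)
  n.days <- as.integer(as.Date(end.day) - start) + 1L
  start + sample.int(n.days, n, replace = TRUE) - 1L
}

set.seed(1)
d <- rand_dates(1000, "2014-01-01", "2014-02-28")
range(d)
```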
