Searching Lists in R based on closest date

Question

Currently trying to write some that would return the last date from an ordered list that is less than date X.

Right now I have this: it gets a list of days, and gets an index off the day we're going to be doing search on and range of how many dates we want to go back.

After that it checks if the date exists or not (e.g. Feb 30th). If the date doesn't exist, it decreases the date by 1 and then applies filter again (otherwise it tries to subtract 1 day from NA and fails).

library(lubridate)
getDate <- function(dates,day,range){
    if(range == 'single')
        {return (day-1)}

    z <- switch(range,
        single = days(1),
        month = days(30),
        month3 = months(3),
        month6 = months(6),
        year = years(1)
        )

    new_day <-(dates[day]-z)
    i <- 1
    while (is.na(new_day)){
        new_day <- dates[day] - days(i) - z 
    }
    ind<-which.min(abs (diff <-(new_day-dates)))

    if (diff[ind] < 0)
    {ind <- ind -1}

    return (ind[1])
}

While this function works, the problem is the speed efficiency. I have a feeling that which.min(abs()) is far from the quickest and I'm wondering if there are any better alternatives (outside of also writing my own function to search lists).

stocks <- list(structure(list(sec = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), min = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), hour = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L), mday = c(2L, 3L, 4L, 7L, 8L, 9L, 10L, 11L, 14L, 15L, 16L, 17L,
18L, 22L, 23L, 24L, 25L, 28L, 29L, 30L, 31L, 1L, 4L, 5L, 6L), mon = c(0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L,
1L, 1L, 1L), year = c(108L, 108L, 108L, 108L, 108L, 108L, 108L, 108L, 108L,
108L, 108L, 108L, 108L, 108L, 108L, 108L, 108L, 108L, 108L, 108L, 108L, 108L,
108L, 108L, 108L), wday = c(3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L,
2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L), yday = c(1L, 2L, 3L, 6L, 7L,
8L, 9L, 10L, 13L, 14L, 15L, 16L, 17L, 21L, 22L, 23L, 24L, 27L, 28L, 29L, 30L,
31L, 34L, 35L, 36L), isdst = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("sec", "min",
"hour", "mday", "mon", "year", "wday", "yday", "isdst"), tzone = "UTC",
class = c("POSIXlt", "POSIXt")))

old_pos <- getDate(stocks[[1]],21,"month") #should return 0
old_pos <- getDate(stocks[[1]],22,"month") #should return 1

This does not return a vector, nor a date, only an index and the main question isn't about working (which it does), but optimizing it.

The value is later on being used in another function, one possible speed up is to first match all of the old indexes to new ones and then return that as another list. However not sure if it would offer any speed up.

http://pastebin.com/sDXMSft6 or http://pastebin.com/vLVvwjHd (The only value being passed in as stocks is stocks[[1]]) — Gray_Hound, Jun 21 '13 at 14:54
Rather than pasting a link to printed data, please provide the output from `dput(head(stocks[[1]],20))` and an example of how we should call your `getDate` function. Then your question would contain a minimal, [reproducible example](http://stackoverflow.com/q/5963269/271616). — Joshua Ulrich, Jun 21 '13 at 15:03
@Gray_Hound you can be inspired by my answer t create a reproducible example with the expected result. Creating a good example is generally half path to the solution. — agstudy, Jun 21 '13 at 15:07
I agree with the others who are encouraging you to improve how you ask a question. There is too much irrelevant code and excessive data in the question as it now appears. Its not just a matter of dumping whatever you have onto SO but extracting the meat of it and asking just that. I have answered @agstudy's reformulation of your question in my answer. I suspect not many others would be willing to wade through what you have presented either. — G. Grothendieck, Jun 21 '13 at 16:23
@G.Grothendieck: I just shortened the data and adjusted the example. — Joshua Ulrich, Jun 21 '13 at 16:42

G. Grothendieck · Answer 1 · 2013-06-21T16:19:50.397

Using @agstudy's reformulation including sDate and x.Date

data.table

We can perform the calculations in data.table like this where the first column shows the original date in sDate and the second column is the corresponding x.Date date:

> library(data.table)
> data.table(date = x.Date, x.Date, key = "date")[J(sDate),, roll = TRUE]
         date     x.Date
1: 2003-02-03 2003-02-02
2: 2003-02-12 2003-02-10
3: 2003-02-16 2003-02-15

sqldf Using sqldf its like this:

> library(sqldf)
> sDateDF <- data.frame(sDate = sDate)
> xDateDF <- data.frame(xDate = x.Date)
> 
> sqldf("select s.sdate sDate, max(x.xdate) xDate 
+   from sDateDF s join xDateDF x on x.xDate <= s.sDate 
+   group by s.sDate")
       sDate      xDate
1 2003-02-03 2003-02-02
2 2003-02-12 2003-02-10
3 2003-02-16 2003-02-15

zoo

Using zoo, we create two zoo series, merge them and use na.locf like this. The result is the x.Date corresponding to each sDate (i.e. the second column in either of the above solutions):

> library(zoo)
>
> zx <- zoo(seq_along(x.Date), x.Date)
> zs <- zoo(seq_along(sDate), sDate)
> x.Date[na.locf(merge(zx, zs))[sDate, "zx"]]
[1] "2003-02-02" "2003-02-10" "2003-02-15"

agstudy · Answer 2 · 2013-06-21T16:10:51.280

2

If I understand you have a vector of dates, for example :

x.Date <- as.Date("2003-02-01") + c(1, 3, 7, 9, 14,20) 
"2003-02-02" "2003-02-04" "2003-02-08" "2003-02-10" "2003-02-15" "2003-02-21"

and giving a vector of dates, for example:

sDate <- as.Date("2003-02-01") + c(2,11,15)

You try to get the closer date in x.Date to this giving date but less than this date:

 lapply(sDate,function(x)max(x.Date[x.Date-x <=0]))
[[1]]
[1] "2003-02-02"

[[2]]
[1] "2003-02-10"

[[3]]
[1] "2003-02-15"

edited Jun 21 '13 at 16:10

answered Jun 21 '13 at 14:57

agstudy

119,832
17
199
261

Yes, with only specific thing being that sDate is one of the dates in a list - one of the fixed time intervals. – Gray_Hound Jun 21 '13 at 14:59
@Gray_Hound I edit my answer. sDate is a list of dates now. – agstudy Jun 21 '13 at 15:06
@agstudy `is.list(sDate)` is not `TRUE`. – Roland Jun 21 '13 at 15:10
@Roland yes right but I don't get your point... do you mean I `lapply(as.list(sDate),...)` – agstudy Jun 21 '13 at 15:14
I just think we should try to be exact with terminology in order to avoid confusion. – Roland Jun 21 '13 at 15:18
@Roland it is not a vector neither ..I can't find the right terminology here. – agstudy Jun 21 '13 at 15:40
1

@agstudy `is.vector(unclass(sDate)) == TRUE`. It's a vector with an attribute. – GSee Jun 21 '13 at 16:02

Searching Lists in R based on closest date

2 Answers2