joining data based on a moving time window in R

Question

I have weather data that was recorded every hour, and location data (X,Y) that was recorded every 4 hours. I want to know what the temperature was at location X,Y. The weather data isn't exactly at the same time. So, I have written this loop for every location to scan through the weather data looking for the "closest" in Date/TIME and extracting the data from that time. The problem is the way Ive written it, for location #2, it scans through the weather data but will not allow the closest time information to be assigned that was assigned for location#1. Say location #1 & 2 are taken within 10 minutes at 6pm and 6:10pm, the closest weather time is 6pm. I can't get it to allow the weather data at 6pm as an option. I kind of set it up like this because 200 locations into my location data set (say 3 months into it), I do not want it starting at time 0 from the weather data, when I know that the closest weather data was just calculated for the last location and that happens to be 3 months into that data set too. Below is some sample data and my code. I don't know if this makes sense.

<h6>####Location data</h6>

<p>X   Y   DateTime <br />
1   2   4/2/2003    18:01:01
3   2   4/4/2003    17:01:33
2   3   4/6/2003    16:03:07
5   6   4/8/2003    15:03:08
3   7   4/10/2003   14:03:06
4   5   4/2/2003    13:02:00
4   5   4/4/2003    12:14:43
4   3   4/6/2003    11:00:56
3   5   4/8/2003    10:02:06</p>

<h2>2   4   4/10/2003   9:02:19</h2>

<p>Weather Data
DateTime        WndSp   WndDir  Hgt
4/2/2003 17:41:00   8.17    102.86  3462.43
4/2/2003 20:00:00   6.70    106.00  17661.00
4/2/2003 10:41:00   6.18    106.00  22000.00
4/2/2003 11:41:00   5.78    106.00  22000.00
4/2/2003 12:41:00   5.48    104.00  22000.00
4/4/2003 17:53:00   7.96    104.29  6541.00
4/4/2003 20:53:00   6.60    106.00  22000.00
4/4/2003 19:41:00   7.82    105.00  7555.00
4/4/2003 7:41:00    6.62    105.00  14767.50
4/4/2003 8:41:00    6.70    106.00  17661.00
4/4/2003 9:41:00    6.60    106.00  22000.00
4/5/2003 20:41:00   7.38    106.67  11156.67
4/6/2003 18:07:00   7.82    105.00  7555.00
4/6/2003 21:53:00   6.18    106.00  22000.00
4/6/2003 21:41:00   6.62    105.00  14767.50
4/6/2003 4:41:00    7.96    104.29  6541.00
4/6/2003 5:41:00    7.82    105.00  7555.00
4/6/2003 6:41:00    7.38    106.67  11156.67
4/8/2003 18:53:00   7.38    106.67  11156.67
4/8/2003 22:53:00   5.78    106.00  22000.00
4/8/2003 1:41:00    5.78    106.00  22000.00
4/8/2003 2:41:00    5.48    104.00  22000.00
4/8/2003 3:41:00    8.17    102.86  3462.43
4/10/2003 19:53:00  6.62    105.00  14767.50
4/10/2003 23:53:00  5.48    104.00  22000.00
4/10/2003 22:41:00  6.70    106.00  17661.00
4/10/2003 23:41:00  6.60    106.00  22000.00
4/10/2003 0:41:00   6.18    106.00  22000.00
4/11/2003 17:41:00  8.17    102.86  3462.43</p>

<h2>4/12/2003 18:41:00  7.96    104.29  6541.0</h2>

.

weathrow = 1
for (i in 1:nrow(SortLoc)) {
    t = 0
    while (t < 1) {
        timedif1 = difftime(SortLoc$DateTime[i], SortWeath$DateTime[weathrow], units="auto")
        timedif2 =  difftime(SortLoc$DateTime[i], SortWeath$DateTime[weathrow+1], units="auto") 
        if (timedif2 < 0) {
            if (abs(timedif1) < abs(timedif2)) {
                SortLoc$WndSp[i]=SortWeath$WndSp[weathrow]
                SortLoc$WndDir[i]=SortWeath$WndDir[weathrow]
                SortLoc$Hgt[i]=SortWeath$Hgt[weathrow]
            } else {
                SortLoc$WndSp[i]=SortWeath$WndSp[weathrow+1]
                SortLoc$WndDir[i]=SortWeath$WndDir[weathrow+1]
                SortLoc$Hgt[i]=SortWeath$Hgt[weathrow+1]
            }
            t = 1
        }
        if (abs(SortLoc$DateTime[i] - SortLoc$DateTime[i+1] < 50)) {
            weathrow=weathrow
        } else {
            weathrow = weathrow+1
            #if(weathrow = nrow(SortWeath)){t=1}
        }
    } #end while
}

If this is an R question, could you provide some sample data in an easy to use R format? ie, dput(object) a small amount will do. — Brandon Bertelsen, Apr 12 '11 at 04:39
There is a small set of data above the R code. I am not sure what dput(object) means as I am learning as I go. — Kerry, Apr 12 '11 at 16:33
If you put an object (variable) into dput(object) it will output something that others can use to copy and paste a portion of your data. — Brandon Bertelsen, Apr 12 '11 at 20:56

Ari B. Friedman · Answer 1 · 2011-04-13T04:34:44.287

Here's an example of one strategy you might use. This goes through the weather times one-by-one, then takes the absolute value of the difference between that and every location time, then grabs the lowest difference in times. That solves your look ahead/look back problem. Your dataset seems small enough that moving to a half-vectorized solution should be all the speed gain you need, but if not it should be relatively simple to add on a moving window which only passes some +/- N observations around the last loc.match.index.

w <- as.POSIXct(strptime( c("4/2/2003 17:41:00","4/2/2003 20:00:00","4/2/2003 10:41:00","4/2/2003 11:41:00","4/2/2003 12:41:00"),format="%m/%d/%Y %H:%M:%S"))
l <- as.POSIXct(strptime( c("4/2/2003 18:01:01","4/2/2003 17:01:33","4/2/2003 16:03:07","4/2/2003 15:03:08","4/2/2003 14:03:06","4/2/2003 13:02:00"),format="%m/%d/%Y %H:%M:%S"))

window.size <- 5

findClosest <- function(w.i,l) {
    which.min(abs(w.i-l))
}
makeWindow <- function(loc.match.index,i,window.size,n) {
    win.max <- loc.match.index[i-1] + window.size
    if(win.max > n) {
        win.max <- n
    }
    win.min <- loc.match.index[i-1] - window.size
    if(win.min < 1) {
        win.min <- 1
    }
    return(seq(win.min,win.max))
}

loc.match.index <- integer()
n <- length(w)
# Initialize on whole vector
i <- 1
loc.match.index[i] <- findClosest(w[i],l)
# Continue on window
for(i in seq(2,n)) {
    wndw <- makeWindow(loc.match.index,i,window.size,n)
    loc.match.index[i] <- findClosest(w[i],l[wndw]) 
    # Add the start of the window back to the index that was returned
    loc.match.index[i] <- loc.match.index[i] + min(wndw)-1
}
> loc.match.index
[1] 1 1 5 5 5

There are still areas that could be made more efficient here, but this should be reasonably speedy given that the comparisons are vectorized and the window can be tuned.

So, my dataset is 120,000 locations and my weather has been compiled over 5 years so 71,000 weather data. I had previously written the script to do what I think yo have described but it was taking 8 days to process and tying up 1 computer to do it. For point 1 I calculated the date/time difference for every weather data (71,000 differences) and then found the minimum. Now do that 120,000 times. Thus the idea of a window. I do not need to back calculate nor do I need to search all the way to the end. So, your advice on the window...if you could explain that a little more? — Kerry, Apr 12 '11 at 16:31
I tried running the little script you gave with more weather data and came up with an error stating that the data.frames need to be the same size. In my current attempt, there were more weather points than locations, but in the final version there will be more location points than weather. Either way... — Kerry, Apr 12 '11 at 16:39
@Kerry: I confirmed it works with different vector lengths on my computer. When you copy and paste it in, does it work? I suspect that in moving from my simplified solution to your data.frame-based solution you've made an error. Perhaps post your code? I'd have worked with data.frames if you'd provided the `dput()` output...for future reference, `dput` really does make things easier for everyone. — Ari B. Friedman, Apr 13 '11 at 04:33

Marek · Accepted Answer · 2011-04-14T11:02:15.333

You could use findInterval function to find nearest value:

# example data:
x <- rnorm(120000)
y <- rnorm(71000)
y <- sort(y) # second vector must be sorted
id <- findInterval(x, y, all.inside=TRUE) # finds position of last y smaller then x
id_min <- ifelse(abs(x-y[id])<abs(x-y[id+1]), id, id+1) # to find nearest

In your case some as.numeric might be needed.

# assumed that SortWeath is sorted, if not then SortWeath <- SortWeath[order(SortWeath$DateTime),]
x <- as.numeric(SortLoc$DateTime)
y <- as.numeric(SortWeath$DateTime)
id <- findInterval(x, y, all.inside=TRUE)
id_min <- ifelse(abs(x-y[id])<abs(x-y[id+1]), id, id+1)
SortLoc$WndSp  <- SortWeath$WndSp[id_min]
SortLoc$WndDir <- SortWeath$WndDir[id_min]
SortLoc$Hgt    <- SortWeath$Hgt[id_min]

Some addition: you should never, ABSOLUTELY NEWER add values to data.frame in for-loop. Check this comparison:

N=1000
x <- numeric(N)
X <- data.frame(x=x)
require(rbenchmark)
benchmark(
    vector = {for (i in 1:N) x[i]<-1},
    data.frame = {for (i in 1:N) X$x[i]<-1}
)
#         test replications elapsed relative
# 2 data.frame          100    4.32    22.74
# 1     vector          100    0.19     1.00

data.frame version is over 20 times slower, and if more rows it contain then difference is bigger.

So if you change you script and first initialize result vectors:

tmp_WndSp <- tmp_WndDir <- tmp_Hg <- rep(NA, nrow(SortLoc))

then update values in loop

tmp_WndSp[i] <- SortWeath$WndSp[weathrow+1]
# and so on...

and at the end (outside the loop) update proper columns:

SortLoc$WndSp <- tmp_WndSp
SortLoc$WndDir <- tmp_WndDir
SortLoc$Hgt <- tmp_Hgt

It should run much faster.

thank you so much! this works great. and no need to loop! now to test it completely! — Kerry, Apr 14 '11 at 00:27

joining data based on a moving time window in R

2 Answers2

Linked