10

I have one file (location) that has an x,y coordinates and a date/time identification. I want to get information from a second table (weather) that has a "similar" date/time variable and the co-variables (temperature and wind speed). The trick is the date/time are not exactly the same numbers in both tables. I want to select the weather data that is closest from the location data. I know I need to do some loops and thats about it.

Example location                                    example weather

x    y     date/time                         date/time           temp        wind
1    3     01/02/2003 18:00                  01/01/2003 13:00     12          15
2    3     01/02/2003 19:00                  01/02/2003 16:34     10          16
3    4     01/03/2003 23:00                  01/02/2003 20:55     14          22
2    5     01/04/2003 02:00                  01/02/2003 21:33     14          22
                                             01/03/2003 00:22     13          19
                                             01/03/2003 14:55     12          12
                                             01/03/2003 18:00     10          12
                                             01/03/2003 23:44     2           33
                                             01/04/2003 01:55     6           22

So the final output would be a table with the correctly "best" matched weather data to the location data

x    y     datetime               datetime           temp        wind
1    3     01/02/2003 18:00  ----  01/02/2003 16:34     10          16
2    3     01/02/2003 19:00  ----  01/02/2003 20:55     14          22
3    4     01/03/2003 23:00  ----  01/03/2003 00:22     13          19               
2    5     01/04/2003 02:00  ----  01/04/2003 01:55     6           22

Any suggestions where to start? I am trying to do this in R

David LeBauer
  • 31,011
  • 31
  • 115
  • 189
Kerry
  • 793
  • 14
  • 33
  • 2
    I expect you'll be able to solve this without conventional loops. One statement from the apply family combined with an approx() may well do the trick. Suggestions where to start? Give us some proper data to work with. Rather than pasting text, construct your data in R, then paste the results of dput() here, so we can easily reconstruct your data and write some code that can be tested. – Andrie Mar 24 '11 at 21:47
  • 1
    Why is row 3 of location joined to row 5 of weather? Isn't row 8 of weather the closest to 01/03/2003 23:00 ? – Matt Dowle Mar 24 '11 at 21:57
  • @ Matthew - you are correct that was a mistake of generating data on the fly – Kerry Mar 24 '11 at 22:52

2 Answers2

5

One fast and short way may be using data.table. If you create two data.table's X and Y, both with keys, then the syntax is :

X[Y,roll=TRUE]

We call that a rolling join because we roll the prevailing observation in X forward to match the row in Y. See the examples in ?data.table and the introduction vignette.

Another way to do this is the zoo package which has locf (last observation carried forward), and possibly other packages too.

I'm not sure if you mean closest in terms of location, or time. If location, and that location is x,y coordinates then you will need some distance measure in 2D space I guess. data.table only does univariate 'closest' e.g. by time. Reading your question for a 2nd time it does seem you mean closest in the prevailing sense though.

EDIT: Seen the example data now. data.table won't do this in one step because although it can roll forwards or backwards, it won't roll to the nearest. You could do it with an extra step using which=TRUE and then test whether the one after the prevailing was actually closer.

Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
  • Thank you, I will look at that to see if it does a better job, or faster anyway because as of right now this is what I have done based on some stuff I saw for MySQL scripts **for (i in 1:nrow(loc)) { index = which.min(abs(loc$DateTime[i] - weather$DateTime)) loc$WndSp[i] = weather$WndSp[index] }** – Kerry Mar 24 '11 at 22:47
5

I needed to bring that data in as data and time separately and then paste and format

location$dt.time <- as.POSIXct(paste(location$date, location$time), 
                                 format="%m/%d/%Y %H:%M")

And the same for weather

Then for each value of date.time in location, find the entry in weather that has the lowest absolute values for the time differences:

 sapply(location$dt.time, function(x) which.min(abs(difftime(x, weather$dt.time))))
# [1] 2 3 8 9
 cbind(location, weather[ sapply(location$dt.time, 
                      function(x) which.min(abs(difftime(x, weather$dt.time)))), ])

  x y       date  time             dt.time       date  time temp wind             dt.time
2 1 3 01/02/2003 18:00 2003-01-02 18:00:00 01/02/2003 16:34   10   16 2003-01-02 16:34:00
3 2 3 01/02/2003 19:00 2003-01-02 19:00:00 01/02/2003 20:55   14   22 2003-01-02 20:55:00
8 3 4 01/03/2003 23:00 2003-01-03 23:00:00 01/03/2003 23:44    2   33 2003-01-03 23:44:00
9 2 5 01/04/2003 02:00 2003-01-04 02:00:00 01/04/2003 01:55    6   22 2003-01-04 01:55:00

 cbind(location, weather[ 
                  sapply(location$dt.time, 
                    function(x) which.min(abs(difftime(x, weather$dt.time)))), ])[ #pick columns
                          c(1,2,5,8,9,10)]

  x y             dt.time temp wind           dt.time.1
2 1 3 2003-01-02 18:00:00   10   16 2003-01-02 16:34:00
3 2 3 2003-01-02 19:00:00   14   22 2003-01-02 20:55:00
8 3 4 2003-01-03 23:00:00    2   33 2003-01-03 23:44:00
9 2 5 2003-01-04 02:00:00    6   22 2003-01-04 01:55:00

My answers seem a bit different than yours but another reader has already questioned your abilities to do the matching properly by hand.

IRTFM
  • 258,963
  • 21
  • 364
  • 487