Tricky data merge on date in R

Question

1. The datasets

I have two large data sets in csv format - a forecast data set and a observations data set.

There is forecast data which has a time the forecast is issued (IssueDatetime), a time the forecast ends (endtime), and a time the forecast is valid (forecastTime). There are many forecast regions, and the data set spans over a 20 year period. The regions are also identified by a polygon id which I assigned via GIS processing (poly_id), which I would like to use to merge the second data set -because the observations and forecasts must occur in the same place at the same time.

         IssueDatetime             endtime     Regions        forecastTime
1  2013-01-01 09:00:00 2013-01-03 03:00:00 SOUTH COAST 2013-01-01 09:00:00
2  2013-01-01 09:00:00 2013-01-03 03:00:00 SOUTH COAST 2013-01-01 18:00:00
3  2013-01-01 09:00:00 2013-01-03 03:00:00 SOUTH COAST 2013-01-02 06:00:00
4  2013-01-01 09:00:00 2013-01-03 03:00:00 SOUTH COAST 2013-01-02 13:00:00
5  2013-01-02 09:00:00 2013-01-04 03:00:00 SOUTH COAST 2013-01-02 09:00:00
6  2013-01-02 09:00:00 2013-01-04 03:00:00 SOUTH COAST 2013-01-02 12:00:00
7  2013-01-02 09:00:00 2013-01-04 03:00:00 SOUTH COAST 2013-01-03 12:00:00
8  2013-01-03 09:00:00 2013-01-05 03:00:00 SOUTH COAST 2013-01-03 09:00:00
9  2013-01-03 09:00:00 2013-01-05 03:00:00 SOUTH COAST 2013-01-03 15:00:00
10 2013-01-03 09:00:00 2013-01-05 03:00:00 SOUTH COAST 2013-01-04 00:00:00
   forecastHour forecast_Dir forecast_WindSpeed_low forecast_WindSpeed_gust
1             0          270                     35                      NA
2             9          270                     25                      NA
3            21          225                     15                      NA
4            28          270                     35                      NA
5             0          270                     25                      NA
6             3          270                     40                      NA
7            27          315                     25                      NA
8             0          270                     35                      NA
9             6          315                     25                      NA
10           15          315                     25                      NA
   forecast_WindSpeed_high forecast_WindSpeed_exception_1_type
1                       NA                                    
2                       NA                                    
3                       NA                                    
4                       NA                                    
5                       NA                                    
6                       NA                                    
7                       35                                    
8                       NA                                    
9                       35                                    
10                      NA                                    
   forecast_Dir_exception_1 forecast_WindSpeed_low_exception_1
1                                                           NA
2                                                           NA
3                                                           NA
4                                                           NA
5                                                           NA
6                                                           NA
7                                                           NA
8                                                           NA
9                                                           NA
10                                                          NA
   forecast_WindSpeed_high_exception_1  poly_id
1                                   NA fea1-290
2                                   NA fea1-290
3                                   NA fea1-290
4                                   NA fea1-290
5                                   NA fea1-290
6                                   NA fea1-290
7                                   NA fea1-290
8                                   NA fea1-290
9                                   NA fea1-290
10                                  NA fea1-290

2. My approach(es)

I'm quite new to R, and I've thought about this problem in several ways:

Merging the data sets date and then filtering when the poly_id matched either region_id or the subRegion_id.
Expanding/upsampling the forecast data so that all hours have a forecast, not just the hours where the forecast changes.

3. Issues I ran into

The result of my first approach was a successful data merge. However, the results left out any observation that didn't occur on a forecastTime - Because many of the observations didn't occur on a forecast time, but actually between forecast times or, after the last forecast time and before the end time.

I was unsuccessful in my second approach in upsampling the data.

4. My ideal output

I need either:

The forecastTime column in the first data set to be upsampled to include all the hours in between forecast times. From there i think I could use my original approach and merge the datasets myself.
Merge the data sets which include observations which occur on and between forecast times, and also matching the poly_id from the forecast data set with region_id or subRegion_id from the observations data set.

Here are smaller versions of the data sets in proper format according to How to make a great R reproducible example

forecast data:

structure(list(IssueDatetime = structure(c(1L, 1L, 1L, 1L, 2L, 
2L, 2L, 3L, 3L, 3L), .Label = c("2013-01-01 09:00:00", "2013-01-02 09:00:00", 
"2013-01-03 09:00:00"), class = "factor"), Regions = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "SOUTH COAST", class = "factor"), 
    forecastTime = structure(c(1L, 2L, 3L, 6L, 4L, 5L, 8L, 7L, 
    9L, 10L), .Label = c("2013-01-01 09:00:00", "2013-01-01 18:00:00", 
    "2013-01-02 06:00:00", "2013-01-02 09:00:00", "2013-01-02 12:00:00", 
    "2013-01-02 13:00:00", "2013-01-03 09:00:00", "2013-01-03 12:00:00", 
    "2013-01-03 15:00:00", "2013-01-04 00:00:00"), class = "factor"), 
    forecastHour = c(0, 9, 21, 28, 0, 3, 27, 0, 6, 15), forecast_Dir = structure(c(2L, 
    2L, 1L, 2L, 2L, 2L, 3L, 2L, 3L, 3L), .Label = c("225", "270", 
    "315"), class = "factor"), forecast_WindSpeed_low = c(35, 
    25, 15, 35, 25, 40, 25, 35, 25, 25), forecast_WindSpeed_gust = c(NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_), forecast_WindSpeed_high = c(NA, 
    NA, NA, NA, NA, NA, 35, NA, 35, NA), forecast_WindSpeed_exception_1_type = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "", class = "factor"), 
    forecast_Dir_exception_1 = structure(c(1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L), .Label = "", class = "factor"), forecast_WindSpeed_low_exception_1 = c(NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_), forecast_WindSpeed_high_exception_1 = c(NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_), poly_id = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "fea1-290", class = "factor")), .Names = c("IssueDatetime", 
"Regions", "forecastTime", "forecastHour", "forecast_Dir", "forecast_WindSpeed_low", 
"forecast_WindSpeed_gust", "forecast_WindSpeed_high", "forecast_WindSpeed_exception_1_type", 
"forecast_Dir_exception_1", "forecast_WindSpeed_low_exception_1", 
"forecast_WindSpeed_high_exception_1", "poly_id"), row.names = c(NA, 
10L), class = "data.frame")

observations data:

structure(list(shipName = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L), .Label = "teleost", class = "factor"), dateTime = structure(c(1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 8L, 8L), .Label = c("2000-01-04 18:00", 
"2000-01-05 11:00", "2000-01-06 14:00", "2000-01-06 17:00", "2000-01-08 1:00", 
"2000-01-09 10:00", "2000-01-09 15:00", "2000-01-10 12:00"), class = "factor"), 
    windDir = c(29L, 16L, 29L, 27L, 19L, 28L, 24L, 1L, 1L, 1L
    ), windSpeed = c(9L, 13L, 37L, 30L, 19L, 13L, 9L, 13L, 13L, 
    13L), region_id = structure(c(2L, 2L, 2L, 2L, 2L, 1L, 1L, 
    3L, 3L, 3L), .Label = c("fea1-284", "fea1-287", "fea1-314"
    ), class = "factor"), subRegion_id = structure(c(3L, 2L, 
    2L, 2L, 2L, 1L, 1L, 4L, 5L, 6L), .Label = c("fea1-286", "fea1-288", 
    "fea1-289", "fea1-319", "fea1-323", "fea1-325"), class = "factor")), .Names = c("shipName", 
"dateTime", "windDir", "windSpeed", "region_id", "subRegion_id"
), row.names = c(NA, 10L), class = "data.frame")

sample code of where I've gotten thus far:

library(dplyr)

#reads csv files as dataframes
fd <- read.csv("forecastData_DONE.csv", header = TRUE)
obs <- read.csv("dfo_obs_UPDATE.csv", header=TRUE)


fd$forecast_windSpeed_avg <- (fd$forecast_WindSpeed_low + fd$forecast_WindSpeed_high)/2

fd$date_time  <-  as.POSIXct(fd$forecastTime) 
fd$endtime <- as.POSIXct(fd$endtime)
obs$date_time <-  as.POSIXct(obs$dateTime) 


fd <- fd %>% select(IssueDatetime, Regions,  forecastTime, forecastHour, forecast_Dir, forecast_WindSpeed_low, forecast_WindSpeed_gust, forecast_WindSpeed_high, forecast_WindSpeed_exception_1_type, forecast_Dir_exception_1, forecast_WindSpeed_low_exception_1, forecast_WindSpeed_high_exception_1, poly_id)

df <- left_join(fd, obs, by = "date_time")



i <- sapply(df, is.factor)
df[i] <- lapply(df[i], as.character)



df_final <- df[ which(df$poly_id==df$region_id | df$poly_id==df$subRegion_id), ]

This code merges the data by date and region, but leaves out all the data that occurs when the forecasTime and observation time are not an exact match (which is most of the time).

Need the to be between IssueDatetime and endtime, however the forecast changes on the forecast times within the date range. For example if an observation occurred at 2013-01-01 15:00:00, that is within the IssueDatetime = 2013-01-01 09:00:00 and endtime = 2013-01-03 03:00:00. The corresponding forecast time is forecastTIme = 2013-01-01 09:00:00 because the observation occurs before the next update which is 2013-01-01 18:00:00. Another option is to expand the forecastTime column to hourly. — Jordan Ford, Aug 10 '20 at 02:08
Your example does not include the column endtime for the forecast data. This should be pretty straightforward though but dplyr's join for now does not seem to support custom conditions. So you can either do a full join and filter or use the package data.table: https://www.r-bloggers.com/in-between-a-rock-and-a-conditional-join/ — karo, Aug 10 '20 at 10:38

Tricky data merge on date in R

0 Answers0