1. The datasets
I have two large data sets in csv format - a forecast data set and a observations data set.
There is forecast data which has a time the forecast is issued (IssueDatetime), a time the forecast ends (endtime), and a time the forecast is valid (forecastTime). There are many forecast regions, and the data set spans over a 20 year period. The regions are also identified by a polygon id which I assigned via GIS processing (poly_id), which I would like to use to merge the second data set -because the observations and forecasts must occur in the same place at the same time.
IssueDatetime endtime Regions forecastTime
1 2013-01-01 09:00:00 2013-01-03 03:00:00 SOUTH COAST 2013-01-01 09:00:00
2 2013-01-01 09:00:00 2013-01-03 03:00:00 SOUTH COAST 2013-01-01 18:00:00
3 2013-01-01 09:00:00 2013-01-03 03:00:00 SOUTH COAST 2013-01-02 06:00:00
4 2013-01-01 09:00:00 2013-01-03 03:00:00 SOUTH COAST 2013-01-02 13:00:00
5 2013-01-02 09:00:00 2013-01-04 03:00:00 SOUTH COAST 2013-01-02 09:00:00
6 2013-01-02 09:00:00 2013-01-04 03:00:00 SOUTH COAST 2013-01-02 12:00:00
7 2013-01-02 09:00:00 2013-01-04 03:00:00 SOUTH COAST 2013-01-03 12:00:00
8 2013-01-03 09:00:00 2013-01-05 03:00:00 SOUTH COAST 2013-01-03 09:00:00
9 2013-01-03 09:00:00 2013-01-05 03:00:00 SOUTH COAST 2013-01-03 15:00:00
10 2013-01-03 09:00:00 2013-01-05 03:00:00 SOUTH COAST 2013-01-04 00:00:00
forecastHour forecast_Dir forecast_WindSpeed_low forecast_WindSpeed_gust
1 0 270 35 NA
2 9 270 25 NA
3 21 225 15 NA
4 28 270 35 NA
5 0 270 25 NA
6 3 270 40 NA
7 27 315 25 NA
8 0 270 35 NA
9 6 315 25 NA
10 15 315 25 NA
forecast_WindSpeed_high forecast_WindSpeed_exception_1_type
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA
7 35
8 NA
9 35
10 NA
forecast_Dir_exception_1 forecast_WindSpeed_low_exception_1
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA
7 NA
8 NA
9 NA
10 NA
forecast_WindSpeed_high_exception_1 poly_id
1 NA fea1-290
2 NA fea1-290
3 NA fea1-290
4 NA fea1-290
5 NA fea1-290
6 NA fea1-290
7 NA fea1-290
8 NA fea1-290
9 NA fea1-290
10 NA fea1-290
2. My approach(es)
I'm quite new to R, and I've thought about this problem in several ways:
Merging the data sets date and then filtering when the poly_id matched either region_id or the subRegion_id.
Expanding/upsampling the forecast data so that all hours have a forecast, not just the hours where the forecast changes.
3. Issues I ran into
The result of my first approach was a successful data merge. However, the results left out any observation that didn't occur on a forecastTime - Because many of the observations didn't occur on a forecast time, but actually between forecast times or, after the last forecast time and before the end time.
I was unsuccessful in my second approach in upsampling the data.
4. My ideal output
I need either:
The forecastTime column in the first data set to be upsampled to include all the hours in between forecast times. From there i think I could use my original approach and merge the datasets myself.
Merge the data sets which include observations which occur on and between forecast times, and also matching the poly_id from the forecast data set with region_id or subRegion_id from the observations data set.
Here are smaller versions of the data sets in proper format according to How to make a great R reproducible example
forecast data:
structure(list(IssueDatetime = structure(c(1L, 1L, 1L, 1L, 2L,
2L, 2L, 3L, 3L, 3L), .Label = c("2013-01-01 09:00:00", "2013-01-02 09:00:00",
"2013-01-03 09:00:00"), class = "factor"), Regions = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "SOUTH COAST", class = "factor"),
forecastTime = structure(c(1L, 2L, 3L, 6L, 4L, 5L, 8L, 7L,
9L, 10L), .Label = c("2013-01-01 09:00:00", "2013-01-01 18:00:00",
"2013-01-02 06:00:00", "2013-01-02 09:00:00", "2013-01-02 12:00:00",
"2013-01-02 13:00:00", "2013-01-03 09:00:00", "2013-01-03 12:00:00",
"2013-01-03 15:00:00", "2013-01-04 00:00:00"), class = "factor"),
forecastHour = c(0, 9, 21, 28, 0, 3, 27, 0, 6, 15), forecast_Dir = structure(c(2L,
2L, 1L, 2L, 2L, 2L, 3L, 2L, 3L, 3L), .Label = c("225", "270",
"315"), class = "factor"), forecast_WindSpeed_low = c(35,
25, 15, 35, 25, 40, 25, 35, 25, 25), forecast_WindSpeed_gust = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), forecast_WindSpeed_high = c(NA,
NA, NA, NA, NA, NA, 35, NA, 35, NA), forecast_WindSpeed_exception_1_type = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "", class = "factor"),
forecast_Dir_exception_1 = structure(c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), .Label = "", class = "factor"), forecast_WindSpeed_low_exception_1 = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_), forecast_WindSpeed_high_exception_1 = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_), poly_id = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "fea1-290", class = "factor")), .Names = c("IssueDatetime",
"Regions", "forecastTime", "forecastHour", "forecast_Dir", "forecast_WindSpeed_low",
"forecast_WindSpeed_gust", "forecast_WindSpeed_high", "forecast_WindSpeed_exception_1_type",
"forecast_Dir_exception_1", "forecast_WindSpeed_low_exception_1",
"forecast_WindSpeed_high_exception_1", "poly_id"), row.names = c(NA,
10L), class = "data.frame")
observations data:
structure(list(shipName = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L), .Label = "teleost", class = "factor"), dateTime = structure(c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 8L, 8L), .Label = c("2000-01-04 18:00",
"2000-01-05 11:00", "2000-01-06 14:00", "2000-01-06 17:00", "2000-01-08 1:00",
"2000-01-09 10:00", "2000-01-09 15:00", "2000-01-10 12:00"), class = "factor"),
windDir = c(29L, 16L, 29L, 27L, 19L, 28L, 24L, 1L, 1L, 1L
), windSpeed = c(9L, 13L, 37L, 30L, 19L, 13L, 9L, 13L, 13L,
13L), region_id = structure(c(2L, 2L, 2L, 2L, 2L, 1L, 1L,
3L, 3L, 3L), .Label = c("fea1-284", "fea1-287", "fea1-314"
), class = "factor"), subRegion_id = structure(c(3L, 2L,
2L, 2L, 2L, 1L, 1L, 4L, 5L, 6L), .Label = c("fea1-286", "fea1-288",
"fea1-289", "fea1-319", "fea1-323", "fea1-325"), class = "factor")), .Names = c("shipName",
"dateTime", "windDir", "windSpeed", "region_id", "subRegion_id"
), row.names = c(NA, 10L), class = "data.frame")
sample code of where I've gotten thus far:
library(dplyr)
#reads csv files as dataframes
fd <- read.csv("forecastData_DONE.csv", header = TRUE)
obs <- read.csv("dfo_obs_UPDATE.csv", header=TRUE)
fd$forecast_windSpeed_avg <- (fd$forecast_WindSpeed_low + fd$forecast_WindSpeed_high)/2
fd$date_time <- as.POSIXct(fd$forecastTime)
fd$endtime <- as.POSIXct(fd$endtime)
obs$date_time <- as.POSIXct(obs$dateTime)
fd <- fd %>% select(IssueDatetime, Regions, forecastTime, forecastHour, forecast_Dir, forecast_WindSpeed_low, forecast_WindSpeed_gust, forecast_WindSpeed_high, forecast_WindSpeed_exception_1_type, forecast_Dir_exception_1, forecast_WindSpeed_low_exception_1, forecast_WindSpeed_high_exception_1, poly_id)
df <- left_join(fd, obs, by = "date_time")
i <- sapply(df, is.factor)
df[i] <- lapply(df[i], as.character)
df_final <- df[ which(df$poly_id==df$region_id | df$poly_id==df$subRegion_id), ]
This code merges the data by date and region, but leaves out all the data that occurs when the forecasTime and observation time are not an exact match (which is most of the time).