1

I am looking for a way to interpolate as I am merging two data sets.

I have one data frame with readings for a number of different data loggers, and a second with spot measurements. I need to match up the hand measurements with the logger readings from the same time so I can compare them and compute off-sets. Unfortunately, the logger measurements are regularly spaced (hourly), while the hand measurements are not at the intervals.

I would like to interpolate the logger values to get the logger value for the reading, but am struggling on how to make sure I'm grabbing the correct measurement.

sample data

library(tidyverse)

Start.date <- "2019-12-18 00:00:00"
end.date <- "2019-12-20 00:00:00"
set.seed(100)

loggers <- tibble(
      datetime = rep(seq.POSIXt(as.POSIXct(Start.date), as.POSIXct(end.date), by='1 hour'),4),
      Site = rep(LETTERS[1:4], each = 49),
      reading = c(rnorm(49, mean = 10, sd = 3), 
                  rnorm(49, mean = 15, sd = 3), 
                  rnorm(49, mean = 20, sd = 3), 
                  rnorm(49, mean = 25, sd = 3)
                  )
)

hand_meas <- tibble(
      Site = rep(LETTERS[1:4], each = 2),
      datetime = as.POSIXct(rep(c("2019-12-18 12:35:00", "2019-12-19 13:45:00", "2019-12-18 12:55:00", "2019-12-19 13:15:00" ),2)),
      meas = c(10, 11, 14, 16, 19, 19.2, 23, 24) 
)      

head(loggers)
# # A tibble: 6 x 3
# datetime            Site  reading
# <dttm>              <chr>   <dbl>
#       1 2019-12-18 00:00:00 A        7.65
# 2 2019-12-18 01:00:00 A        6.99
# 3 2019-12-18 02:00:00 A       13.8 
# 4 2019-12-18 03:00:00 A       12.3 
# 5 2019-12-18 04:00:00 A       11.6 
# 6 2019-12-18 05:00:00 A       14.3 

head(hand_meas)
# # A tibble: 6 x 3
# Site  datetime             meas
# <chr> <dttm>              <dbl>
# 1 A     2019-12-18 12:35:00  10  
# 2 A     2019-12-19 13:45:00  11  
# 3 B     2019-12-18 12:55:00  14  
# 4 B     2019-12-19 13:15:00  16  
# 5 C     2019-12-18 12:35:00  19  
# 6 C     2019-12-19 13:45:00  19.2


My typical approach would be to either left_join() the logger data to the hand measurements, or to use approx() to interpolate values, but neither of these work in this situation.

## This fails because it needs exact matches
left_join(hand_meas, loggers, by = c("Site", "datetime"))

# # A tibble: 8 x 4
# Site  datetime             meas reading
# <chr> <dttm>              <dbl>   <dbl>
#       1 A     2019-12-18 12:35:00  10        NA
# 2 A     2019-12-19 13:45:00  11        NA
# 3 B     2019-12-18 12:55:00  14        NA
# 4 B     2019-12-19 13:15:00  16        NA
# 5 C     2019-12-18 12:35:00  19        NA
# 6 C     2019-12-19 13:45:00  19.2      NA
# 7 D     2019-12-18 12:55:00  23        NA
# 8 D     2019-12-19 13:15:00  24        NA

## Succeeds, but does includes readings from all of the sites
approx(loggers$datetime, loggers$reading, hand_meas$datetime)
# $x
# [1] "2019-12-18 12:35:00 PST" "2019-12-19 13:45:00 PST" "2019-12-18 12:55:00 PST" "2019-12-19 13:15:00 PST"
# [5] "2019-12-18 12:35:00 PST" "2019-12-19 13:45:00 PST" "2019-12-18 12:55:00 PST" "2019-12-19 13:15:00 PST"
# 
# $y
# [1] 17.67616 19.19072 17.75920 18.91207 17.67616 19.19072 17.75920 18.91207
# 
# Warning message:
#       In regularize.values(x, y, ties, missing(ties)) :
#       collapsing to unique 'x' values

I could also use data.table to get the nearest logger value, but my real data has significant fluctuations over the course of the day, so need to interpolate the measurements from either side

# This is close, Using data.table to join based on nearest timestamp see question 31818444
#  https://stackoverflow.com/questions/31818444/join-two-data-frames-in-r-based-on-closest-timestamp
library(data.table)

setDT(hand_meas)[, logger_reading := setDT(loggers)[hand_meas, reading, on = c("Site", "datetime"), roll = "nearest"]]
head(hand_meas)
# Site            datetime meas logger_reading
# 1:    A 2019-12-18 12:35:00 10.0       12.21952
# 2:    A 2019-12-19 13:45:00 11.0       13.19621
# 3:    B 2019-12-18 12:55:00 14.0       13.86335
# 4:    B 2019-12-19 13:15:00 16.0       15.64910
# 5:    C 2019-12-18 12:35:00 19.0       20.76380
# 6:    C 2019-12-19 13:45:00 19.2       19.54722

Can anyone suggest a way to do something like approx() while limiting the source data based on the site? Or a data.table approach that interpolates rather than strictly matching?

Brian Fisher
  • 1,305
  • 7
  • 17

1 Answers1

0

I realized I could combine the two approaches by writing a function that limits the source data by site, then interpolates to the reading.

approx_by_site = function(site_id, datetime_in) {
      dt = loggers %>%
            filter(Site == site_id)
      out = approx(dt$datetime, dt$reading, datetime_in)
      return(out$y)
}
with_readings = hand_meas %>%
      rowwise() %>%  # required or returns bad data
      mutate( Logger = approx_by_site(Site, datetime)) %>%
      ungroup()

with_readings
# # A tibble: 8 x 4
# Site  datetime             meas Logger
# <chr> <dttm>              <dbl>  <dbl>
# 1 A     2019-12-18 12:35:00  10     11.0
# 2 A     2019-12-19 13:45:00  11     12.7
# 3 B     2019-12-18 12:55:00  14     13.9
# 4 B     2019-12-19 13:15:00  16     16.1
# 5 C     2019-12-18 12:35:00  19     20.7
# 6 C     2019-12-19 13:45:00  19.2   20.0
# 7 D     2019-12-18 12:55:00  23     24.4
# 8 D     2019-12-19 13:15:00  24     26.9

This works, I think the rowwise() call is needed because of the filter in my function.
I am concerned that this method could bog down if the source (logger) data file gets large because of the repeated calls to filter.

Brian Fisher
  • 1,305
  • 7
  • 17