2

I have a time dependent variable represented as two vectors: a vector of times (sorted) and a vector of values at those times. I want to resample this variable at different times specified by a different sorted vector of times.

In another language I would do a simultaneous walk through the two sorted time vectors. i.e. linear search from the beginning of the old time vector until I find the time closest to the first element in the new time vector, then continuing on from that point in the old vector to find the time closest to the second element in the new vector, etc. This gives a solution that is O(n).

The key here is that the two vectors of times are not the same length and the elements are not paired one-to-one so something like map2 or walk2 isn't what I want.

I can implement the simultaneous walk with a for loop (see code below), and it works, but it's slow. I also have another solution that is more R codey, but it's O(n^2) so it also winds up being slow. Is there an R way of doing this that uses internal R implementations to wind up with an O(n) solution?

Alternatively, is there an R function that could replace my get_closest() with a binary search so at least it would be O(nlogn)?

From my searching I suspect the answer is going to be, "write a C function that you call from R", but I'm fairly new to R so I wanted to check that I'm not missing something.

EDIT:

I should have made clear that the values in new_times may not exist in old_times. I want to find the index in old_times where the time is closest to each entry in new_times. In my real application I will then do linear interpolation, but this question is just about searching for the nearest neighbor.

library(tidyverse)

# input values given
old_times  <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
old_values <- c(3, 7, 6, 7,  8,  9,  7,  6,  4,  6)
new_times  <- c(4.1, 9.6, 12.3, 17.8)

Desired output is

new_values <- c(7, 8, 9, 4)

My attempt

new_values <- rep(NA, length(new_times))
old_index  <- 1

for (new_index in 1:length(new_times)) {
  while (old_index < length(old_times) &&
         old_times[old_index] < new_times[new_index]) {
    old_index <- old_index + 1
  }

  # I could now do interpolation if the value of new_times is in between
  # two values in old_times.  The key is I have a correspondence that
  # new_times[new_index] is close in time to old_times[old_index].
  new_values[new_index] <- old_values[old_index]
}


# Here's an alternative way to do it that uses more R internals,
# but winds up being O(n^2).

# Get the index in old_times closest to new_time.
# This is O(n).
get_closest <- function(new_time, old_times) {
  return(which.min(abs(new_time - old_times)))
}

# Call get_closest on each element of new_times.
# This is O(n^2).
new_indices <- unlist(map(new_times, get_closest, old_times))

# Slice the list of old values to get new values.
new_values2 <- old_values[new_indices]
Bob Steinke
  • 116
  • 7

2 Answers2

2

We can use match

old_values[match(new_times, old_times)]
# [1] 7 8 9 4

match(new_times, old_times) returns "a vector of the positions of (first) matches of its first argument in its second.", i.e.

# [1] 2 5 6 9

We can use this result to extract desired values from old_values using [.


We can also use %in% which returns a boolean vector

old_values[old_times %in% new_times]

Thanks to @Andrew

markus
  • 25,843
  • 5
  • 39
  • 58
  • Ok, this looks promising. What if the values in new_times aren't in old_times, but I just want to get the index of the entry that is closest? I glossed over this in my example to keep the code from getting too long, but in my real application it might ask for the value at time 2.5 and I have to interpolate. – Bob Steinke Apr 09 '19 at 20:20
  • 1
    As an "alternative", `old_values[old_times %in% new_times]` is doing the same thing under the hood. I.e., see `?match`. I have found that `%in%` usually scales a *little* bit better (not noticeable in many cases). – Andrew Apr 09 '19 at 20:22
  • 1
    @BobSteinke It's best to create a minimal example and show expected output. You might consider to ask a new question. – markus Apr 09 '19 at 20:23
0

It looks like what is best for this is to use data.table. I found out about it in this other question:

Find closest value in a vector with binary search

There might be a possible optimization for data.table if it knows that both the search-for and search-in vectors are sorted it can do O(n) search instead of O(nlogn), but data.table is already very fast in my application.

Bob Steinke
  • 116
  • 7