Closest other Value in the same Vector

Question

I have a vector

set.seed(2)
x <- sample.int(20, 5)

[1]  4 14 11  3 16

Now, for every element I want to find

the element with the minimum distance (min(abs(x[i]-x[-i])) for element i), which here would be

[1]  3 16 14  4 14

the (first) index of the element with the minimum distance, which here would be

[1] 4 5 2 1 2

The point is that the element itself is not considered, but only all the other elements, which is why this R - Fastest way to find nearest value in vector is not the answer.

If the actual answer is out there, sorry - I didn't find it.

G. Grothendieck · Accepted Answer · 2020-02-22T03:52:48.910

1) Rfast Using dista in Rfast we get the indexes of the closest two. Take the second closest as the closest will be the same value.

library(Rfast)
x <- c(4, 14, 11, 3, 16) # input

x[ dista(x, x, k = 2, index = TRUE)[, 2] ]
## [1]  3 16 14  4 14

2) sqldf Using SQL we can left join DF to itself excluding the same value value and take the row with the minimum distance.

DF <- data.frame(x)   # x is from (1)
sqldf("select a.x, b.x nearest, min(abs(a.x - b.x)) 
  from DF a 
  left join DF b on a.x != b.x 
  group by a.rowid")[1:2]

giving:

   x nearest
1  4       3
2 14      16
3 11      14
4  3       4
5 16      14

3) zoo Sort the input, take the value corresponding to the least difference on either of side of each element and order it back.

library(zoo)
ix <- order(x)
least <- function(x) if (x[2] - x[1] < x[3] - x[2]) x[1] else x[3]
rollapply(c(-Inf, x[ix], Inf), 3, least)[order(ix)]
## [1]  3 16 14  4 14

4) Base R Using ix and least from (3) we can mimic (3) using only base functions as follows.

apply(embed(c(-Inf, x[ix], Inf),  3)[, 3:1], 1, least)[order(ix)]
## [1]  3 16 14  4 14

4a) This slightly shorter variation would also work:

-apply(embed(-c(-Inf, x[ix], Inf),  3), 1, least)[order(ix)]
## [1]  3 16 14  4 14

4b) Simplifying further we have the following base solution where, again, ix is from (3):

xx <- x[ix]
x1 <- c(-Inf, xx[-length(xx)])
x2 <- c(xx[-1], Inf)
ifelse(xx - x1 < x2 - xx, x1, x2)[order(ix)]
## [1]  3 16 14  4 14

Duplicates

The example in the question had no duplicates but if there were duplicates there is some question regarding the problem definition. For example if we had c(1, 3, 4, 1) then if we look at the first value, 1, there is another value exactly equal to it so the closest value is 1. Another interpretation is that the closest value not equal to 1 should be returned which in this case is 3. In the codes above the sqldf solution gives the closest value not equal to the current value whereas the others give the closest value among the remaining values.

If we wanted the interpretation of the closest not equal for those other than sqldf then we could use rle after ordering to compress it down to unique values and then use inverse.rle afterwards as shown on the modified 4b:

x <- c(1, 3, 4, 1)
ix <- order(x)
r <- rle(x[ix])
xx <- r$values
x1 <- c(-Inf, xx[-length(xx)])
x2 <- c(xx[-1], Inf)
r$values <- ifelse(xx - x1 < x2 - xx, x1, x2)
inverse.rle(r)[order(ix)]
## [1] 3 4 3 3

Is there a resource you could link that would explain why this sql query is apparently changing the value of `b.x` selected based on the fact that an additional calculated column is included? Or a term for that kind of thing? Not very knowledgeable about SQL and I've never seen a query like this where including an extra column changes the value of another column. — IceCreamToucan, Feb 21 '20 at 17:24
I am not clear on the question but what the SQL does is join DF to itself where the on condition excludes joining equal values so at this point conceptually each row is joined to each other row satisfying the on condition. It then groups by rowid of the left hand data frame and for each rowid group takes the row which minimizes abs(...). SQLite has an extension to SQL where if you use min (or max) the other elements selected are guaranteed to be from the same row as the minimizing row. The SQLite site has extensive documentation on its use of SQL. — G. Grothendieck, Feb 21 '20 at 17:54
The part where "the other elements selected are guaranteed to be from the same row as the minimizing row" is what I was curious about. Hadn't seen that before. Good to know that's specifically a SQLite extension of the standard, thanks. — IceCreamToucan, Feb 21 '20 at 18:04

score 4 · Answer 2 · answered Feb 21 '20 at 20:52

I was very interested in this question and in the approaches suggested in the other responses, so I compared them with regard to their running time (and I added another approach using the package RANN). The code is appended below. TL;DR: The base R version 4b by user G. Grothendieck was most efficient, and by a significant margin.

library(RANN)
library(zoo)
library(data.table)
library(Rfast)
library(sqldf)

# All functions take a vector as argument, 
# and return the values of nearest neighbours (not their index)

# Using base R, by ThomasIsCoding
base_nn <- function(x) {
  d <- data.frame(`diag<-`(as.matrix(dist(x)),Inf))
  id <- unlist(Map(which.min,d))
  x[id]
}

# Using Rfast, by G. Grothendieck
rfast_nn <- function(x) {
  x[ dista(x, x, k = 2, index = TRUE)[, 2] ]
}

# Using sqldf, by G. Grothendieck
sqldf_nn <- function(x) {
  DF <- data.frame(x)   # x is from (1)
  unname(
    unlist(sqldf("select a.x, b.x nearest, min(abs(a.x - b.x)) 
            from DF a 
            left join DF b on a.x != b.x 
            group by a.rowid")[2])
  )
}

# Using `zoo`, by G. Grothendieck
zoo_nn <- function(x) {
  ix <- order(x)
  least <- function(x) if (x[2] - x[1] < x[3] - x[2]) x[1] else x[3]
  rollapply(c(-Inf, x[ix], Inf), 3, least)[order(ix)]
}

# Using base R (v 4), by G. Grothendieck
base2_nn <- function(x) {
  ix <- order(x)
  least <- function(x) if (x[2] - x[1] < x[3] - x[2]) x[1] else x[3]
  apply(embed(c(-Inf, x[ix], Inf),  3)[, 3:1], 1, least)[order(ix)]
}

# Using base R (v 4a), by G. Grothendieck
base3_nn <- function(x) {
  ix <- order(x)
  least <- function(x) if (x[2] - x[1] < x[3] - x[2]) x[1] else x[3]
  -apply(embed(-c(-Inf, x[ix], Inf),  3), 1, least)[order(ix)]
}

# Using base R (v 4b), by G. Grothendieck
base4_nn <- function(x) {
  ix <- order(x)
  xx <- x[ix]
  x1 <- c(-Inf, xx[-length(xx)])
  x2 <- c(xx[-1], Inf)
  ifelse(xx - x1 < x2 - xx, x1, x2)[order(ix)]
}

# Using data.table, by IceCreamToucan
dt_nn <- function(x) {
  dt <- setkey(data.table(x), x)
  dt[dt, on = .(x > x), mult = 'first', lowx := i.x][, lowx := fcoalesce(lowx + .0, -Inf)]
  dt[dt, on = .(x < x), mult = 'last', highx := i.x][, highx := fcoalesce(highx + .0, Inf)]
  dt[, closex := fifelse(x - lowx < highx - x, lowx, highx)]
  unname(unlist(dt[, .(closex)]))
}

# Using, RANN, by me
rann_nn <- function(x) {
  id <- RANN::nn2(as.matrix(x), k = 2)$nn.idx[, 2]
  x[id]
}



### Apply all methods

# Test that all have the same output:
x <- c(4, 14,11,3,16)

rann_nn(x)
# [1]  3 16 14  4 14
base_nn(x)
# [1]  3 16 14  4 14
rfast_nn(x)
# [1]  3 16 14  4 14
sqldf_nn(x)
# [1]  3 16 14  4 14
zoo_nn(x)
# [1]  3 16 14  4 14
base2_nn(x)
# [1]  3 16 14  4 14
base3_nn(x)
# [1]  3 16 14  4 14
base4_nn(x)
# [1]  3 16 14  4 14
dt_nn(x) # differently ordered for some reason
# [1]  4  3 14 16 14


# Compare running times
library(microbenchmark)

# Compare for N = 1000 elements
benchmark_data <- rnorm(1000)
microbenchmark(
  rann_nn(benchmark_data),
  base_nn(benchmark_data),
  rfast_nn(benchmark_data),
  sqldf_nn(benchmark_data),
  zoo_nn(benchmark_data),
  base2_nn(benchmark_data),
  base3_nn(benchmark_data),
  base4_nn(benchmark_data),
  dt_nn(benchmark_data)
)
# Unit: microseconds
#                      expr        min          lq        mean      median          uq        max neval
#   rann_nn(benchmark_data)    641.180    684.1975    776.5467    711.6680    775.3635   3822.023   100
#   base_nn(benchmark_data) 166523.177 179240.8130 209471.1333 187633.0515 249740.8425 330864.712   100
#  rfast_nn(benchmark_data)  45160.603  47032.5225  47681.0557  47594.0075  48308.8440  50579.839   100
#  sqldf_nn(benchmark_data) 133916.594 138769.8175 143505.9315 140543.3250 143830.2765 211873.960   100
#    zoo_nn(benchmark_data)   4359.359   4604.0275   5008.4291   4785.1515   5037.9705  14999.802   100
#  base2_nn(benchmark_data)   1292.322   1407.4875   1747.8404   1462.7295   1588.1580  11297.321   100
#  base3_nn(benchmark_data)   1263.644   1396.9210   1615.7495   1472.9940   1571.8575  11828.015   100
#  base4_nn(benchmark_data)    119.543    146.1080    254.5075    178.1065    197.4265   7726.156   100
#     dt_nn(benchmark_data)   5290.337   6580.6965   7111.1816   6892.3800   7351.3795  29469.815   100


# For N = 100000, leaving out the slowest versions (e.g., `base_nn()`
# no longer works because a distance matrix cannot be computed for 
# N = 100000)
benchmark_data <- rnorm(100000)
microbenchmark(
  rann_nn(benchmark_data),
  zoo_nn(benchmark_data),
  base2_nn(benchmark_data),
  base3_nn(benchmark_data),
  base4_nn(benchmark_data),
  dt_nn(benchmark_data)
)
# Unit: milliseconds
#                      expr        min        lq      mean    median        uq      max neval
#   rann_nn(benchmark_data) 130.957025 141.02904 149.94052 148.60184 156.14506 271.1882   100
#    zoo_nn(benchmark_data) 606.690004 673.88980 720.12545 717.51658 766.98190 886.4397   100
#  base2_nn(benchmark_data) 142.554407 176.30358 198.58375 193.34812 212.33885 329.5470   100
#  base3_nn(benchmark_data) 142.074126 168.78195 189.65122 184.45025 205.89414 287.0740   100
#  base4_nn(benchmark_data)   9.354764  10.46687  17.22086  12.36354  14.22882 166.4758   100
#     dt_nn(benchmark_data)  96.503882 104.06914 117.95408 108.20284 121.11428 247.2092   100

score 3 · Answer 3 · answered Feb 21 '20 at 15:19

3

Here is a base R solution

d <- data.frame(`diag<-`(as.matrix(dist(x)),Inf))
ids <- unlist(Map(which.min,d))
val <- x[ids]

such that

> ids
X1 X2 X3 X4 X5 
 4  5  2  1  2 

> val
[1]  3 16 14  4 14

DATA

x <- c(4, 14,11,3,16)

answered Feb 21 '20 at 15:19

ThomasIsCoding

96,636
9
24
81

If I might ask: What is this "backtick" function<-( call thing (`diag<-`) called? Iv'e never come across that. And I struggled quite a while at exactly this step (how to get rid of the 0s) when trying to find a good answer for the question here... – dario Feb 21 '20 at 16:03
`\`diag<-\`(mat, someval)` is the same as `diag(mat) <- someval`, but the resulting modified `mat` is returned in the backtick version. This is referred to as the "replacement form" in `help(diag)` – IceCreamToucan Feb 21 '20 at 16:06

score 3 · Answer 4 · answered Feb 21 '20 at 15:56

Option with a data.table non-equi-join

dt <- setkey(data.table(x), x)

dt[dt, on = .(x > x), mult = 'first', lowx := i.x][, lowx := fcoalesce(lowx + .0, -Inf)]
dt[dt, on = .(x < x), mult = 'last', highx := i.x][, highx := fcoalesce(highx + .0, Inf)]
dt[, closex := fifelse(x - lowx < highx - x, lowx, highx)]
dt[, .(x, closex)]

#     x closex
# 1:  3      4
# 2:  4      3
# 3: 11     14
# 4: 14     16
# 5: 16     14

Closest other Value in the same Vector

4 Answers4

Duplicates

Linked