Position in vector based on approximate matching

Question

I have the sorted vector

m<-c(1.1, 3.2, 3.6, 4, 4.6, 4.6, 5.6, 5.7, 6.2, 8.9)

I want to find the position of a value based on approximate matching. If the value does not exist in the vector i would like the position of the immediately previous value

for exact matching I would use

> match(4,m)
[1] 4

But if I do

> match(6,m)
[1] NA

What i would like to get in this example is 8 (the position of the immidiately previous value of 6 which is the position of 5.7 which is 8)

Thank you in advance

Do you want to find the index for a single value or for several values? If you need several values, see my answer below regarding using `findInterval`. — Tommy, Aug 31 '11 at 16:19
See also [How to efficiently find the index of a value in a sorted array?](https://stackoverflow.com/questions/70240851/how-to-efficiently-find-the-index-of-a-value-in-a-sorted-array) for some fast `Rcpp` alternatives. — Henrik, Dec 06 '21 at 11:37

Tommy · Answer 1 · 2011-08-31T15:46:21.367

11

There is a built-in function that does precisely what you want: findInterval ...It's vectorized as well, so you can give it several values to find in one go which is much more efficient.

m <- c(1.1, 3.2, 3.6, 4, 4.6, 4.6, 5.6, 5.7, 6.2, 8.9)
# Find nearest (lower) index for both 4 and 6
findInterval(c(4,6), m) 
# [1] 4 8

edited Aug 31 '11 at 15:46

answered Aug 31 '11 at 15:37

Tommy

39,997
12
90
85

1

+1 Brilliant. I knew this had to exist, but have never been able to find it. Thank you. – Andrie Aug 31 '11 at 16:31
@Andrie - Yeah, the naming of R functions isn't always obvious! – Tommy Aug 31 '11 at 16:52

John · Answer 2 · 2011-08-31T08:16:25.917

3

This should do it for you.

#v is a sorted vector and x is an item for which we want the exact or nearest lower match
lowSideMatch <- function(x, v) max(which(v <= x))

lowSideMatch(6, m)
[1] 8

lowSideMatch(4, m)
[1] 4

edited Aug 31 '11 at 08:16

answered Aug 31 '11 at 08:09

John

23,360
7
57
83

Andrie · Accepted Answer · 2011-08-31T09:18:36.960

3

Use which.max in combination with vector subsetting, a solution of 17 characters:

which.max(m[m<=6]) # Edited to specify <=
[1] 8

Since your vector is sorted, you can use the even shorter:

sum(m<=6) # Edited to specify <=
[1] 8

This works because the value TRUE is implicitly converted to 1 in the sum

edited Aug 31 '11 at 09:18

answered Aug 31 '11 at 08:14

Andrie

176,377
47
447
496

1

@ECII I just noticed a small error. I think the correct operator is `<=` instead of `<` - otherwise it will not find exact matches. Please check your own results. – Andrie Aug 31 '11 at 09:18
1

damn... I had considered sum() but thought it might be obtuse for the questioner... clarifying why was a good idea. :) BTW, max(which()) is, surprisingly to some, faster than which.max(y[]) because of the speed of []... it's almost as fast as the sum() solution – John Aug 31 '11 at 10:29

GKi · Answer 4 · 2021-12-06T10:34:32.990

Using c++ upper_bound with Rcpp might be fast.

Rcpp::cppFunction(
  "int upper_bound(double v, const Rcpp::NumericVector& x) {
     return std::distance(x.begin(), std::upper_bound(x.begin(), x.end(), v));
}")

upper_bound(4, m)
#[1] 4

upper_bound(6, m)
#[1] 8

Benchmark:

set.seed(42)
n <- 1e6
m <- sort(rnorm(n, 0, 2))

bench::mark(which_min = c(which(abs(m-4) == min(abs(m-4))), which(abs(m-6) == min(abs(m-6))))
          , which.max = c(which.max(m[m<=4]), which.max(m[m<=6]))
          , length = c(length(m[m<=4]), length(m[m<=6]))
          , max_which = c(max(which(m <= 4)), max(which(m <= 6)))
          , sum = c(sum(m<=4), sum(m<=6))
          , findIntervall = findInterval(c(4,6), m)
          , CPP_upper_bound = c(upper_bound(4, m), upper_bound(6, m)))
#  expression           min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#  <bch:expr>      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#1 which_min        16.33ms  18.33ms      51.1        NA     55.0    26    28
#2 which.max         9.38ms  11.46ms      73.1        NA    101.     37    51
#3 length            8.41ms  10.06ms      79.7        NA    112.     40    56
#4 max_which         5.89ms   7.48ms     136.         NA     67.8    68    34
#5 sum               2.68ms   3.33ms     274.         NA     45.9   137    23
#6 findIntervall   679.33µs  788.1µs    1214.         NA      0     607     0
#7 CPP_upper_bound   2.31µs   2.75µs  307024.         NA     30.7 10000     1

mzuba · Answer 5 · 2011-08-31T08:19:21.493

1

you could use the which() function to get the index of the element with the smallest deviation from your searched value. Also works with unordered vectors.

x <- c(8,4,1,2,3,6,9)
find <- 6
pos <- which(abs(x-find) == min(abs(x-find)))

edited Aug 31 '11 at 08:19

answered Aug 31 '11 at 08:04

mzuba

1,226
1
16
33

Tony Breyal · Answer 6 · 2011-08-31T08:00:14.617

0

Something like:

> m<-c(1.1, 3.2, 3.6, 4, 4.6, 4.6, 5.6, 5.7, 6.2, 8.9)

> ( index = length(m[m<=4]) )
[1] 4

> ( m[index] )
[1] 4

> ( index = length(m[m<=6]) )
[1] 8

> ( m[index] )
[1] 5.7

edited Aug 31 '11 at 08:00

answered Aug 31 '11 at 07:53

Tony Breyal

5,338
3
29
49

`m[index]` is an element of `m`, which isn't what the OP wants. – Aug 31 '11 at 07:54
Although you do bring up an important point that a vector-based solution is better in R than a loop. Lemme edit my answer... – Aug 31 '11 at 07:57
@Jack Maney I added some output to make it more clear. I think that's what the OP wanted, but will withdraw the answer if it's still incorrect (it's early and I have a hangover from hell) :D – Tony Breyal Aug 31 '11 at 08:02

Position in vector based on approximate matching

6 Answers6

Linked