26

I have two integer/POSIXct vectors:

a <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15) #has > 2 mil elements
b <- c(4,6,10,16) # 200000 elements

Now my resulting vector c should contain for each element of vector a the nearest element of b:

c <- c(4,4,4,4,4,6,6,...)

I tried it with apply and which.min(abs(a - b)) but it's very very slow.

Is there any more clever way to solve this? Is there a data.table solution?

Henrik
MikeHuber
  • If it's sorted like in your example it's just one pass through the bigger vector, keeping track of closest element in b manually, otherwise use binary search hinted above. – Łukasz Grad Apr 18 '17 at 11:56

6 Answers

42

As presented in the linked post, you can do either:

which(abs(x - your.number) == min(abs(x - your.number)))

or

which.min(abs(x - your.number))

where x is your vector and your.number is the value. If you have a matrix or data.frame, first convert it to a numeric vector in whatever way is appropriate, then apply this to the resulting numeric vector.

For example:

x <- 1:100
your.number <- 21.5
which(abs(x - your.number) == min(abs(x - your.number)))

would output:

[1] 21 22

Update: Based on the very kind comment of Hendy, I have added the following to make it clearer:

Note that the answers above (i.e. 21 and 22) are the indexes of the items (this is how which() works in R), so if you want the actual values, you have to use these indexes to extract them. Let's look at another example:

x <- seq(from = 100, to = 10, by = -5)
x
[1] 100  95  90  85  80  75  70  65  60  55  50  45  40  35  30  25  20  15  10

Now let's find the number closest to 42:

your.number <- 42
target.index <- which(abs(x - your.number) == min(abs(x - your.number)))
x[target.index]

which would output the "value" we are looking for from the x vector:

[1] 40
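To look up several numbers at once, one option is to loop over them with vapply. The following is only a minimal sketch: the vector `targets` is a hypothetical set of query values, and which.min is used so that exactly one index comes back per query (on a tie it returns the first matching index).

```r
x <- seq(from = 100, to = 10, by = -5)
targets <- c(42, 11, 97.5)   # hypothetical query values; 97.5 ties between 100 and 95

# One index per target; which.min() keeps only the first index on a tie
idx <- vapply(targets, function(y) which.min(abs(x - y)), integer(1))

nearest_vals <- x[idx]
nearest_vals
# [1]  40  10 100
```

This still computes length(x) differences per query, so it is fine for small inputs but will be slow for millions of elements.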
Mehrad Mahmoudian
  • 4
    Can you extend this easily to have your.number as a vector – Cyrillm_44 Aug 22 '19 at 00:34
  • @B.Quaink it logically shouldn't give you the wrong answer as long as you are dealing with Real Numbers. Can you post your numeric vector and your target vector? – Mehrad Mahmoudian Jul 28 '20 at 12:06
  • @MehradMahmoudian I converted my matrix to a vector and that seemed to do the trick. I think it took the whole matrix as one instead of looking at each numeric. Thanks – B.Quaink Jul 28 '20 at 15:12
  • 1
    Just a comment for any who are confused about which.min returning an *index*. I think the problem as asked, and as answered, are slightly ambiguous. The answer says the result should contain "each element of vector"... my face-value read is that an "element" is the "value", not the "index of the value." Due to using vectors where index==value, this might cause confusion. For example, `x <- c(1, 5, 10, 25, 30)`, `your.number <- 21.5` and `which.min(abs(x - your.number))` will return 4. Just making sure this is clear to readers. – Hendy Dec 07 '22 at 17:23
  • 1
    @Hendy Thanks, I thought it is clear because `which` always returns the index and not the value, and here we are using `which.min` or `which`. Anyways, I can see that this can be confusing for some people, therefore, I have updated the post to reflect your input (and of course with proper credit to you) :) Cheers. – Mehrad Mahmoudian Dec 10 '22 at 11:43
  • @Cyrillm_44 I think what you need is a simple sapply: `sapply(your.numberS, function(y){ which(abs(x-y) == min(abs(x-y))) })` – Mehrad Mahmoudian Dec 10 '22 at 11:49
11

Not quite sure how it will behave with your volume but cut is quite fast.

The idea is to cut your vector a at the midpoints between the elements of b.

Note that I am assuming the elements in b are strictly increasing!

Something like this:

a <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15) #has > 2 mil elements
b <- c(4,6,10,16) # 200000 elements

cuts <- c(-Inf, b[-1]-diff(b)/2, Inf)
# Will yield: c(-Inf, 5, 8, 13, Inf)

cut(a, breaks=cuts, labels=b)
# [1] 4  4  4  4  4  6  6  6  10 10 10 10 10 16 16
# Levels: 4 6 10 16

This is even faster using a lower-level function like findInterval (which, again, assumes that breakpoints are non-decreasing).

findInterval(a, cuts)
# [1] 1 1 1 1 2 2 2 3 3 3 3 3 4 4 4

So of course you can do something like:

index = findInterval(a, cuts)
b[index]
# [1]  4  4  4  4  6  6  6 10 10 10 10 10 16 16 16

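Note that you can choose what happens to elements of a that are equidistant from two elements of b by passing the relevant arguments to cut (right) or findInterval (left.open); see their help pages. With the defaults, cut uses right-closed intervals and findInterval uses left-closed ones, which is why 5, 8 and 13 are assigned differently in the two outputs above.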
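The midpoint idea can be wrapped in a small helper. This is a sketch rather than a tuned implementation: the name `nearest_in` is made up for illustration, and it sorts and de-duplicates b itself so that the breakpoints are guaranteed to be increasing. Ties go to the larger element of b, as with the default findInterval behaviour.

```r
# Hypothetical helper: for every element of a, return the nearest element of b
nearest_in <- function(a, b) {
  b <- sort(unique(b))               # findInterval() needs non-decreasing breakpoints
  if (length(b) == 1L) return(rep(b, length(a)))
  mids <- b[-1] - diff(b) / 2        # midpoints between neighbouring elements of b
  # findInterval() returns 0 below the first midpoint, so shift the index by 1
  b[findInterval(a, mids) + 1L]
}

a <- 1:15
b <- c(16, 4, 10, 6)                 # deliberately unsorted
res <- nearest_in(a, b)
res
# [1]  4  4  4  4  6  6  6 10 10 10 10 10 16 16 16
```

Since findInterval does a binary search for each element of a, the cost grows roughly with length(a) * log(length(b)) instead of length(a) * length(b).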

asachet
5
library(data.table)

a=data.table(Value=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15))

a[,merge:=Value]

b=data.table(Value=c(4,6,10,16))

b[,merge:=Value]

setkeyv(a,c('merge'))

setkeyv(b,c('merge'))

Merge_a_b=a[b,roll='nearest']

When joining two data.tables, data.table offers a rolling option, roll='nearest', which matches each row of the table inside the brackets (here b) to the row of the outer table (here a) with the nearest key value. The result therefore has as many rows as b. As usual, the join requires a common key.
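To get one value per element of a, as the question asks, the roles of the two tables can be swapped so that a is the table inside the brackets. The following is a sketch under that assumption; the names `a_dt`, `b_dt`, `a_val`, `b_val` and `key_val` are made up for illustration, and the sample values are chosen to avoid ties, since tie handling is not covered here.

```r
library(data.table)

a_dt <- data.table(a_val = c(1, 2, 3, 7, 9, 11, 15))
b_dt <- data.table(b_val = c(4, 6, 10, 16))

# Copy the join column so the original b values survive the join
b_dt[, key_val := b_val]
setkey(b_dt, key_val)

# For each row of a_dt, roll to the row of b_dt with the nearest key
res <- b_dt[a_dt, on = .(key_val = a_val), roll = "nearest"]

nearest_b <- res$b_val
nearest_b
# [1]  4  4  4  6 10 10 16
```

The result has one row per row of a_dt, in the order of a_dt; the key_val column of the result holds the values from a.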

Rohit Mishra
  • 3
    Welcome to Crossvalidated. Thank you for your answer. Can you extend your answer by explaining the code? – Ferdi Apr 18 '17 at 09:37
  • 2
    In Data table when we merge two data table, there is an option called nearest which put all the element in data table a to the nearest element in data table b. Size of the resultant data table will be equal to size of b (which ever is within the bracket). I requires a common key for merging as usual. – Rohit Mishra Apr 19 '17 at 07:10
  • Could you update the answer so that it produces a vector of length `a` as the OP asked? – Jonas Lindeløv Jan 25 '21 at 22:59
4

For those who would be satisfied with the slow solution:

sapply(a, function(a, b) {b[which.min(abs(a-b))]}, b)
3

Here is a simple base R option, using max.col + outer:

b[max.col(-abs(outer(a,b,"-")))]

which gives

> b[max.col(-abs(outer(a,b,"-")))]
 [1]  4  4  4  4  6  6  6 10 10 10 10 10 16 16 16
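Two caveats are worth keeping in mind with this approach. First, outer builds a full length(a) by length(b) matrix, so it only suits small inputs. Second, max.col breaks ties at random by default, so values such as 5, 8 and 13 may map to either neighbour from run to run. A sketch that makes the tie handling deterministic by passing ties.method:

```r
a <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
b <- c(4, 6, 10, 16)

d <- -abs(outer(a, b, "-"))                   # length(a) x length(b) matrix of negated distances
res <- b[max.col(d, ties.method = "first")]   # ties go to the earlier element of b
res
# [1]  4  4  4  4  4  6  6  6 10 10 10 10 10 16 16
```

Using ties.method = "last" instead sends tied values to the later element of b.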
ThomasIsCoding
1

Late to the party, but the DescTools package now has a function called Closest which does almost exactly what you want (it just doesn't handle multiple values at once).

To get around this, we can lapply over your vector a and find the closest value for each element.

library(DescTools)

lapply(a, function(i) Closest(x = b, a = i))

You might notice that more values are returned than there are elements in a. This is because Closest returns both values when the value you are testing lies exactly between two (e.g. 3 is exactly between 1 and 5, so both 1 and 5 would be returned).

To get around this, put either min or max around the result:

lapply(a, function(i) min(Closest(x = b, a = i)))
lapply(a, function(i) max(Closest(x = b, a = i)))

Then unlist the result to get a plain vector :)

morgan121