
Say I have a list [1,2,3,4,5,6,7]. I want to find the 3 closest numbers to, say, 6.5. Then the returned value would be [5,6,7].

Finding the single closest number is straightforward in Python:

min(myList, key=lambda x:abs(x-myNumber))

But I am trying to avoid wrapping this in a loop to find the k closest numbers. Is there a Pythonic way to achieve this?

– frazman

4 Answers


The short answer

The heapq.nsmallest() function will do this neatly and efficiently:

>>> from heapq import nsmallest
>>> s = [1,2,3,4,5,6,7]
>>> nsmallest(3, s, key=lambda x: abs(x - 6.5))
[6, 7, 5]

Essentially this says, "Give me the three input values that have the smallest absolute difference from the number 6.5". Note that the result is ordered by distance rather than by value, and that the tie between 6 and 7 (both 0.5 away) is resolved by their order in the input.

Optimizing for repeated lookups

In the comments, @Phylliida asked how to optimize for repeated lookups with differing start points. One approach would be to pre-sort the data and then use bisect to locate the center of a small search segment:

from bisect import bisect

def k_nearest(k, center, sorted_data):
    'Return *k* members of *sorted_data* nearest to *center*'
    i = bisect(sorted_data, center)
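    # Each of the k nearest values must lie within k positions of the
    # insertion point, so a window of at most 2k candidates suffices.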
    segment = sorted_data[max(i-k, 0) : i+k]
    return nsmallest(k, segment, key=lambda x: abs(x - center))

For example:

>>> s.sort()
>>> k_nearest(3, 6.5, s)
[6, 7, 5]
>>> k_nearest(3, 0.5, s)
[1, 2, 3]
>>> k_nearest(3, 4.5, s)    
[4, 5, 3]
>>> k_nearest(3, 5.0, s)
[5, 4, 6]
– Raymond Hettinger
  • Then, why does the documentation state: "The latter two functions perform best for smaller values of n. For larger values, it is more efficient to use the `sorted()` function"? To me this is equivalent to saying `nsmallest` is worse than O(nlogn) in asymptotic complexity, but happens to be more efficient for a range of small `n`s. Either the algorithm isn't as efficient as you state, or the documentation is wrong in some way. – Bakuriu Jun 09 '14 at 10:51
  • @Bakuriu: `nsmallest()` has a time complexity of O(_n_ log _k_), where _n_ is the size of the container and _k_ is the number of smallest values you are looking for. The complexity of `sorted()` is O(_n_ log _n_), but it has a better constant factor. So if _k/n_ is small compared to 1, `nsmallest()` will win. If _k_ becomes bigger compared to _n_, at some point `sorted()` will win due to the better constant. Neither Raymond nor the documentation are wrong, though the documentation could be clearer that "smaller" actually means smaller compared to the size of the collection. – Sven Marnach Jun 09 '14 at 12:15
  • @SvenMarnach: Linear time selection is possible as well, which the documentation neglects. – Neil G Jun 09 '14 at 17:27
  • @NeilG: `nsmallest()` returns the _k_ smallest elements in sorted order, and the fastest algorithm to do this is O(_n_ log _k_). I believe it's still the fastest algorithm if you drop the requirement that the result is sorted. You can select the kth smallest element in O(_n_), but I fail to see how you could select the k smallest elements. – Sven Marnach Jun 11 '14 at 22:07
  • @SvenMarnach: No, the fastest algorithm to do that is O(n). You can select the kth smallest element in linear time, and you can compare the elements in the original array against it in linear time. The total algorithm is therefore linear time. – Neil G Jun 11 '14 at 22:33
  • @RaymondHettinger: Yes, but as k grows, then the selection algorithm will beat the heap algorithm. It is linear time. (I think the constant is 6, but I forget.) – Neil G Jun 12 '14 at 02:12
  • @RaymondHettinger: Please let's not get into a long discussion where we're talking past each other. We can both agree that selection is better when k is large compared with n, say n/4. The heap algorithm is better when k is small. There are multiple selection algorithms including [Rivest's](http://en.wikipedia.org/wiki/Floyd%E2%80%93Rivest_algorithm), [quickselect](http://en.wikipedia.org/wiki/Quickselect), and [median-of-medians](http://en.wikipedia.org/wiki/Median_of_medians). – Neil G Jun 12 '14 at 21:14
  • I believe median-of-medians does about 10n total comparisons (with good locality of reference and parallelizability). Quickselect and Rivest's are fast on average. If you want the result in sorted order, then you need to pay klogk regardless of which algorithm you choose, but sorting has nothing to do with the posted question. – Neil G Jun 12 '14 at 21:21
  • And just to be clear, only quickselect has a "horrific worst-case". The other two algorithms have a linear or n+root(n) worst case. None of these three selection algorithms have "catastrophic cache performance" as they all have excellent locality of reference. If anything, the heap is worst in this regard as bubbling up an element to the top (which is rare in the average case, sure) hits O(logk) cache pages. – Neil G Jun 12 '14 at 21:34
  • If I need to do lots of queries (something like `[nsmallest(3, s, key=lambda x: abs(x-v)) for v in arr]`) is there a way to reuse the previous results to speed this up? I have about 1 million elements in s and 1 million elements in v – Phylliida Mar 18 '18 at 20:46
  • @Phylliida I updated the answer to include this case as well. – Raymond Hettinger Mar 19 '18 at 00:19
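A quick way to see the tradeoff Sven describes above (a minimal timing sketch, not from the original thread; absolute numbers vary by machine):

from heapq import nsmallest
from timeit import timeit
import random

data = [random.random() for _ in range(10**6)]
key = lambda x: abs(x - 0.5)

# k small relative to n: nsmallest's O(n log k) beats sorted's O(n log n)
print(timeit(lambda: nsmallest(3, data, key=key), number=10))
print(timeit(lambda: sorted(data, key=key)[:3], number=10))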

You could compute the distances and sort:

[n for d, n in sorted((abs(x-myNumber), x) for x in myList)[:k]]

This does the following:

  1. Create a sequence of tuples (d, x) where d is the distance to your target
  2. Sort that sequence (tuples compare by distance first, then by value)
  3. Select the first k elements of the sorted list
  4. Extract just the number values from the result, discarding the distances
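For example, with the list from the question:

myList = [1, 2, 3, 4, 5, 6, 7]
myNumber = 6.5
k = 3

print([n for d, n in sorted((abs(x - myNumber), x) for x in myList)[:k]])
# [6, 7, 5] -- the tie between 6 and 7 (both 0.5 away) is broken by value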
– Greg Hewgill

Both answers were good, and Greg was right: Raymond's answer is more high-level and easier to implement, but I built upon Greg's answer because it was easier to adapt to my needs.

In case anyone is searching for a way to find the n closest values in a list of dicts:

My dict looks like this, where npi is just an identifier that I need along with the value:

mydict = {u'fnpi': u'1982650024',
 u'snpi': {u'npi': u'1932190360', u'value': 2672},
 u'snpis': [{u'npi': u'1831289255', u'value': 20},
  {u'npi': u'1831139799', u'value': 20},
  {u'npi': u'1386686137', u'value': 37},
  {u'npi': u'1457355257', u'value': 45},
  {u'npi': u'1427043645', u'value': 53},
  {u'npi': u'1477548675', u'value': 53},
  {u'npi': u'1851351514', u'value': 57},
  {u'npi': u'1366446171', u'value': 60},
  {u'npi': u'1568460640', u'value': 75},
  {u'npi': u'1326046673', u'value': 109},
  {u'npi': u'1548281124', u'value': 196},
  {u'npi': u'1912989989', u'value': 232},
  {u'npi': u'1336147685', u'value': 284},
  {u'npi': u'1801894142', u'value': 497},
  {u'npi': u'1538182779', u'value': 995},
  {u'npi': u'1932190360', u'value': 2672},
  {u'npi': u'1114020336', u'value': 3264}]}

value = mydict['snpi']['value'] #value i'm working with below
npi = mydict['snpi']['npi'] #npi (identifier) i'm working with below
snpis = mydict['snpis'] #dict i'm working with below

To get a list of [id, value] pairs (not just a list of values), I use this:

[[id,val] for diff, val, id in sorted((abs(x['value']-value), x['value'], x['npi']) for x in snpis)[:6]]

Which produces this:

[[u'1932190360', 2672],
 [u'1114020336', 3264],
 [u'1538182779', 995],
 [u'1801894142', 497],
 [u'1336147685', 284],
 [u'1912989989', 232]]

EDIT

I actually found it pretty easy to adapt Raymond's answer too, if you're dealing with a list of dicts (or a list of lists).

from heapq import nsmallest
[[i['npi'], i['value']] for i in nsmallest(6, snpis, key=lambda x: abs(x['value']-value))]

This will produce the same as the above output.

And this:

nsmallest(6, snpis, key=lambda x: abs(x['value'] - value))

will produce a list of dicts instead.
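That is, the same six entries as the original dicts:

[{u'npi': u'1932190360', u'value': 2672},
 {u'npi': u'1114020336', u'value': 3264},
 {u'npi': u'1538182779', u'value': 995},
 {u'npi': u'1801894142', u'value': 497},
 {u'npi': u'1336147685', u'value': 284},
 {u'npi': u'1912989989', u'value': 232}]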

– tmthyjames

For those who want the indices instead:

import numpy as np

def find_nearest_index(array, value, k):
    # Indices of the k values with the smallest absolute distance,
    # ordered nearest first
    array = np.array(array)
    return np.argsort(np.abs(array - value))[:k]

Example:

find_nearest_index([-3,0,1,2,4,5], 0.2, 4)

# array([1, 2, 3, 0], dtype=int64)
# distance = [3.20 0.20 0.80 1.80 3.80 4.80]
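If the k indices don't need to come back ordered by distance, np.argpartition does a partial selection in O(n) instead of a full sort (a sketch along the same lines; the function name is my own):

def find_nearest_index_unordered(array, value, k):
    # The first k indices hold the k smallest distances, in no
    # guaranteed order (requires k < len(array))
    array = np.array(array)
    return np.argpartition(np.abs(array - value), k)[:k]

find_nearest_index_unordered([-3, 0, 1, 2, 4, 5], 0.2, 4)
# e.g. array([1, 2, 3, 0]) -- same indices, order not guaranteed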
– Muhammad Yasirroni