
Say I have a list [1,2,3,4,5,6,7]. I want to find the 3 closest numbers to, say, 6.5. Then the returned value would be [5,6,7].

Finding the single closest number is straightforward in Python:

min(myList, key=lambda x:abs(x-myNumber))

But I am trying to avoid wrapping this in a loop to find the k closest numbers. Is there a Pythonic way to achieve this?

– frazman

4 Answers


The short answer

The heapq.nsmallest() function will do this neatly and efficiently:

>>> from heapq import nsmallest
>>> s = [1,2,3,4,5,6,7]
>>> nsmallest(3, s, key=lambda x: abs(x - 6.5))
[6, 7, 5]

Essentially this says, "Give me the three input values that have the smallest absolute difference from the number 6.5". Note that the result is ordered by distance rather than by value, and that the tie between 6 and 7 (both 0.5 away) is resolved by their order in the input.

Optimizing for repeated lookups

In the comments, @Phylliida asked how to optimize for repeated lookups with differing start points. One approach would be to pre-sort the data and then use bisect to locate the center of a small search segment:

from bisect import bisect

def k_nearest(k, center, sorted_data):
    'Return *k* members of *sorted_data* nearest to *center*'
    i = bisect(sorted_data, center)
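    # Each of the k nearest values must lie within k positions of the
    # insertion point, so a window of at most 2k candidates suffices.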
    segment = sorted_data[max(i-k, 0) : i+k]
    return nsmallest(k, segment, key=lambda x: abs(x - center))

For example:

>>> s.sort()
>>> k_nearest(3, 6.5, s)
[6, 7, 5]
>>> k_nearest(3, 0.5, s)
[1, 2, 3]
>>> k_nearest(3, 4.5, s)    
[4, 5, 3]
>>> k_nearest(3, 5.0, s)
[5, 4, 6]
– Raymond Hettinger
  • Then, why does the documentation state: "The latter two functions perform best for smaller values of n. For larger values, it is more efficient to use the `sorted()` function"? To me this is equivalent to saying `nsmallest` is worse than O(nlogn) in asymptotic complexity, but happens to be more efficient for a range of small `n`s. Either the algorithm isn't as efficient as you state, or the documentation is wrong in some way. – Bakuriu Jun 09 '14 at 10:51
  • @Bakuriu: `nsmallest()` has a time complexity of O(_n_ log _k_), where _n_ is the size of the container and _k_ is the number of smallest values you are looking for. The complexity of `sorted()` is O(_n_ log _n_), but it has a better constant factor. So if _k/n_ is small compared to 1, `nsmallest()` will win. If _k_ becomes bigger compared to _n_, at some point `sorted()` will win due to the better constant. Neither Raymond nor the documentation are wrong, though the documentation could be clearer that "smaller" actually means smaller compared to the size of the collection. – Sven Marnach Jun 09 '14 at 12:15
  • @SvenMarnach: Linear time selection is possible as well, which the documentation neglects. – Neil G Jun 09 '14 at 17:27
  • @NeilG: `nsmallest()` returns the _k_ smallest elements in sorted order, and the fastest algorithm to do this is O(_n_ log _k_). I believe it's still the fastest algorithm if you drop the requirement that the result is sorted. You can select the kth smallest element in O(_n_), but I fail to see how you could select the k smallest elements. – Sven Marnach Jun 11 '14 at 22:07
  • @SvenMarnach: No, the fastest algorithm to do that is O(n). You can select the kth smallest element in linear time, and you can compare the elements in the original array against it in linear time. The total algorithm is therefore linear time. – Neil G Jun 11 '14 at 22:33
  • @RaymondHettinger: Yes, but as k grows, then the selection algorithm will beat the heap algorithm. It is linear time. (I think the constant is 6, but I forget.) – Neil G Jun 12 '14 at 02:12
  • @RaymondHettinger: Please let's not get into a long discussion where we're talking past each other. We can both agree that selection is better when k is large compared with n, say n/4. The heap algorithm is better when k is small. There are multiple selection algorithms including [Rivest's](http://en.wikipedia.org/wiki/Floyd%E2%80%93Rivest_algorithm), [quickselect](http://en.wikipedia.org/wiki/Quickselect), and [median-of-medians](http://en.wikipedia.org/wiki/Median_of_medians). – Neil G Jun 12 '14 at 21:14
  • I believe median-of-medians does about 10n total comparisons (with good locality of reference and parallelizability). Quickselect and Rivest's are fast on average. If you want the result in sorted order, then you need to pay klogk regardless of which algorithm you choose, but sorting has nothing to do with the posted question. – Neil G Jun 12 '14 at 21:21
  • And just to be clear, only quickselect has a "horrific worst-case". The other two algorithms have a linear or n+root(n) worst case. None of these three selection algorithms have "catastrophic cache performance" as they all have excellent locality of reference. If anything, the heap is worst in this regard as bubbling up an element to the top (which is rare in the average case, sure) hits O(logk) cache pages. – Neil G Jun 12 '14 at 21:34
  • If I need to do lots of queries (something like `[nsmallest(3, s, key=lambda x: abs(x-v)) for v in arr]`) is there a way to reuse the previous results to speed this up? I have about 1 million elements in s and 1 million elements in v – Phylliida Mar 18 '18 at 20:46
  • @Phylliida I updated the answer to include this case as well. – Raymond Hettinger Mar 19 '18 at 00:19
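A quick way to see the tradeoff Sven describes above (a minimal timing sketch, not from the original thread; absolute numbers vary by machine):

from heapq import nsmallest
from timeit import timeit
import random

data = [random.random() for _ in range(10**6)]
key = lambda x: abs(x - 0.5)

# k small relative to n: nsmallest's O(n log k) beats sorted's O(n log n)
print(timeit(lambda: nsmallest(3, data, key=key), number=10))
print(timeit(lambda: sorted(data, key=key)[:3], number=10))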

You could compute the distances and sort:

[n for d, n in sorted((abs(x-myNumber), x) for x in myList)[:k]]

This does the following:

  1. Create a sequence of tuples (d, x) where d is the distance to your target
  2. Sort that sequence (tuples compare by distance first, then by value)
  3. Select the first k elements of the sorted list
  4. Extract just the number values from the result, discarding the distances
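For example, with the list from the question:

myList = [1, 2, 3, 4, 5, 6, 7]
myNumber = 6.5
k = 3

print([n for d, n in sorted((abs(x - myNumber), x) for x in myList)[:k]])
# [6, 7, 5] -- the tie between 6 and 7 (both 0.5 away) is broken by value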
– Greg Hewgill

Both answers were good, and Greg was right: Raymond's answer is more high-level and easier to implement, but I built upon Greg's answer because it was easier to adapt to my needs.

In case anyone is searching for a way to find the n closest values in a list of dicts:

My dict looks like this, where npi is just an identifier that I need along with the value:

mydict = {u'fnpi': u'1982650024',
 u'snpi': {u'npi': u'1932190360', u'value': 2672},
 u'snpis': [{u'npi': u'1831289255', u'value': 20},
  {u'npi': u'1831139799', u'value': 20},
  {u'npi': u'1386686137', u'value': 37},
  {u'npi': u'1457355257', u'value': 45},
  {u'npi': u'1427043645', u'value': 53},
  {u'npi': u'1477548675', u'value': 53},
  {u'npi': u'1851351514', u'value': 57},
  {u'npi': u'1366446171', u'value': 60},
  {u'npi': u'1568460640', u'value': 75},
  {u'npi': u'1326046673', u'value': 109},
  {u'npi': u'1548281124', u'value': 196},
  {u'npi': u'1912989989', u'value': 232},
  {u'npi': u'1336147685', u'value': 284},
  {u'npi': u'1801894142', u'value': 497},
  {u'npi': u'1538182779', u'value': 995},
  {u'npi': u'1932190360', u'value': 2672},
  {u'npi': u'1114020336', u'value': 3264}]}

value = mydict['snpi']['value'] #value i'm working with below
npi = mydict['snpi']['npi'] #npi (identifier) i'm working with below
snpis = mydict['snpis'] #dict i'm working with below

To get a list of [id, value] pairs (not just a list of values), I use this:

[[id,val] for diff, val, id in sorted((abs(x['value']-value), x['value'], x['npi']) for x in snpis)[:6]]

Which produces this:

[[u'1932190360', 2672],
 [u'1114020336', 3264],
 [u'1538182779', 995],
 [u'1801894142', 497],
 [u'1336147685', 284],
 [u'1912989989', 232]]

EDIT

I actually found it pretty easy to adapt Raymond's answer too, if you're dealing with a list of dicts (or a list of lists).

from heapq import nsmallest
[[i['npi'], i['value']] for i in nsmallest(6, snpis, key=lambda x: abs(x['value']-value))]

This will produce the same as the above output.

And this:

nsmallest(6, snpis, key=lambda x: abs(x['value'] - value))

will produce a list of dicts instead.
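That is, the same six entries as the original dicts:

[{u'npi': u'1932190360', u'value': 2672},
 {u'npi': u'1114020336', u'value': 3264},
 {u'npi': u'1538182779', u'value': 995},
 {u'npi': u'1801894142', u'value': 497},
 {u'npi': u'1336147685', u'value': 284},
 {u'npi': u'1912989989', u'value': 232}]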

– tmthyjames

For those who want the indices instead:

import numpy as np

def find_nearest_index(array, value, k):
    # Indices of the k values with the smallest absolute distance,
    # ordered nearest first
    array = np.array(array)
    return np.argsort(np.abs(array - value))[:k]

Example:

find_nearest_index([-3,0,1,2,4,5], 0.2, 4)

# array([1, 2, 3, 0], dtype=int64)
# distance = [3.20 0.20 0.80 1.80 3.80 4.80]
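If the k indices don't need to come back ordered by distance, np.argpartition does a partial selection in O(n) instead of a full sort (a sketch along the same lines; the function name is my own):

def find_nearest_index_unordered(array, value, k):
    # The first k indices hold the k smallest distances, in no
    # guaranteed order (requires k < len(array))
    array = np.array(array)
    return np.argpartition(np.abs(array - value), k)[:k]

find_nearest_index_unordered([-3, 0, 1, 2, 4, 5], 0.2, 4)
# e.g. array([1, 2, 3, 0]) -- same indices, order not guaranteed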
– Muhammad Yasirroni