
Let `b` be a dict with some values:

 b = {}
 b[90, 1] = 100; b[90, 55] = 101; b[90, 127] = 102
 b[70, 1] = 40;  b[70, 45] = 41;  b[70, 107] = 42

How can I, in one pass, populate the dict's missing values with the value of their nearest neighbour among the existing keys, for 0 <= i <= 127, 0 <= j <= 127? (This gives 128 * 128 = 16384 keys in the dict, which is fine for my application.)

As an example, I would like b[73, 40] = b[70, 45] = 41, i.e. the nearest neighbour in the 2D plane.

Here is what I tried:

for i in range(128):
    for j in range(128):
        closest_key = min(b.keys(), key=lambda c: (c[0] - i) ** 2 + (c[1] - j) ** 2)
        b[i, j] = b[closest_key]

But it is very slow, probably because there are 128 * 128 iterations, and inside each one we loop over all the dict's keys again to compute the distances!

How can we populate a dict with its missing values, using nearest neighbour, in a more efficient way?

Basj

3 Answers


You are searching inside b for the closest key, but b doesn't contain only the original keys: it also contains the new keys you are inserting at each iteration. It will be faster, and more correct, to search only among the initial keys:

initial_keys = set(b.keys())
for i in range(128):
    for j in range(128):
        if (i, j) not in initial_keys:
            closest_key = min(
                initial_keys, key=lambda c: (c[0] - i) ** 2 + (c[1] - j) ** 2
            )
            b[i, j] = b[closest_key]

This way the running time of the algorithm drops from O(n^4) to O(k * n^2), where n is the grid dimension and k is the number of initial keys. For n = 128 and k = 2000, that is about 3.3 * 10^7 distance evaluations instead of up to 2.7 * 10^8.

EDIT:

You can use numpy for a big speedup:

import numpy as np

initial_keys = list(b)  # fix the key order once
x = np.array([k[0] for k in initial_keys])
y = np.array([k[1] for k in initial_keys])
key_set = set(initial_keys)
for i in range(128):
    for j in range(128):
        if (i, j) not in key_set:
            # vectorised squared distances to all initial keys at once
            argmin = np.argmin((x - i) ** 2 + (y - j) ** 2)
            b[i, j] = b[initial_keys[argmin]]
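
If the Python-level double loop is still too slow, the whole grid can be handled in one broadcasting step. This is a sketch, not part of the original answer, and it assumes only numpy; note that the intermediate distance matrix has shape (n^2, k), so for many initial keys (e.g. k = 2000 gives 16384 * 2000 entries) you may want to chunk the grid or switch to a k-d tree:

import numpy as np

items = list(b.items())                 # snapshot the initial keys/values
keys = np.array([k for k, _ in items])  # shape (k, 2)
vals = np.array([v for _, v in items])  # shape (k,)

ii, jj = np.meshgrid(np.arange(128), np.arange(128), indexing="ij")
gi, gj = ii.ravel(), jj.ravel()         # flat grid coordinates

# squared distances from every grid point to every initial key, shape (n*n, k)
d2 = (gi[:, None] - keys[:, 0]) ** 2 + (gj[:, None] - keys[:, 1]) ** 2
nearest = d2.argmin(axis=1)             # index of the closest initial key

for i, j, v in zip(gi, gj, vals[nearest]):
    b[int(i), int(j)] = int(v)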
JuniorCompressor
  • If you have only a few initial values, it's an O(n^2) solution. – JuniorCompressor Mar 17 '15 at 14:44
  • I have something like k=2000 initial values and at the end there will be 16384 values b[i,j]... I did some tests on my Raspberry Pi; O(k*n²) is still too big (it runs for 4 seconds, and it should be < 100 ms) – Basj Mar 17 '15 at 15:17

A dictionary is absolutely inappropriate for this kind of use - unless you're happy with O(n) lookups (and then, using a list would be clearer). There is a class of hashing functions (locality-sensitive hashing) that could, conceivably, be used to implement an appropriate "dictionary" - but Python's dict is definitely not up to the task.

If you need proper performance, you'll need to use another data structure. The simplest one would be a k-d tree; there's an implementation in scipy (scipy.spatial.cKDTree).
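
For example, with scipy's cKDTree (a sketch, assuming scipy.spatial and the b from the question), you build the tree once over the initial keys and answer all 128 * 128 queries in a single batched call:

import numpy as np
from scipy.spatial import cKDTree

initial_keys = list(b)                  # the seed keys, order fixed
tree = cKDTree(np.array(initial_keys))

grid = [(i, j) for i in range(128) for j in range(128)]
_, idx = tree.query(grid)               # batched nearest-neighbour lookup
for point, k in zip(grid, idx):
    b[point] = b[initial_keys[k]]

Each query costs roughly O(log k) instead of the O(k) linear scan, which is exactly the improvement the k-d tree buys you.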

You may want to review the Wikipedia article on nearest neighbor search.


Of course, you can use a dictionary as a cache if you're querying the same values repeatedly (as in Raydel Miranda's answer). But use it as a cache - not for storing/querying your actual data!

loopbackbee
  • 21,962
  • 10
  • 62
  • 97
  • Once the dict is populated with `b[i,j]` *for all* values `0 <= i <= 127`, `0 <= j <= 127`, there will be no more problems. So the only problem is to populate the dict, in one pass – Basj Mar 17 '15 at 14:09
  • @Basj indeed - as you've found, the problem is that it's not *practical* to populate the `dict` with **all** values - because that grows exponentially in the number of dimensions. You're trying to solve an `O(n)` problem in `O(1)` by performing an additional `O(n^2)` step :) – loopbackbee Mar 17 '15 at 14:26
  • Maybe you're right @goncalopp, but indeed, I prefer 1) having `O(n^2)` **once** at the start of my program and then `O(1)` each time I access the data (hundreds of times per second), over 2) nothing at load time and then `O(n)` each time I access the data. – Basj Mar 17 '15 at 14:37
  • @Basj that sounds appropriate. In that case, the solution you presented in your question is as good as you're going to get. Just notice that, if you implement a cache, you don't need to populate the `dict` with all values, and you'll still get `O(1)` performance if you get repeated queries (if you don't get repeated queries, there's no point at all in building the dictionary in the first place! unless you have hard timing requirements, as in a real-time system) – loopbackbee Mar 17 '15 at 15:01

You could calculate values on demand and cache the results. The advantage of this approach is that a point you never query is never computed.

Full example:

b = {}  
b[90, 1] = 100
b[90, 55] = 101
b[90, 127] = 102
b[70, 1] = 40
b[70, 45] = 41
b[70, 107] = 42

class NeighbourDist:

  def __init__(self, source_dict):
    # Original dict.
    self.__source_dict = source_dict
    # Dict used for cache.
    self.__cache_dict = {}


  def __calculate_distance(self, x0, x1, y0, y1):
    """ Calculate distance beetwen two points. """
    dx = x1 - x0
    dy = y1 - y0
    d = (dx**2 + dy**2)**0.5
    return d

  def __getitem__(self, key):
    """
    Look up the key in the cache dict; if it has not been calculated
    yet, compute it, store it in __cache_dict and return it.
    """
    cached = self.__cache_dict.get(key)
    if cached is not None:
      return cached
    else:
      x0, y0 = key
      min_n = 0
      min_ = float("inf")  # smallest distance seen so far
      for (x1, y1) in self.__source_dict.keys():
        dist = self.__calculate_distance(x0, x1, y0, y1)
        if min_ > dist:
          min_ = dist
          min_n = self.__source_dict[x1, y1]
      self.__cache_dict[key] = min_n
      return min_n

if __name__ == '__main__':
  d = NeighbourDist(b)
  print(d[73, 40]) # >>> 41
  print(d[73, 40]) # >>> 41, Second time the result is obtained from the cached dict.
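
The same pattern can be written more compactly with functools.lru_cache (a sketch, assuming Python 3 and the b above; nearest_value is a hypothetical helper name, not part of the answer):

from functools import lru_cache

@lru_cache(maxsize=None)
def nearest_value(i, j):
    # Linear scan over the initial keys; memoisation makes repeated
    # queries O(1), like the hand-rolled cache above.
    key = min(b, key=lambda c: (c[0] - i) ** 2 + (c[1] - j) ** 2)
    return b[key]

print(nearest_value(73, 40))  # 41
print(nearest_value(73, 40))  # 41, served from the cache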
Raydel Miranda