8

I have a list of length 2016, but only 242 contain data, the rest is set to None. My aim is to interpolate between the values to fill all gaps with a simple form of IDW (inverse distance weighting). So the task of my script is:

  • Iterate over all items of myList
  • If the myList contains a value (that is not None), just copy it
  • If you find a "None" in myList, get position/value of the left and right neighbor by calculating the distance to all items in myList
  • Calculate an interpolated value for the gap from both neighbors (the farer they are away, the less weight they shall get)

Assume we have a smaller list of only 14 items (5 valid ones):

myList = [26, None, None, None, 31, None, None, 58, None, 42, None, None, None, 79]
resultList = [None] * len(myList)

for i in range(len(myList):
    if not myList[i] is None:
        resultList[i] = myList[i]
    else:
        distance = [i - j for j in range(len(myList)) if not myList[j] is None]
        neighbors = min([n for n in dist if n>0]), max([n for n in dist if n<0])
        # rest of the interpolation (not important for my question):
        neighbors_c = [(1/float(n))**2 for n in neighbors]
        c_sum = sum(neighbors_c)
        neighbors_c = [n/c_sum for n in neighbors_c]
        resultList = myList[i-neighbors[0]]*neighbors_c[0] + myList[i-neighbors[1]]*neighbors_c[1]

I am doing that for many many data sets. I found out that this method takes about 0.59sec per data set. What bothers me is the fact that my list is all sorted, but I will only need 2 values from it. So 99% of the distances are calculated for nothing. That led me to attempt two: stop the iteration after i-j becomes negative, because then obviously it ran into the closest values:

So instead of the list comprehension:

distance = [i - j for j in range(len(myList)) if not myList[j] is None]

I do a proper for-loop which I quit after distances pass zero and thus become larger again:

dist = []
for j in range(len(myList)):
    if not myList[j] is None:
        dist.append(i-j)
        if i-j < 0: break

With this method I was able to get down to 0.38sec per data set. When iterating over all items in myList, this second method is quick at the beginning (item is hit after 2nd, 3rd, 4th, ... loop and quit immediately), but there is no improvement for the last items, since iteration always starts at j=0.

I wonder if you can think of any quicker way to find the two neighbors of a specific number in a data set, without having to check ALL distances and only taking the largest negative and smalles positive one.

Also, I'm quite new to python, so please let me know if you find other un-pythonic expressions in my script. Thank you guys very much!

offeltoffel
  • 2,691
  • 2
  • 21
  • 35
  • 1
    Numpy delivers some nearest-neighbor-algorithms you might have a look at [them](http://docs.scipy.org/doc/scipy/reference/spatial.html#nearest-neighbor-queries) – albert Dec 14 '15 at 12:26
  • 1
    And there is `pandas.Series.interpolate` function that does all of it. – pacholik Dec 14 '15 at 12:43
  • How about any answer to the question [Inverse Distance Weighted (IDW) Interpolation with Python](http://stackoverflow.com/questions/3104781/inverse-distance-weighted-idw-interpolation-with-python)? – ojdo Dec 14 '15 at 12:56
  • Why not do one pass to find all the indices of non-None values, then a second pass to fill everything in? – Mad Physicist Dec 14 '15 at 12:59
  • albert: Nearest neighbor interpolation is something different. But I did stumble across the KDTree function which I was not able to call yet. Couldn't find an example I understand so far, but I might keep on looking if the other solutions don't do the job. pacholik: Couldn't install the pandas tool kit for some reason. Seems to interfer with numpy ojdo: classic IDW interpolation takes all other data points into account. I'd like to restrict it to the two neighbors. I call it IDW, but in fact it's something different I'm afraid. Mad: I did exactly that, didn't I? – offeltoffel Dec 14 '15 at 13:23
  • @albert: I'm still stuck on the cKDTree function. Seems, I can't find any example on the net that I properly understand. Given a list (myList), how can I use cKDTree to return the two closest neighbors? – offeltoffel Dec 14 '15 at 14:40
  • Unfortunately, I never used this function but have heard from this. Sorry for that... – albert Dec 14 '15 at 14:44

1 Answers1

2

UPDATE: Here's how to do it with numpy interp:

import numpy as np

myList = [26, None, None, None, 31, None, None, 58, None, 42, None, None, None, 79]

values = [(i, val) for i, val in enumerate(myList) if val is not None]

xp, fp = zip(*values)

print(xp) # (0, 4, 7, 9, 13)
print(fp) # (26, 31, 58, 42, 79)

result = np.interp(np.arange(len(myList)), xp, fp)
print(result) # [ 26.    27.25  28.5   29.75  31.    40.    49.    58.    50.    42.    51.25  60.5   69.75  79.  ]

Original Post:

As others have already suggested, your be best off with using some interpolation already implemented in numpy or pandas.

However for completeness here's a a quick solution I came up with:

myList = [26, None, None, None, 31, None, None, 58, None, 42, None, None, None, 79]

resultList = []

# first lets split the list into sublists that group the numbers
# and the Nones into groups
for i, item in enumerate(myList):
    if i == 0:
        resultList.append([item])
    else:
        if type(resultList[-1][-1]) == type(item):
            resultList[-1].append(item)
        else:
            resultList.append([item])

print(resultList) # [[26], [None, None, None], [31], [None, None], [58], [None], [42], [None, None, None], [79]]

# now lets interpolate the sublists that contain Nones
for i, item in enumerate(resultList):
    if item[0] is not None:
        continue

    # this is a bit problematic, what do we do if we have a None at the beginning or at the end?
    if i == 0 or i + 1 == len(resultList):
        continue

    prev_item = resultList[i - 1][-1]
    next_item = resultList[i + 1][0]

    difference = next_item - prev_item
    item_length = len(item) + 1

    for j, none_item in enumerate(item):
        item[j] = prev_item + float(j + 1) / item_length * difference

# flatten the list back
resultList = [item for sublist in resultList for item in sublist]

print(resultList) # [26, 27.25, 28.5, 29.75, 31, 40.0, 49.0, 58, 50.0, 42, 51.25, 60.5, 69.75, 79]

I suggest you use this only for learning or for simple cases as it does not handle cases where you have lists beginning or ending with None

mirosval
  • 6,671
  • 3
  • 32
  • 46
  • Thanks for providing your two answers! The interp. tool seems to be an easy way to interpolate my data sets, but it's only linear. I need a method that takes the distances into account and uses quadratic weights. A classical IDW-approach would take way too long, hence I wanted to implement my own idea. The upper solution I need to have a closer look at. At first glance it doesn't look like it was going to be quicker, but maybe I missed something important there. No worries about first or last item being "None" - I made sure that this would never happen. – offeltoffel Dec 14 '15 at 13:16
  • 1
    Right, with the second piece you can implement the interpolation yourself, just edit the inner for loop :) – mirosval Dec 14 '15 at 13:19