
I'd like to compare two differently spaced datasets in Python. I always want to find the closest (nearest-neighbour) match and regrid the data; see this example:

Dataset A:

ALTITUDE[m]   VALUE
1.            a1
2.            a2
3.            a3
4.            a4

Dataset B:

ALTITUDE[m]   VALUE
0.7           b1
0.9           b2
1.7           b3
2.            b4
2.4           b5
2.9           b6
3.1           b7
3.2           b8
3.9           b9
4.1           b10

The ai and bi fields contain double-precision numbers, but also NaN entries.

I'd like to transform dataset B to the altitude grid of dataset A, but since dataset A contains fewer altitude levels than dataset B, I'd like to average the values that map onto the same level:

ALTITUDE[m]   VALUE
1.            median(b1,b2)
2.            median(b3,b4,b5)
3.            median(b6,b7,b8)
4.            median(b9,b10)

i.e. for each altitude level of A, the closest altitude levels of B have been found and averaged over.

Conversely, if I want to match dataset A to the grid of dataset B, dataset A should look like this (nearest neighbour):

ALTITUDE[m]   VALUE
0.7           a1
0.9           a1
1.7           a2
2.            a2
2.4           a2
2.9           a3
3.1           a3
3.2           a3
3.9           a4
4.1           a4

Maybe this even has a name (I imagine it is a common problem), but I don't know it and thus cannot search for it. I believe there is an efficient way of doing this, apart from the obvious solution of coding it myself (but I'm afraid that wouldn't be efficient and that I'd introduce many bugs).

Preferably using numpy.
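
For concreteness, a naive reference implementation of what I mean (certainly not efficient, and the helper names are just made up for illustration) might look like this; it is only meant to pin down the semantics:

import numpy as np

def nearest_index(grid, x):
    # index of the grid point closest to x
    return int(np.argmin(np.abs(np.asarray(grid) - x)))

def regrid_median(src_alt, src_val, dst_alt):
    # case 1 (B -> A): group every source point with its nearest destination
    # level, then take the median of each group, ignoring NaNs
    groups = [[] for _ in dst_alt]
    for alt, val in zip(src_alt, src_val):
        groups[nearest_index(dst_alt, alt)].append(val)
    return [np.nanmedian(g) if g else np.nan for g in groups]

def regrid_nearest(src_alt, src_val, dst_alt):
    # case 2 (A -> B): for every destination level, take the value of the
    # nearest source level
    return [src_val[nearest_index(src_alt, alt)] for alt in dst_alt]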

EDIT: Thanks to all four contributors for your input. I learned a bit, and I apologize for not asking very clearly; I was myself still in the process of understanding the problem. Your answers pointed me towards interp1d, and this answer allowed me to adapt it to my needs. I will post the result shortly. I can accept only one answer, but any of them would do.

Sebastian
  • Could you clarify what you mean by 'closest'? Do you want the two (or three) closest points? Or anything within a certain range? The nearest neighbours above and below the value perhaps? – Daan Jun 10 '11 at 15:03
  • I would not call b1 the nearest neighbour of a1 since b2 is closer, yet you list both b1 and b2 as the nearest neighbour of a1. What is your definition of nearest neighbour? – Daan Jun 10 '11 at 15:16
  • @Daan, you are right, I wasn't very precise. If there are more altitude points on the target grid, I'd like to simply select the nearest neighbour (second case). If the target grid contains fewer altitude points than the source grid (first case), I'd like to take the median. – Sebastian Jun 14 '11 at 07:30
  • I'm sorry I can't help you then. I understand the second case, but I still do not see what logic you use to come up with the arguments to the median function in the first case (except if we define a range, but you have indicated that that is not correct). Hopefully someone else does. – Daan Jun 14 '11 at 07:58

5 Answers


Two assumptions: 1: You are not looking for the nearest neighbour, but for all altitudes within some range. So, let's say for a1 you want all bn that are within 0.5 of a1 (giving you b1 and b2 as per your example). I would define the 'nearest neighbour' as something different.

2: You don't want NaN counted in your medians (numpy's median effectively treats them as infinitely large, which seems odd to me). As per your suggestion we thus use nanmedian from scipy.stats.

I would do the following:

from numpy import *
from pylab import *
from scipy.stats import nanmedian

A_Alt = array([1,2,3,4])
A_Val = array([.33, .5, .6, .8])
B_Alt = array([.7, 0.9, 1.7, 2., 2.4, 2.9, 3.1, 3.2, 3.9, 4.1])
B_Val = array([.3, NaN, .8, .6, .7, .4, .3, NaN, .99, 1.3])

rng = .5  # half-width of the altitude window (avoid shadowing the builtin `range`)

B_Agrid = [nanmedian(B_Val[abs(B_Alt - k) < rng]).item() for k in A_Alt]
A_Bgrid = [nanmedian(A_Val[abs(A_Alt - k) < rng]).item() for k in B_Alt]

We find all indices where the distance of B_Alt to k in A_Alt is less than a specified range. Then we take the median of those B_Val. The same works for A_Bgrid with the results as requested.

==Edit==

A different assumption as to your nearest neighbours: let's take the nearest neighbour to be the entry (or entries, in case of a tie) with the smallest absolute altitude difference that does not have NaN as a value. N.B. these results do not match your example, as b1 would not be the nearest neighbour of a1 on account of b2 being closer.

Under this assumption, the following code should work:

from numpy import *
from pylab import *
from scipy.stats import nanmedian

A_Alt = array([1,2,3,4])
A_Val = array([.33, .5, .6, .8])
B_Alt = array([.7, 0.9, 1.7, 2., 2.4, 2.9, 3.1, 3.2, 3.9, 4.1])
B_Val = array([.3, NaN, .8, .6, .7, .4, .3, NaN, .99, 1.3])

def ReGridMedian(AltIn, ValIn, AltOut):
    # keep only the finite (non-NaN) values
    part = isfinite(ValIn)
    # absolute altitude differences to every output level
    q = [abs(AltIn[part]-k) for k in AltOut]
    # indices of all entries tied for the minimum distance (within machine precision)
    q = [nonzero(abs(k - min(k))<3*finfo(k.dtype).eps) for k in q]
    # median of the corresponding values
    q = [ValIn[part][k] for k in q]
    return [median(k) for k in q]

B_Agrid = ReGridMedian(B_Alt, B_Val, A_Alt)    
A_Bgrid = ReGridMedian(A_Alt, A_Val, B_Alt)

I hacked together something that checks if two values are identical within machine precision, but I assume there's a better way of doing that. In any case, we first keep only the values that are not NaN, then find the closest match, check for tied minima, and take the median of those values.
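
On newer numpy versions, `np.isclose` would be a more readable way to express that machine-precision check; a sketch of the same function using it (still my interpretation of 'nearest neighbour', not necessarily yours):

import numpy as np

def ReGridMedianClose(AltIn, ValIn, AltOut):
    # keep only the finite (non-NaN) values
    ok = np.isfinite(ValIn)
    AltIn, ValIn = AltIn[ok], ValIn[ok]
    out = []
    for k in AltOut:
        d = np.abs(AltIn - k)
        # all entries tied for the minimum distance, within floating-point tolerance
        nearest = np.isclose(d, d.min())
        out.append(np.median(ValIn[nearest]))
    return out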

====

Does this cover your question, or are my assumptions incorrect?

Daan
  • try [`nanmedian`](http://help.scilab.org/docs/5.3.2/en_US/nanmedian.html): For a vector or a matrix `x`, `[m]=nanmedian(x)` returns in the vector m the median of the values (ignoring the NANs) of vector `x`. – Sebastian Jun 10 '11 at 15:10
  • definitely a good solution, nevertheless I'd like to solve this without a range, but the nearest neighbour. – Sebastian Jun 10 '11 at 15:13
  • I've changed it to a nanmedian, but as I have commented on your question, I can not come up with an interpretation of nearest neighbour that matches your examples. I may very well be misinterpreting your question. – Daan Jun 10 '11 at 15:21

Have a look at numpy.interp:

http://docs.scipy.org/doc/numpy/reference/generated/numpy.interp.html

(EDIT: numpy.interp only provides linear interpolation, which, evidently, is not what the OP is looking for. Instead use the scipy methods such as interp1d with kind='nearest'.)

http://docs.scipy.org/doc/scipy/reference/interpolate.html

What it sounds like you want to do is use the altitude points of one data set to interpolate the values of the other. This can be done pretty easily with either the numpy method or one of the scipy interpolation methods.
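
For the second case (nearest-neighbour lookup of A's values on B's grid), a minimal sketch with interp1d might look like this; the `bounds_error`/`fill_value` arguments are just one way of handling altitudes outside A's range:

import numpy as np
from scipy.interpolate import interp1d

A_Alt = np.array([1., 2., 3., 4.])
A_Val = np.array([.33, .5, .6, .8])
B_Alt = np.array([.7, .9, 1.7, 2., 2.4, 2.9, 3.1, 3.2, 3.9, 4.1])

# nearest-neighbour lookup; altitudes outside A's range come back as NaN
f = interp1d(A_Alt, A_Val, kind='nearest', bounds_error=False, fill_value=np.nan)
print(f(B_Alt))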

JoshAdel
  • Hi Josh, thanks for the quick answer. No, I do not want to do interpolation, but **re-gridding**. The values do not need to be interpolated, just averaged if there are more altitude points in the *starting grid* than the *target grid* – Sebastian Jun 10 '11 at 13:56
  • but actually: the function [griddata](http://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.griddata.html#scipy.interpolate.griddata) looks interesting.... I'll need to review it myself – Sebastian Jun 10 '11 at 13:59
  • Within the scipy interpolation module, nearest-neighbor interpolation is one of the options in all of the methods, so that is probably your best bet. The numpy `interp` method only does linear interpolation, so perhaps that isn't the best suggestion. – JoshAdel Jun 10 '11 at 14:06
  • I still don't quite get the syntax of `griddata`. Will try more, but for now it seems that it cannot find *edge* points. – Sebastian Jun 10 '11 at 15:15
  • `griddata` won't do what you need -- it does interpolation, but not smoothing (which is what you effectively want to do here). – pv. Jun 10 '11 at 15:19
  • this helped me the most, thanks @Josh. I used `interp1d` and [this answer](http://stackoverflow.com/questions/2745329/how-to-make-scipy-interpolate-give-a-an-extrapolated-result-beyond-the-input-rang/2745496#2745496) to extrapolate (find nearest neighbour outside of the range) – Sebastian Jun 14 '11 at 09:34
  • @Sebastian: Could you perhaps post your solution? (preferably as a complete example with some sample data that we can copy/paste/execute) – Daan Jun 14 '11 at 10:04
  • @Daan, please see below. As I realized that there is not a single conversion capable of both A->B and B->A, I will just wrap if/else around it. – Sebastian Jun 14 '11 at 11:16
  • OK thanks. What about the other case? Going from many altitudes to only a few? I'm still hoping to figure out what your desired implementation is :) – Daan Jun 14 '11 at 12:21

This is not exactly the answer you were looking for, but this is my 50c answer...

A = {1:'a1',2:'a2',3:'a3',4:'a4'}
B = {0.7:'b1',0.9:'b2',1.7:'b3',2:'b4', 2.4:'b5'}

C = {} # result

# find altitude in A that is the closest to altitude in B
def findAltitude( altB,A):
    toto = [ ((alt-altB)**2,alt) for alt in A.keys() ]
    toto.sort()
    return toto[0][1]

#iter on each altitude of B
for altB,valueB in B.iteritems():
    altC = findAltitude( altB,A)
    if altC in C:
        C[altC].append(valueB)
    else:
        C[altC] = [valueB,]

# then do the median operation
#for altC, valueC in C.iteritems():
#   C[altC] = median(valueC)  # where median is your median function

print C

It is NOT the best solution at all (especially if you have a lot of values), but it was the fastest to write...

In fact, it depends on how your data are stored. A dictionary is not the best choice.

It is more interesting/clever to use the fact that your altitudes are sorted. You should provide more details on how your data are stored (numpy arrays?).

==== Edit ====

I still do not know how your data are stored, but let's try something more "clever", based on the fact that your altitudes are sorted.

from numpy import *
from pylab import *
from scipy.stats import nanmedian

# append val to the last group of C, or start a new group
# (depending on whether alt is already the last altitude in C)
def addto(C, val, alt):
    if C and C[-1][0] == alt:
        C[-1][1].append(val)
    else:
        C.append((alt, [val,]))



# values
A_Alt = array([1,2,3,4])
A_Val = array([.33, .5, .6, .8])
B_Alt = array([.7, 0.9, 1.7, 2., 2.4, 2.9, 3.1, 3.2, 3.9, 4.1])
B_Val = array([.3, NaN, .8, .6, .7, .4, .3, NaN, .99, 1.3])

#intermediate list of tuple (altitude, list_of_values)
C= []

#iterator on A
Aa = iter(A_Alt)
ainf = Aa.next()
asup = Aa.next()  # two first values of A_Alt

#iterator on B
Ba = iter(B_Alt)
Bv = iter(B_Val)

# regrid
try:
    while True:
        altB = Ba.next()
        valB = Bv.next()

        # find ainf and asup in A_Alt such that ainf < altB < asup
        while asup<altB:
            try:
                ainf,asup = asup, Aa.next()
            except StopIteration:
                break

        # find closest
        if abs(ainf-altB)<=abs(asup-altB):
            addto(C, valB, ainf)
        else:
            addto(C, valB, asup)

except StopIteration:
    pass

# do the median
res = [ nanmedian(k[1]) for k in C ] 

print res

The idea is then to iterate over the two vectors/lists of altitudes, and for each altitude of B, find the two altitudes of A that surround it. Then, it is easy to find the closest...

This is less readable than Daan's solution, but it should be more efficient (linear in the size of your data).

You just need to adapt it if your data are not stored like that.
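
If the altitudes are already sorted numpy arrays, the same nearest-neighbour assignment can also be written without an explicit Python loop; a rough sketch using searchsorted (the helper name is mine):

import numpy as np

def nearest_grid_index(a_alt, b_alt):
    # for each altitude in b_alt, index of the nearest altitude in the sorted a_alt
    pos = np.searchsorted(a_alt, b_alt)        # insertion positions
    pos = np.clip(pos, 1, len(a_alt) - 1)      # keep both neighbours in range
    left, right = a_alt[pos - 1], a_alt[pos]
    return np.where(b_alt - left <= right - b_alt, pos - 1, pos)

The resulting index array tells you, for every B value, which A level it belongs to, and the medians can then be taken per group.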

ThibThib

Here's one way:

import numpy as np

def regrid_op(x, y, xi, op=np.median):
    x, y, xi = np.atleast_1d(x, y, xi)
    if (x.ndim, y.ndim, xi.ndim) != (1, 1, 1):
        raise ValueError("works only for 1D data")

    yi = np.zeros(xi.shape, dtype=y.dtype)
    yi.fill(np.nan)

    # sort data
    j = np.argsort(x)
    x = x[j]
    y = y[j]

    # group items by nearest neighbour
    x0s = np.r_[xi, np.inf]
    xc = .5*(x0s[:-1] + x0s[1:])

    j0 = 0
    for i, j1 in enumerate(np.searchsorted(x, xc)):
        print "x =", xi[i], ", y =", y[j0:j1] # print some debug info
        yi[i] = op(y[j0:j1])
        j0 = j1

    return yi

xi = np.array([1, 2, 3, 4])
x = np.array([0.7, 0.9, 1.7, 2.0, 2.4, 2.9, 3.1, 3.2, 3.9, 4.1])
y = np.array([1,   2,   3,   4,   5,   6,   7,   8,   9,   10.])

print regrid_op(x, y, xi)

I don't see a way to vectorize the loop over items in the xi array, so this should be efficient provided the number of points in grid A is not too large.

EDIT: This also assumes that the points in xi are sorted.
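
If the Python-level loop ever becomes a bottleneck, the groups can at least be built in one go with np.split (the op per group still runs in Python); a rough sketch under the same assumptions (x and xi sorted, function name is mine):

import numpy as np

def regrid_op_split(x, y, xi, op=np.median):
    # midpoints between consecutive xi values act as bin edges
    edges = 0.5 * (xi[:-1] + xi[1:])
    # split the (sorted) source values at those edges and apply op to each group
    groups = np.split(y, np.searchsorted(x, edges))
    return np.array([op(g) if g.size else np.nan for g in groups])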

pv.

One way to cover the second case (matching dataset A to the grid of dataset B, i.e. going from few altitudes to many) is this:

Extrapolation function (from here)

from numpy import array
from scipy.interpolate import interp1d

def extrap1d(interpolator):
    xs = interpolator.x
    ys = interpolator.y

    def pointwise(x):
        if x < xs[0]:
            return ys[0]
        elif x > xs[-1]:
            return ys[-1]
        else:
            return interpolator(x)

    def ufunclike(xs):
        return array(map(pointwise, array(xs)))

    return ufunclike

Values

A_Alt = array([1,2,3,4])
A_Val = array([.33, .5, .6, .8])
B_Alt = array([.7, 0.9, 1.7, 2., 2.4, 2.9, 3.1, 3.2, 3.9, 4.1])

Actual regridding:

f_i = interp1d(A_Alt, A_Val, kind='nearest')
f_x = extrap1d(f_i)

f_x(B_Alt)

Output:

array([ 0.33,  0.33,  0.5 ,  0.5 ,  0.5 ,  0.6 ,  0.6 ,  0.6 ,  0.8 ,  0.8 ])
Sebastian
  • If you want more efficiency, you can do something like `ip = interp1d(x, y, bounds_error=False, fill_value=np.nan); yi = ip(xi); yi[xi < x[0]] = y[0]; yi[xi > x[-1]] = y[-1]`. – pv. Jun 14 '11 at 16:23
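
Spelled out, that comment's suggestion might look roughly like this (adding kind='nearest' so the in-range behaviour matches the extrap1d version above):

import numpy as np
from scipy.interpolate import interp1d

A_Alt = np.array([1., 2., 3., 4.])
A_Val = np.array([.33, .5, .6, .8])
B_Alt = np.array([.7, .9, 1.7, 2., 2.4, 2.9, 3.1, 3.2, 3.9, 4.1])

ip = interp1d(A_Alt, A_Val, kind='nearest', bounds_error=False, fill_value=np.nan)
yi = ip(B_Alt)
# clamp out-of-range altitudes to the first/last value instead of leaving NaN
yi[B_Alt < A_Alt[0]] = A_Val[0]
yi[B_Alt > A_Alt[-1]] = A_Val[-1]
print(yi)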