
I have several thousand "observations". Each observation consists of location (x,y) and sensor reading (z), see example below.

[figure: example observations — location (x, y) with sensor reading z]

I would like to fit a bi-linear surface to the x, y, and z data. I am currently doing it with this code snippet, adapted from the amroamroamro gist:

import numpy as np
import scipy.linalg

def bi2Dlinter(xdata, ydata, zdata, gridrez):
    # evaluation grid spanning the data
    X, Y = np.meshgrid(
        np.linspace(min(xdata), max(xdata), endpoint=True, num=gridrez),
        np.linspace(min(ydata), max(ydata), endpoint=True, num=gridrez))
    # least-squares fit of the plane z = C0*x + C1*y + C2
    A = np.c_[xdata, ydata, np.ones(len(zdata))]
    C, _, _, _ = scipy.linalg.lstsq(A, zdata)
    Z = C[0]*X + C[1]*Y + C[2]
    return Z
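As a sanity check, here is a minimal standalone run of the function on five made-up points lying exactly on the plane z = 2x + 3y + 1 (the function is repeated so the snippet runs on its own; the data are purely illustrative):

```python
import numpy as np
import scipy.linalg

def bi2Dlinter(xdata, ydata, zdata, gridrez):
    # evaluation grid spanning the data
    X, Y = np.meshgrid(
        np.linspace(min(xdata), max(xdata), endpoint=True, num=gridrez),
        np.linspace(min(ydata), max(ydata), endpoint=True, num=gridrez))
    # least-squares fit of the plane z = C0*x + C1*y + C2
    A = np.c_[xdata, ydata, np.ones(len(zdata))]
    C, _, _, _ = scipy.linalg.lstsq(A, zdata)
    return C[0]*X + C[1]*Y + C[2]

# five made-up observations on z = 2x + 3y + 1 (no noise)
xdata = np.array([0.0, 1.0, 0.0, 1.0, 0.5])
ydata = np.array([0.0, 0.0, 1.0, 1.0, 0.5])
zdata = 2.0*xdata + 3.0*ydata + 1.0

Z = bi2Dlinter(xdata, ydata, zdata, gridrez=4)
print(Z.shape)       # (4, 4)
print(np.median(Z))  # 3.5 — the plane at the grid centre, by symmetry
```

An exact plane is recovered exactly, so the grid median equals the plane evaluated at the centre of the grid.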


My current approach is to cycle through the rows of the DataFrame. This works fine for 1000 observations but is not usable for larger datasets.

ZZ = []
for index, row in df2.iterrows():
    x=row['x1'], row['x2'], row['x3'], row['x4'], row['x5']
    y=row['y1'], row['y2'], row['y3'], row['y4'], row['y5']
    z=row['z1'], row['z2'], row['z3'], row['z4'], row['z5']
    ZZ.append(np.median(bi2Dlinter(x,y,z,gridrez)))
df2['ZZ']=ZZ

I would be surprised if there is not a more efficient way to do this. Is there a way to vectorize the linear interpolation?

I put the code here which also generates dummy entries. Thanks


1 Answer

Looping over DataFrames like this is generally not recommended. Instead, you should vectorize your code as much as possible.

First, we create arrays from your input columns:

x_vals = df2[['x1','x2','x3','x4','x5']].values
y_vals = df2[['y1','y2','y3','y4','y5']].values
z_vals = df2[['z1','z2','z3','z4','z5']].values

Next, we need a version of bi2Dlinter that handles batched inputs. This involves changing linspace/meshgrid to work on arrays of start/stop values, and changing the least-squares call. Many scipy.linalg functions operate on stacked arrays, but as far as I'm aware lstsq doesn't. Instead, we can replicate the same functionality over a stack of systems using the SVD (np.linalg.svd broadcasts over the leading axes).

def create_ranges(start, stop, N, endpoint=True):
    # vectorized np.linspace: one evenly spaced row per start/stop pair
    divisor = N - 1 if endpoint else N
    steps = (1.0/divisor) * (stop - start)
    return steps[:, None]*np.arange(N) + start[:, None]

def linspace_nd(x,y,gridrez):
    a1 = create_ranges(x.min(axis=1), x.max(axis=1), N=gridrez, endpoint=True)
    a2 = create_ranges(y.min(axis=1), y.max(axis=1), N=gridrez, endpoint=True)
    out_shp = a1.shape + (a2.shape[1],)
    Xout = np.broadcast_to(a1[:,None,:], out_shp)
    Yout = np.broadcast_to(a2[:,:,None], out_shp)
    return Xout, Yout
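To see what the helper produces, here is a small sketch (with made-up start/stop values) checking create_ranges row by row against plain np.linspace; the helper is repeated so the snippet runs standalone:

```python
import numpy as np

def create_ranges(start, stop, N, endpoint=True):
    # vectorized np.linspace: one evenly spaced row per start/stop pair
    divisor = N - 1 if endpoint else N
    steps = (1.0/divisor) * (stop - start)
    return steps[:, None]*np.arange(N) + start[:, None]

# made-up per-row ranges
starts = np.array([0.0, 10.0, -1.0])
stops = np.array([1.0, 20.0, 1.0])

grid = create_ranges(starts, stops, N=5)
print(grid.shape)  # (3, 5): one linspace per row

# each row matches the corresponding 1-D np.linspace
for i in range(len(starts)):
    assert np.allclose(grid[i], np.linspace(starts[i], stops[i], num=5))
```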

def stacked_lstsq(L, b, rcond=1e-10):
    """
    Solve L x = b, via SVD least squares cutting of small singular values
    L is an array of shape (..., M, N) and b of shape (..., M).
    Returns x of shape (..., N)
    """
    u, s, v = np.linalg.svd(L, full_matrices=False)
    s_max = s.max(axis=-1, keepdims=True)
    s_min = rcond*s_max
    inv_s = np.zeros_like(s)
    inv_s[s >= s_min] = 1/s[s>=s_min]
    x = np.einsum('...ji,...j->...i', v,
                  inv_s * np.einsum('...ji,...j->...i', u, b.conj()))
    return np.conj(x, x)
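A quick way to convince yourself the SVD route matches a per-system solver is to compare stacked_lstsq against np.linalg.lstsq applied row by row, on random made-up systems (the function is repeated so the check runs standalone):

```python
import numpy as np

def stacked_lstsq(L, b, rcond=1e-10):
    # solve L x = b for each stacked system via SVD,
    # zeroing out singular values below rcond * s_max
    u, s, v = np.linalg.svd(L, full_matrices=False)
    inv_s = np.zeros_like(s)
    mask = s >= rcond * s.max(axis=-1, keepdims=True)
    inv_s[mask] = 1.0/s[mask]
    x = np.einsum('...ji,...j->...i', v,
                  inv_s * np.einsum('...ji,...j->...i', u, b.conj()))
    return np.conj(x, x)

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5, 3))  # 100 stacked overdetermined 5x3 systems
b = rng.normal(size=(100, 5))

x_batched = stacked_lstsq(A, b)
x_looped = np.stack([np.linalg.lstsq(A[i], b[i], rcond=None)[0]
                     for i in range(len(A))])
assert np.allclose(x_batched, x_looped)
```

The two agree to floating-point precision for well-conditioned systems; only the handling of near-zero singular values (rcond) differs between the two routes.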

def vectorized_bi2Dlinter(x_vals, y_vals, z_vals, gridrez):
    X, Y = linspace_nd(x_vals, y_vals, gridrez)
    A = np.stack((x_vals, y_vals, np.ones_like(z_vals)), axis=2)
    C = stacked_lstsq(A, z_vals)
    n_bcast = C.shape[0]
    return (C.T[0].reshape((n_bcast, 1, 1))*X
            + C.T[1].reshape((n_bcast, 1, 1))*Y
            + C.T[2].reshape((n_bcast, 1, 1)))
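Putting the pieces together, here is an end-to-end sketch (helpers repeated, rows made up at random) that checks the batched result, and the per-row medians from the original question, against a per-row np.linalg.lstsq fit:

```python
import numpy as np

def create_ranges(start, stop, N, endpoint=True):
    divisor = N - 1 if endpoint else N
    steps = (1.0/divisor) * (stop - start)
    return steps[:, None]*np.arange(N) + start[:, None]

def linspace_nd(x, y, gridrez):
    a1 = create_ranges(x.min(axis=1), x.max(axis=1), N=gridrez)
    a2 = create_ranges(y.min(axis=1), y.max(axis=1), N=gridrez)
    out_shp = a1.shape + (a2.shape[1],)
    return (np.broadcast_to(a1[:, None, :], out_shp),
            np.broadcast_to(a2[:, :, None], out_shp))

def stacked_lstsq(L, b, rcond=1e-10):
    u, s, v = np.linalg.svd(L, full_matrices=False)
    inv_s = np.zeros_like(s)
    mask = s >= rcond * s.max(axis=-1, keepdims=True)
    inv_s[mask] = 1.0/s[mask]
    return np.einsum('...ji,...j->...i', v,
                     inv_s * np.einsum('...ji,...j->...i', u, b))

def vectorized_bi2Dlinter(x_vals, y_vals, z_vals, gridrez):
    X, Y = linspace_nd(x_vals, y_vals, gridrez)
    A = np.stack((x_vals, y_vals, np.ones_like(z_vals)), axis=2)
    C = stacked_lstsq(A, z_vals)
    n = C.shape[0]
    return (C[:, 0].reshape(n, 1, 1)*X
            + C[:, 1].reshape(n, 1, 1)*Y
            + C[:, 2].reshape(n, 1, 1))

rng = np.random.default_rng(42)
n_rows, gridrez = 50, 10
x_vals = rng.uniform(size=(n_rows, 5))
y_vals = rng.uniform(size=(n_rows, 5))
z_vals = rng.uniform(size=(n_rows, 5))

res = vectorized_bi2Dlinter(x_vals, y_vals, z_vals, gridrez)
medians = np.median(res, axis=(1, 2))  # per-row medians, as in the question's loop

# per-row reference fit with np.linalg.lstsq
for i in range(n_rows):
    A = np.c_[x_vals[i], y_vals[i], np.ones(5)]
    C, *_ = np.linalg.lstsq(A, z_vals[i], rcond=None)
    X, Y = np.meshgrid(np.linspace(x_vals[i].min(), x_vals[i].max(), gridrez),
                       np.linspace(y_vals[i].min(), y_vals[i].max(), gridrez))
    assert np.allclose(res[i], C[0]*X + C[1]*Y + C[2])
    assert np.isclose(medians[i], np.median(C[0]*X + C[1]*Y + C[2]))
```

Note that `np.median(res, axis=(1, 2))` reduces over both grid axes at once, recovering the ZZ column from the question without any Python-level loop.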

Testing on data with n = 10000 rows, the vectorized function is significantly faster.

%%timeit
ZZ = []
for index, row in df2.iterrows():
    x=row['x1'], row['x2'], row['x3'], row['x4'], row['x5']
    y=row['y1'], row['y2'], row['y3'], row['y4'], row['y5']
    z=row['z1'], row['z2'], row['z3'], row['z4'], row['z5']
    ZZ.append((bi2Dlinter(x,y,z,gridrez)))
df2['ZZ']=ZZ

Out: 5.52 s ± 17.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
res = vectorized_bi2Dlinter(x_vals,y_vals,z_vals,gridrez)

Out: 74.6 ms ± 159 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

You should pay careful attention to what's going on in this vectorized function and familiarize yourself with broadcasting in NumPy. I cannot take credit for the first three functions; here are the Stack Overflow answers they come from, for you to get an understanding:

  • Vectorized NumPy linspace for multiple start and stop values
  • how to solve many overdetermined systems of linear equations using vectorized codes?
  • How to use numpy.c_ properly for arrays

  • Great! Thanks this answer made my day. I added a graph showing the speed up and I will definitely dig into vectorizing. – rul30 Mar 28 '19 at 13:55
  • Nice one, for higher orders you could also use numexpr for the create_ranges function as seen in the Vectorized NumPy linspace answer attached. – Cr1064 Mar 28 '19 at 17:27
  • Oh wow, you did not only make my day, you also put together a nice curriculum: vectorization, broadcasting, numexpr, einsum etc. – rul30 Mar 28 '19 at 21:39
  • I posted a follow-up question to this solution because the two fits produce different results when increasing the number of rows. Any idea, what could cause this? – rul30 Mar 29 '19 at 10:15
  • I only checked the results for 1000 rows of data and they matched. I guess I could step through the function to check where it is going off – Cr1064 Mar 29 '19 at 10:54
  • I was wondering if the two lstsq were using different limiters or convergence criteria? – rul30 Mar 30 '19 at 12:36
  • I actually never noticed but the results are off by ~e-16 for me, I agree this is probably something to do with the precision in the stacked least squares function. – Cr1064 Apr 01 '19 at 08:48
  • mea culpa, I just realized an error in the averaging of the vectorized function. Having fixed this gives very good results (1e-17). Sorry again. – rul30 Apr 01 '19 at 09:55