
I have two large multidimensional arrays: Y carries three measurements for each of half a million objects (e.g. shape=(500000, 3)), and X has the same shape but contains the positions of the Y measurements.

As a first step, I would like to fit a polynomial to each row (each row is one object). I know that iterating over arrays is quite slow, but what I'm doing for the moment is:

fit = array([polyfit(X[i], Y[i], deg) for i in xrange(X.shape[0])])

My question is: is there any way to fit each row of both arrays without explicitly iterating over them?

phasselmann
  • Did you try something like apply along axis? – Donbeo Mar 19 '14 at 14:37
  • With 3 points you can only fit a linear or quadratic (exact fit) function. It is possible to calculate an explicit solution in both cases, which then can be vectorized to calculate for all rows at the same time. – Josef Mar 20 '14 at 02:58
  • Actually, three points is just an example; when I actually apply the method the dimension will be larger. – phasselmann Mar 20 '14 at 10:26
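
A minimal sketch of the vectorized explicit solution Josef's comment describes, assuming exactly 3 distinct positions per row and an exact quadratic fit (the array names mirror the question; this is an illustration, not code from the thread):

import numpy as np

n = 500000
X = np.random.random((n, 3))        # positions, one set per object
Y = np.random.random((n, 3))        # measurements

# Per-row Vandermonde matrices, shape (n, 3, 3): V[i, j, k] = X[i, j]**k
V = X[..., None] ** np.arange(3)

# Solve V @ c = y for all rows at once (requires the 3 positions in each row to be distinct);
# coeffs has shape (n, 3), ordered from the constant term up to the x**2 term
coeffs = np.linalg.solve(V, Y[..., None])[..., 0]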

2 Answers

1

It is possible to do so without iterating along the first axis. However, since your second axis is rather short (just 3 points), you can really fit no more than 2 coefficients.

In [67]:

import numpy as np
import scipy.optimize as so

In [68]:

def MD_ployError(p, x, y):
    '''Sum of squared residuals: x and y have shape (n, m); p is the flattened
    coefficient array of shape (n*d,), where d is the number of coefficients
    (degree + 1) per row, ordered from the constant term upward.'''
    p_rshp=p.reshape((x.shape[0], -1))   # one row of d coefficients per object
    f=y*1.                               # copy of y; residuals are accumulated in place
    for i in range(p_rshp.shape[1]):
        f-=p_rshp[:,i][:,np.newaxis]*(x**i)
    return (f**2).sum()

In [69]:

X=np.random.random((100, 6))             # 100 rows of 6 points each
Y=4+2*X+3*X*X                            # true model for every row: y = 4 + 2x + 3x^2
P=(np.zeros((100,3))+[1,1,1]).ravel()    # initial guess: 3 coefficients per row, flattened to shape (300,)

In [70]:

MD_ployError(P, X, Y)

Out[70]:
11012.2067606684

In [71]:

R=so.fmin_slsqp(MD_ployError, P, args=(X, Y))
Iteration limit exceeded    (Exit mode 9) #you can increase iteration limit, but the result is already good enough.
            Current function value: 0.00243784856039
            Iterations: 101
            Function evaluations: 30590
            Gradient evaluations: 101

In [72]:

R.reshape((100, -1))

Out[72]:
array([[ 3.94488512,  2.25402422,  2.74773571],
       [ 4.00474864,  1.97966551,  3.02010015],
       [ 3.99919559,  2.0032741 ,  2.99753804],
..............................................)
CT Zhu
  • Thanks Zhu. However, is this method faster than my single line implementation? – phasselmann Mar 20 '14 at 13:01
  • It depends. If you can provide a function to calculate the derivative of the error function it will be faster. If the initial guess parameters are 'good guesses', it will be faster. A different optimizer may be faster. But overall I think it may be overkill, especially if you don't set any constraints (say, the coefficients for x^2 have to be equal). I have been in the same situation and I chose `multiprocessing` (but I was fitting a very complex model to each row in that case). After all, the fitting for any one row is completely independent of the others. – CT Zhu Mar 21 '14 at 15:25
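
Regarding the derivative CT Zhu mentions above: a minimal sketch of an analytic gradient that could be passed to `fmin_slsqp` via its `fprime` argument. The gradient function below is an illustration added here, not part of the original answer; it assumes the same flattened parameter layout as `MD_ployError`.

def MD_ployError_grad(p, x, y):
    '''Gradient of MD_ployError with respect to the flattened coefficient array p.'''
    p_rshp = p.reshape((x.shape[0], -1))
    f = y * 1.
    for i in range(p_rshp.shape[1]):
        f -= p_rshp[:, i][:, np.newaxis] * (x ** i)
    # dE/dp[r, i] = -2 * sum_j f[r, j] * x[r, j]**i
    grad = np.empty_like(p_rshp)
    for i in range(p_rshp.shape[1]):
        grad[:, i] = -2. * (f * (x ** i)).sum(axis=1)
    return grad.ravel()

R = so.fmin_slsqp(MD_ployError, P, fprime=MD_ployError_grad, args=(X, Y))

With an analytic `fprime`, SLSQP no longer has to estimate the gradient numerically, which should cut the number of function evaluations considerably.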
0

Yes, if you use the new numpy polyfit from np.polynomial, not the old np.polyfit:

X = np.arange(3)
Y = np.random.rand(10000, 3)

fit = np.array([np.polyfit(X, y, 2) for y in Y])
fits = np.polynomial.polynomial.polyfit(X, Y.T, 2)

assert np.allclose(fit.T[::-1], fits)
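# np.polyfit returns coefficients from the highest power down; np.polynomial.polynomial.polyfit
# returns them lowest power first, hence the transpose and [::-1] above.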

Timing:

In [692]: timeit fit = np.array([np.polyfit(X, y, 2) for y in Y])
 1 loops, best of 3: 2.22 s per loop

In [693]:  timeit fits = np.polynomial.polynomial.polyfit(X, Y.T, 2)
100 loops, best of 3: 3.63 ms per loop
askewchan
  • I think the OP meant that `X` also has the shape `(10000,3)` and the result needs to be `(10000,d)` for `d` coefficients: fit each row of `y` against the corresponding row of `x` – CT Zhu Mar 19 '14 at 16:12
  • Actually, the X array must be multidimensional too. As I commented above, I want a more general solution. Dimension 3 was chosen just to formulate the question. – phasselmann Mar 20 '14 at 10:31
  • The dimensionality `3` is not the limit here, it's the shape of the arrays. That is, you have different `x` values for each row in `y`. It is common to have many sets of measurements (`y`) for one set of locations (`x`), but you have completely independent sets of measurements and locations. I don't think there is a simple way to fit them all at once using the builtin polynomial fitting; you have to define your own error function as @CT did. – askewchan Mar 20 '14 at 15:09
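
For the general case discussed here (every row has its own `x`, more points than coefficients, a least-squares rather than an exact fit), one vectorized alternative to a custom error function is to solve each row's normal equations with stacked arrays. A hedged sketch, where the sizes and names are illustrative and the normal equations can become ill-conditioned for high degrees:

import numpy as np

n, m, deg = 500000, 10, 2
X = np.random.random((n, m))                         # one set of positions per row
Y = np.random.random((n, m))                         # one set of measurements per row

V = X[..., None] ** np.arange(deg + 1)               # per-row Vandermonde, shape (n, m, deg+1)
A = np.einsum('nmi,nmj->nij', V, V)                  # V.T @ V for every row, shape (n, deg+1, deg+1)
b = np.einsum('nmi,nm->ni', V, Y)                    # V.T @ y for every row, shape (n, deg+1)
coeffs = np.linalg.solve(A, b[..., None])[..., 0]    # shape (n, deg+1), lowest power first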