2

I have a dataframe with independent variables in the column headers, and each rows is a seperate set of dependent variables:

    5.032530   6.972868   8.888268   10.732009  12.879130  16.877655
0   2.512298   2.132748   1.890665   1.583538   1.582968   1.440091
1   5.628667   4.206962   4.179009   3.162677   3.132448   1.887631
2   3.177090   2.274014   2.412432   2.066641   1.845065   1.574748
3   5.060260   3.793109   3.129861   2.617136   2.703114   1.921615
4   4.153010   3.354411   2.706463   2.570981   2.020634   1.646298

I would like to fit a curve of type Y=A*x^B to each row. I need to solve for A & B for about ~5000 rows, 6 datapoints in each row. I was able to do this using np.apply, but it takes about 40 seconds to do this. Can I speed up using Cython or by vectorizing somehow? I need precision to about 4 decimals

Here is what i have:

import pandas as pd
from scipy.optimize import curve_fit
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv(r'C:\File.csv')

def curvefita(y):
    return curve_fit(lambda x,a,b: a*np.power(x,b),  df.iloc[:,3:].columns,  y,p0=[8.4,-.58], bounds=([0,-10],[200,10]),maxfev=2000)[0][0]
def curvefitb(y):
    return curve_fit(lambda x,a,b: a*np.power(x,b),  df.iloc[:,3:].columns,  y,p0=[8.4,-.58], bounds=([0,-10],[200,10]),maxfev=2000)[0][1]

avalues = df.iloc[:,3:].apply(curvefita, axis=1)
bvalues = df.iloc[:,3:].apply(curvefitb, axis=1)
df['a']=avalues
df['b']=bvalues

colcount = len(df.columns)
#build power fit - make the matrix
powerfit = df.copy()
for column in range(colcount-2):
    powerfit.iloc[:,column] = powerfit.iloc[:,colcount-2] * (powerfit.columns[column]**powerfit.iloc[:,colcount-1])


#graph an example
plt.plot(powerfit.iloc[0,:colcount-2],'r')
plt.plot(df.iloc[0,:colcount-2],'ro')
#another example looked up by ticker
plt.plot(powerfit.iloc[5,:colcount-2],'b')
plt.plot(df.iloc[5,:colcount-2],'bo')
james
  • 41
  • 6
  • The `curve_fit` I have (in Python 3.6, Scipy 0.18.1), does not have a keyword argument `"maxfev"`. – Graipher Jul 10 '18 at 12:35
  • The use of `bounds` makes `curve_fit` apply *trf* method, which in most cases is much slower than default *lm*. So you can remove your bounds, and just in case, use keyword `method='lm'`. If all your data is positive you can aslo linearize your [equation](http://mathbench.umd.edu/modules/misc_scaling/page11.htm) and then carry out a much faster linear [fit](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.linalg.lstsq.html) – Brenlla Jul 10 '18 at 21:57
  • If you're fitting to such a function form, then perhaps doing a first-order polyfit via say `np.polyfit` to `log(x)` and `log(y)` might run faster. – saintsfan342000 Jul 11 '18 at 16:09
  • thanks for the suggestion Brenlla. removing the bounds halved the execution time from 20 to 10s! – james Jul 25 '18 at 02:47
  • @Brenlla my data is indeed all positive. can you help me with a linear fit? should is use numpy.polyfit of the rows of my dataframe as X values and Column headers as y values? – james Jul 25 '18 at 03:03

2 Answers2

3

You actually do two curve_fits per row, one for a and one for b. Try to find a way to insert both of them at the same time, so you can halve your execution time:

def func(x, a, b):
    return a * np.power(x, b)

def curvefit(y):
    return tuple(curve_fit(func, df.iloc[:,3:].columns, y ,p0=[8.4, -.58], bounds=([0, -10], [200, 10]))[0])

df[["a", "b"]] = df.iloc[:,3:].apply(curvefit, axis=1).apply(pd.Series)
print(df)
#     5.03253  6.972868  8.888268  10.732009  12.87913  16.877655          a  \
# 0  2.512298  2.132748  1.890665   1.583538  1.582968   1.440091   2.677070   
# 1  5.628667  4.206962  4.179009   3.162677  3.132448   1.887631  39.878792   
# 2  3.177090  2.274014  2.412432   2.066641  1.845065   1.574748   8.589886   
# 3  5.060260  3.793109  3.129861   2.617136  2.703114   1.921615  13.078827   
# 4  4.153010  3.354411  2.706463   2.570981  2.020634   1.646298  27.715207   

#           b  
# 0 -0.215338  
# 1 -1.044384  
# 2 -0.600827  
# 3 -0.656381  
# 4 -1.008753

And to make this more reusable, I would make curvefit also take the x-values and function, which can be passed in with functools.partial:

from functools import partial

def curvefit(func, x, y):
    return tuple(curve_fit(func, x, y ,p0=[8.4, -.58], bounds=([0, -10], [200, 10]))[0])

fit = partial(curvefit, func, df.iloc[:,3:].columns)
df[["a", "b"]] = df.iloc[:,3:].apply(fit, axis=1).apply(pd.Series)
Graipher
  • 6,891
  • 27
  • 47
  • thank you for the quick reply. I thought of doing that, but even at 20 seconds it is too slow for my purposes. Excel curve fits the 5,000 rows in <1second, but i'm trying to move away from using excel – james Jul 10 '18 at 12:50
  • thank you for the additional help with the functools.partial portion. this is very helpful! – james Jul 11 '18 at 13:14
  • @james: I edited the example a bit, making the order of the arguments of `curvefit` more natural and using positional arguments in the `partial` call instead of keyword arguments (this works as the arguments are filled from left to right and `y` is the last argument). – Graipher Jul 11 '18 at 13:23
  • thanks very much for the additional edits and help. Could I convert to use np.polyfit instead of scipy.optimize.curve_fit? – james Jul 26 '18 at 02:00
1

I was able to bring my runtime down to 550ms by following the advice of @Brenlla. This code uses an unweighted/biased formula similar to Excel, which is good enough for my purposes (@kennytm discusses it here)

df = pd.read_csv(r'C:\File.csv')
df2=np.log(df)
df3=df2.iloc[:,3:].copy()
df3.columns=np.log(df3.columns)

def curvefit(y):
    return tuple(np.polyfit(df3.columns, y ,1))

df[["b", "a"]] = df3.apply(curvefit,axis=1).apply(pd.Series)
df['a']=np.exp(df['a'])

colcount = len(df.columns)
powerfit = df.copy()
for column in range(colcount-2):
    powerfit.iloc[:,column] = powerfit.iloc[:,colcount-1] * (powerfit.columns[column]**powerfit.iloc[:,colcount-2])

#graph an example
plt.plot(powerfit.iloc[0,:colcount-2],'r')
plt.plot(df.iloc[0,:colcount-2],'ro')
#another example looked up by ticker
plt.plot(powerfit.iloc[5,:colcount-2],'b')
plt.plot(df.iloc[5,:colcount-2],'bo')
james
  • 41
  • 6