I have a financial dataset with ~2 million rows. I would like to import it as a pandas dataframe and add additional columns by applying row-wise functions that use some of the existing column values. For this purpose I would rather not use techniques like parallelization, Hadoop for Python, etc., which leaves me with the following:
I am already doing this along the lines of the example below, and performance is poor: ~24 minutes just to get through ~20K rows. Note: this is not the actual function, it is completely made up. For the additional columns I am calculating various financial option metrics. I suspect the slow speed is primarily due to iterating over all the rows, not the functions themselves, as they are fairly simple (e.g. calculating the price of an option). I know I can speed up little things in the functions themselves, such as using erf instead of the normal distribution, but for this purpose I want to focus on the holistic problem itself.
def func(alpha, beta, time, vol):
    px = (alpha * beta) / time * vol
    return px
# Method 1 (could also use itertuples here) - this is the one that takes ~24 minutes now
for idx, row in df.iterrows():
    df.loc[idx, 'px'] = func(alpha, beta, row['time'], row['vol'])
I have also tried vectorizing this, but I keep getting an error along the lines of 'cannot serialize float'.
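For reference, the fully vectorized form I was aiming for looked roughly like this (simplified to the toy function above, not my exact code; the error showed up with the real, more involved functions):

# whole-column arithmetic instead of a row-by-row loop (simplified sketch)
df['px'] = (alpha * beta) / df['time'] * df['vol']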
My thought is to try one of the following methods, but I am not sure which would theoretically be fastest. Are there non-linearities in how these scale, such that a test on 1,000 rows would not necessarily indicate which would be fastest across all 2 million rows? Probably a separate question, but should I focus on more efficient ways to manage the dataset rather than just on how the functions are applied?
# Alternative 1 (df.apply with existing function above)
df['px'] = df.apply(lambda row: func(alpha, beta, row['time'], row['vol']), axis=1)
# Alternative 2 (numba & jit)
from numba import jit

@jit
def func(alpha, beta, time, vol):
    px = (alpha * beta) / time * vol
    return px
# Alternative 3 (cython)
def func_cython(double alpha, double beta, double time, double vol):
    cdef double px
    px = (alpha * beta) / time * vol
    return px
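For Alternative 3 my assumption is that func_cython would live in a .pyx file (say func_cython.pyx, a made-up name) and get compiled, either with the %%cython cell magic in a notebook or with a minimal setup.py along these lines:

# setup.py - minimal build sketch for the Cython version
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("func_cython.pyx"))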
In the case of Cython and numba, would I still iterate over all the rows using df.apply? Or is there a more efficient way?
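For concreteness, my rough idea with numba was to skip df.apply entirely and loop over the underlying NumPy arrays inside the jitted function, something like the sketch below (func_nb is just an illustrative name, and I'm assuming the real calculations only use numba-supported operations):

import numpy as np
from numba import jit

@jit(nopython=True)  # assuming everything inside stays within numba-supported NumPy/math ops
def func_nb(alpha, beta, time_arr, vol_arr):
    out = np.empty(time_arr.shape[0])
    for i in range(time_arr.shape[0]):
        out[i] = (alpha * beta) / time_arr[i] * vol_arr[i]
    return out

# one call over the full columns instead of millions of Python-level calls
df['px'] = func_nb(alpha, beta, df['time'].to_numpy(), df['vol'].to_numpy())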
I have referenced the following and found them helpful for understanding the various options, but not for deciding what the 'best' way to do this is (though I suppose that ultimately depends on the application).
https://lectures.quantecon.org/py/need_for_speed.html
Speeding up a numpy loop in python?
http://www.devx.com/opensource/improve-python-performance-with-cython.html