I am working with fairly large dataframes in pandas. Although I am not necessarily interested in squeezing every last bit of performance out of my code, I am extremely surprised by the significant performance gap I observe between different versions of the same task: pandas only, a hybrid version combining pandas + dict, and numpy only. I am not as fluent in pandas as I am in numpy, so I would like to know whether pandas is intrinsically slower than numpy for specific tasks (but more elegant to write), or whether this may be due partly to my poor usage of pandas.
General problem
I have, at a monthly scale, an NxM matrix of floats (where N and M are about 1000 and 5000 respectively). These matrices are, for exogenous reasons, provided as pandas dataframes, one for each timestamp. At each date, I need to subtract from the matrix at that date the one at date-1, element-wise. Then I need to binarize this result in order to have 1 if the element-wise difference is positive, 0 otherwise. Finally, for each date, I need to sum along axis=1, i.e. along the rows of this binarized matrix. In what follows, dict_MVAdj is therefore a dictionary whose keys are dates and whose values are NxM dataframes. The dataframes may contain NaN values:
dict_MVAdj -> {datetime.datetime(2003, 2, 1, 0, 0):
          s1     s2      s3
f1   10000.0  192.0  9421.0
f2    4000.0    NaN  1340.0,
datetime.datetime(2003, 3, 1, 0, 0):
          s1     s2      s3
f1   24400.0  102.0  9121.0
f2   14000.0    3.3  1340.0
}
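For reference, the strategy snippets below rely on some scaffolding that is not shown: keySort is the list of dates in chronological order, and dictDelta / dictBinPos are empty dicts that collect the results. A minimal sketch of that setup, with a purely hypothetical synthetic dict_MVAdj of roughly the stated size in case someone wants to reproduce the timings (the real data is of course different), would be:

import datetime
import numpy as np
import pandas as pd

# hypothetical stand-in data: one N x M dataframe of floats per month, with a few NaNs
rng = np.random.RandomState(0)
N, M, T = 1000, 5000, 12
dict_MVAdj = {}
for t in range(T):
    data = rng.rand(N, M) * 10000.0
    data[rng.rand(N, M) < 0.01] = np.nan
    dict_MVAdj[datetime.datetime(2003, t + 1, 1)] = pd.DataFrame(data)

keySort = sorted(dict_MVAdj.keys())   # dates in chronological order
dictDelta, dictBinPos = {}, {}        # filled by each strategy below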
STRATEGY 1: numpy only
for i, key in enumerate(keySort):
    if i:
        dictDelta[key] = dict_MVAdj[keySort[i]].as_matrix() - dict_MVAdj[keySort[i-1]].as_matrix()
        dictBinPos[key] = (np.where(dictDelta[key] > 0, 1, 0)).sum(axis=1)
STRATEGY 2: dict of dataframes + pandas
for i, key in enumerate(keySort):
    if i:
        dictDelta[key] = dict_MVAdj[keySort[i]].subtract(dict_MVAdj[keySort[i-1]])
        dictBinPos[key] = pd.DataFrame(((dictDelta[key] > 0).astype(int)).sum(axis=1))
        # or alternatively this second way to binarize, but the performance is similar:
        dictBinPos[key] = dictDelta[key].where(dictDelta[key] > 0, 0).divide(np.abs(dictDelta[key])).sum(axis=1)
STRATEGY 3: only pandas
I concat my dict of DataFrames into a single multi-index DataFrame and then:
MVAdj = pd.concat(dict_MVAdj)
D_MVAdj = MVAdj.groupby(level=1).diff(1)
D_binPos = pd.DataFrame(((D_MVAdj > 0).astype(int)).sum(axis=1))
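D_binPos keeps the (date, row) multi-index, so to compare it with the per-date dicts of strategies 1 and 2 it has to be split back by date. A small sanity-check sketch (hypothetical helper name; note that the first date comes out as all zeros here because diff(1) yields NaN for it, whereas strategies 1 and 2 simply skip the first date):

# split the multi-indexed result back into a per-date dict for comparison
D_binPos_by_date = {
    d: grp.reset_index(level=0, drop=True)
    for d, grp in D_binPos.groupby(level=0)
}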
RESULTS
Strategy 1 outperforms strategies 2 and 3 by far. Let's say, in arbitrary time units, strategy 1 takes 1 time unit, strategy 2 (both versions) takes about 4-5 time units, while strategy 3 takes 8-10 time units. The use of the multi-index is one source of the bad performance of strategy 3, but the main source seems to be the poor performance of where (or something acting like a where) in pandas.
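To check whether the where-like binarization step really is the bottleneck, a micro-benchmark along these lines could isolate it on a single difference matrix (again a hypothetical stand-in array, not my real data):

import timeit
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
delta_np = rng.randn(1000, 5000)        # stand-in for one monthly difference
delta_df = pd.DataFrame(delta_np)

# numpy-style binarization + row sum (as in strategy 1)
t_np = timeit.timeit(lambda: np.where(delta_np > 0, 1, 0).sum(axis=1), number=20)
# pandas boolean comparison + astype (as in strategies 2 and 3)
t_astype = timeit.timeit(lambda: (delta_df > 0).astype(int).sum(axis=1), number=20)
# pandas where + divide (the alternative binarization in strategy 2)
t_where = timeit.timeit(
    lambda: delta_df.where(delta_df > 0, 0).divide(delta_df.abs()).sum(axis=1),
    number=20)
print(t_np, t_astype, t_where)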