
I am working with fairly large dataframes in pandas. Although I am not trying to squeeze every last bit of performance out of my code, I am surprised by the significant performance gap I observe between different versions of the same task: pandas only, a hybrid version combining pandas + dict, and numpy only. I am not as fluent in pandas as I am in numpy, so I would like to know whether pandas is intrinsically slower than numpy for specific tasks (while being more elegant to write), or whether this is partly due to my poor usage of pandas.

General problem

I have, at a monthly scale, an NxM matrix of floats (where N and M are about 1000 and 5000 respectively). These matrices are, for exogenous reasons, provided as pandas dataframes, one for each timestamp. At each date, I need to subtract, element-wise, the matrix at the previous date from the matrix at that date. Then I need to binarize the result: 1 if the element-wise difference is positive, 0 otherwise. Finally, for each date, I need to sum this binarized matrix along axis=1, i.e. along rows. In what follows, dict_MVAdj is therefore a dictionary whose keys are dates and whose values are NxM dataframes. The dataframes may contain NaN values:

dict_MVAdj  -> {datetime.datetime(2003, 2, 1, 0, 0):
         s1     s2       s3 
f1  10000.0  192.0   9421.0       
f2   4000.0    NaN   1340.0, 

 datetime.datetime(2003, 3, 1, 0, 0):      
         s1     s2       s3 
f1  24400.0  102.0   9121.0       
f2  14000.0    3.3   1340.0
} 
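
For reference, here is a small, self-contained sketch that rebuilds this sample structure together with the helper variables used in the snippets below (keySort, dictDelta, dictBinPos). The construction itself is only an illustration of the setup:

import datetime
import numpy as np
import pandas as pd

# Illustrative reconstruction of the two sample dataframes shown above
dict_MVAdj = {
    datetime.datetime(2003, 2, 1): pd.DataFrame(
        [[10000.0, 192.0, 9421.0],
         [ 4000.0, np.nan, 1340.0]],
        index=['f1', 'f2'], columns=['s1', 's2', 's3']),
    datetime.datetime(2003, 3, 1): pd.DataFrame(
        [[24400.0, 102.0, 9121.0],
         [14000.0,   3.3, 1340.0]],
        index=['f1', 'f2'], columns=['s1', 's2', 's3']),
}

keySort = sorted(dict_MVAdj)   # dates in chronological order
dictDelta, dictBinPos = {}, {}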

STRATEGY 1: numpy only

for i, key in enumerate(keySort):
    if i:  # skip the first date, which has no predecessor
        # difference between the current and the previous matrix, as numpy arrays
        dictDelta[key] = dict_MVAdj[keySort[i]].as_matrix() - dict_MVAdj[keySort[i-1]].as_matrix()
        # binarize (1 where the difference is positive) and sum along rows
        dictBinPos[key] = np.where(dictDelta[key] > 0, 1, 0).sum(axis=1)

STRATEGY 2: dict of dataframes + pandas

for i, key in enumerate(keySort):
    if i:
        # element-wise difference between the current and the previous dataframe
        dictDelta[key] = dict_MVAdj[keySort[i]].subtract(dict_MVAdj[keySort[i-1]])
        # binarize and sum along rows
        dictBinPos[key] = pd.DataFrame(((dictDelta[key] > 0).astype(int)).sum(axis=1))

        # or, alternatively, this second way of binarizing, but the performance is similar
        dictBinPos[key] = dictDelta[key].where(dictDelta[key] > 0, 0).divide(np.abs(dictDelta[key])).sum(axis=1)

STRATEGY 3: only pandas

I concatenate my dict of DataFrames into a single multi-index DataFrame and then:

MVAdj = pd.concat(dict_MVAdj)                # multi-index (date, row label)
D_MVAdj = MVAdj.groupby(level=1).diff(1)     # month-on-month difference within each row label
D_binPos = pd.DataFrame(((D_MVAdj > 0).astype(int)).sum(axis=1))

RESULTS

Strategy 1 outperforms strategies 2 and 3 by far. Say that, in arbitrary time units, strategy 1 takes 1 time unit; then strategy 2 (both versions) takes about 4-5 time units, while strategy 3 takes 8-10 time units. The use of the multi-index is one of the sources of the bad performance of strategy 3, but the main source seems to be the poor performance of where (or anything acting like a where) in pandas.
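
To isolate the binarization step, here is a minimal sketch of the kind of micro-benchmark I have in mind (the array size is taken from the problem statement, the number of repetitions is arbitrary, and the exact ratios will of course depend on the machine):

import timeit
import numpy as np
import pandas as pd

arr = np.random.randn(1000, 5000)   # N x M as in the problem statement
df = pd.DataFrame(arr)

# numpy-style binarization + row sum on the raw array
t_np = timeit.timeit(lambda: np.where(arr > 0, 1, 0).sum(axis=1), number=20)
# equivalent pandas expression on the dataframe
t_pd = timeit.timeit(lambda: (df > 0).astype(int).sum(axis=1), number=20)

print(t_np, t_pd)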

SirC
  • I have no answer to this, but I can confirm that I've seen similar slowdowns. I often wind up just using `numpy`. I'd love to hear what I've been doing wrong! – senderle Aug 02 '17 at 16:37
  • Numpy will almost always be faster (Pandas is basically a fancy wrapper around it for aligning columns and indices). *However*, most performance issues usually arise from bad code. Perhaps you could post a small sample of your data? [MCVE] – Alexander Aug 02 '17 at 16:45
  • @Alexander how can I post a sample of my data? – SirC Aug 02 '17 at 16:51
  • Use three dates with a small NxM (e.g. 3x2), and include expected output. See https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – Alexander Aug 02 '17 at 16:52
  • @Alexander I edited my post with an example of the structure of my dictionary of dataframes with 2 dates and dataframes 2x3 – SirC Aug 02 '17 at 16:59
  • The Pandas csv reader is reputed to be faster than the numpy ones, largely because it uses compiled code. That may be true of some other Pandas functions. But otherwise I believe Pandas uses numpy arrays and Python objects. Also it appears that Pandas readily switches to using object dtype for the arrays, which is always slower than numeric dtype. – hpaulj Aug 02 '17 at 17:00
  • For your `pandas` version, why do you need to use `astype(int)`, you don't do that in your `numpy` version...? Also, why `pd.DataFrame(` constructor wrapping everything? As an aside, is `as_matrix()` really necessary in your `numpy` version? – juanpa.arrivillaga Aug 02 '17 at 17:08
  • @juanpa.arrivillaga yes, because I need to turn my dataframe into a numpy array, and despite this cast, strategy 1 is the quickest. Do you know another way to turn a df into an np array? – SirC Aug 02 '17 at 17:11
  • @SirC well, my inclination would be to use `df.values`, and I just checked out the docs on that method, which I'd never really used, and it says `This method is provided for backwards compatibility. Generally, it is recommended to use '.values'.`. It is essentially equivalent to `.values`, but at first I thought it was returning a `numpy.matrix` instead of a `numpy.array` – juanpa.arrivillaga Aug 02 '17 at 17:14
  • @SirC but the cast seems completely unnecessary, a `boolean` dtype is essentially what you want anyway... it's just more compact than `int` by a factor of 4-8! Also, why the call to the constructor, `pd.DataFrame(`? – juanpa.arrivillaga Aug 02 '17 at 17:16
  • But yeah, `pandas.DataFrame.where` is going to be slower than `numpy.where`. They are not the same, semantically, and `pandas` `.where` builds a *new `pandas` object*. – juanpa.arrivillaga Aug 02 '17 at 17:18
  • @jean-francois fabre, I'd say that the question is an expansion/complementary discussion rather than a duplicate as the original question you mention is general while here we end up discussing the difference of `where` in numpy and pandas. Perhaps they should be merged! – SirC Aug 03 '17 at 07:23

0 Answers