
I am trying to speed up the sum of several big multilevel dataframes.

Here is a sample:

df1 = mul_df(5000,30,400) # mul_df to create a big multilevel dataframe
#let df2, df3, df4 = df1, df1, df1 to minimize the memory usage, 
#they can also be mul_df(5000,30,400) 
df2, df3, df4 = df1, df1, df1

In [12]: timeit df1+df2+df3+df4
1 loops, best of 3: 993 ms per loop

I am not satisfied with the 993 ms. Is there any way to speed this up? Can cython improve the performance? If yes, how should the cython code be written? Thanks.

Note: mul_df() is the function to create the demo multilevel dataframe.

import itertools
import numpy as np
import pandas as pd

def mul_df(level1_rownum, level2_rownum, col_num, data_ty='float32'):
    ''' create multilevel dataframe, for example: mul_df(4,2,6)'''

    index_name = ['STK_ID','RPT_Date']
    col_name = ['COL'+str(x).zfill(3) for x in range(col_num)]

    first_level_dt = [['A'+str(x).zfill(4)]*level2_rownum for x in range(level1_rownum)]
    first_level_dt = list(itertools.chain(*first_level_dt)) #flatten the list
    second_level_dt = ['B'+str(x).zfill(3) for x in range(level2_rownum)]*level1_rownum

    dt = pd.DataFrame(np.random.randn(level1_rownum*level2_rownum, col_num), columns=col_name, dtype = data_ty)
    dt[index_name[0]] = first_level_dt
    dt[index_name[1]] = second_level_dt

    rst = dt.set_index(index_name, drop=True, inplace=False)
    return rst

Update:

Data on my Pentium Dual-Core T4200 @ 2.00GHz, 3.00GB RAM, Windows XP, Python 2.7.4, Numpy 1.7.1, Pandas 0.11.0, numexpr 2.0.1 (Anaconda 1.5.0 (32-bit))

In [1]: from pandas.core import expressions as expr
In [2]: import numexpr as ne

In [3]: df1 = mul_df(5000,30,400)
In [4]: df2, df3, df4 = df1, df1, df1

In [5]: expr.set_use_numexpr(False)
In [6]: %timeit df1+df2+df3+df4
1 loops, best of 3: 1.06 s per loop

In [7]: expr.set_use_numexpr(True)
In [8]: %timeit df1+df2+df3+df4
1 loops, best of 3: 986 ms per loop

In [9]: %timeit  DataFrame(ne.evaluate('df1+df2+df3+df4'),columns=df1.columns,index=df1.index,dtype='float32')
1 loops, best of 3: 388 ms per loop
Jeff Tratner
bigbug
  • For information on the `pd.eval()` family of functions, their features and use cases (along with the difference between `pd.eval()` and `df.eval()`), please visit [Dynamic Expression Evaluation in pandas using pd.eval()](https://stackoverflow.com/questions/53779986/dynamic-expression-evaluation-in-pandas-using-pd-eval). – cs95 Dec 16 '18 at 04:51
  • What is your use case? I suppose you want to sum several dataframes having the same index / multiindex. Note, pandas is faster with numexpr, iff you add dataframes with the same index. Otherwise not. With same index I mean having it also sorted in the same way. – thomas Jun 17 '20 at 11:21
  • I think you cannot expect much performance gain with two cores only. – thomas Jun 17 '20 at 12:23

2 Answers


method 1: On my machine this is not so bad (with numexpr disabled)

In [41]: from pandas.core import expressions as expr

In [42]: expr.set_use_numexpr(False)

In [43]: %timeit df1+df2+df3+df4
1 loops, best of 3: 349 ms per loop

method 2: Using numexpr (which is enabled by default if numexpr is installed)

In [44]: expr.set_use_numexpr(True)

In [45]: %timeit df1+df2+df3+df4
10 loops, best of 3: 173 ms per loop

method 3: Direct use of numexpr

In [34]: import numexpr as ne

In [46]: %timeit  DataFrame(ne.evaluate('df1+df2+df3+df4'),columns=df1.columns,index=df1.index,dtype='float32')
10 loops, best of 3: 47.7 ms per loop

These speedups are achieved using numexpr because:

  • it avoids intermediate temporary arrays (which, in the case you present, is probably quite inefficient in numpy; I suspect your expression is being evaluated like ((df1+df2)+df3)+df4)
  • it uses multiple cores as available
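The first bullet can be illustrated with numpy alone (this is not the numexpr path itself): chained + builds a fresh temporary array at every step, whereas accumulating into one buffer with np.add(..., out=...) avoids them.

```python
import numpy as np

# Four operand arrays standing in for the DataFrames' float32 blocks.
a, b, c, d = (np.random.randn(200, 50).astype('float32') for _ in range(4))

# Chained '+' is evaluated as ((a+b)+c)+d, allocating a temporary per step.
chained = a + b + c + d

# Accumulating in place reuses a single output buffer instead.
out = a.copy()
for operand in (b, c, d):
    np.add(out, operand, out=out)

assert np.allclose(chained, out)
```

numexpr goes one step further and fuses the whole expression into a single multi-threaded pass over the data.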

As I hinted above, pandas uses numexpr under the hood for certain types of ops (in 0.11), e.g. df1 + df2 would be evaluated this way; however, the example you are giving here will result in several calls to numexpr (this is why method 2 is faster than method 1). Using ne.evaluate(...) directly (method 3) achieves even more speedup.

Note that in pandas 0.13 (0.12 will be released this week), we have implemented a function pd.eval which will in effect do exactly what my example above does. Stay tuned (if you are adventurous, this will be in master somewhat soon: https://github.com/pydata/pandas/pull/4037)

In [5]: %timeit pd.eval('df1+df2+df3+df4')
10 loops, best of 3: 50.9 ms per loop

Lastly, to answer your question: cython will not help here at all; numexpr is quite efficient at this type of problem (that said, there are situations where cython is helpful).

One caveat: to use the direct numexpr method, the frames must already be aligned (numexpr operates on the numpy arrays and knows nothing about the indices). They should also be of a single dtype.
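Given that caveat, one could verify the preconditions before reaching for raw numexpr. The helper below (safe_for_raw_numexpr is a hypothetical name, not a pandas API) sketches such a check under the assumption that all frames must share the exact index/column ordering and one common dtype:

```python
import numpy as np
import pandas as pd

def safe_for_raw_numexpr(frames):
    """Return True only if every frame shares the first frame's exact
    index/column ordering and all frames hold one common dtype -- the
    preconditions for feeding their .values straight to numexpr."""
    first = frames[0]
    # each frame internally holds a single dtype ...
    single_dtype = all(len(set(f.dtypes)) == 1 for f in frames)
    # ... and that dtype is the same across frames
    same_dtype = len({f.dtypes.iloc[0] for f in frames}) == 1
    # Index.equals checks labels *and* their ordering
    aligned = all(f.index.equals(first.index) and f.columns.equals(first.columns)
                  for f in frames[1:])
    return single_dtype and same_dtype and aligned

df_a = pd.DataFrame(np.ones((4, 2), dtype='float32'),
                    index=list('wxyz'), columns=['c0', 'c1'])
df_b = df_a.copy()
print(safe_for_raw_numexpr([df_a, df_b]))                              # True
print(safe_for_raw_numexpr([df_a, df_b.sort_index(ascending=False)]))  # False
```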

Jeff
    @bigbug Take the leap and try `eval`! It does alignment for you. Currently the `dtype` must at least be numeric, the caveat *there* is that mixed numeric `dtype` frames will be slower, especially large ones since a new array must be allocated to hold the promoted, concatenated array, but they should still be faster than using direct `numexpr` and plain ol' Python. – Phillip Cloud Jul 01 '13 at 00:48
  • @Jeff, thanks for the tips. I tested your methods on my machine (an old machine I bought 4 years ago), and the data is posted in the 'Update' section. The performance difference between turning set_use_numexpr() on and off is not much, while directly calling ne.evaluate('df1+df2+df3+df4') did give a 2.5X improvement. – bigbug Jul 01 '13 at 04:35
  • @cpcloud. I will try eval() after I figure out how to import the eval function. Thanks. – bigbug Jul 01 '13 at 04:36
  • @Jeff, if df2, df3, df4 are not the same as df1, then DataFrame(ne.evaluate('df1+df2+df3+df4'),columns=df1.columns,index=df1.index,dtype='float32') will not boost the speed that much. I suppose it is due to the cache. – bigbug Jul 01 '13 at 10:09
  • @bigbug looking at your config, you are running 32-bit and have a small amount of memory. I bet you are swapping. – Jeff Jul 01 '13 at 11:09
  • @Jeff, I reduced the dataframe size to df_i = mul_df(2000, 30, 200); methods 1-3 cost 249ms, 206ms, 103ms per loop. – bigbug Jul 01 '13 at 12:03
  • Using pandas DataFrames in numerical expressions to be evaluated with numexpr leads to wrong results if the indices are not aligned, for example if they have the same labels but not the same ordering/sorting. You still get a result, but it is wrong. I suggest not to operate on DataFrames this way. – thomas Jun 17 '20 at 19:52

Other Observations

  • You cannot expect much speedup if you have only 2 cores on your machine. In the end numexpr relies on parallelisation and performant use of the CPU cache.
  • What you do is, to some extent, wrong. numexpr on DataFrames is fast, but wrong: it does not return the right result if the DataFrames are not identically indexed. Even a different sort order causes trouble, which I show below.
  • If you add DataFrames with different indexes, the whole operation is not that performant anymore. Pandas does quite a good job of adding the proper rows for you by looking up the corresponding index entries, but this comes at a natural cost.
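The label lookup mentioned in the last bullet can be seen on a toy pair of frames whose indices only partially overlap: pandas outer-aligns on the labels and fills labels present in only one frame with NaN. A minimal sketch:

```python
import numpy as np
import pandas as pd

s1 = pd.DataFrame({'v': [1.0, 2.0, 3.0]}, index=['a', 'b', 'c'])
s2 = pd.DataFrame({'v': [10.0, 20.0, 30.0]}, index=['b', 'c', 'd'])

# pandas matches rows by label; 'a' and 'd' exist in only one frame.
total = s1 + s2
print(total['v'].tolist())   # [nan, 12.0, 23.0, nan]
```

This bookkeeping is exactly the work that raw numpy or numexpr skips, which is why it is faster but only correct for identically indexed frames.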

My observations follow:

  • First, I reproduce your test case and come to different results: using numexpr under the hood of pandas increases performance significantly.
  • Second, I sort one of the four DataFrames in descending order and rerun all cases. The performance breaks down, and, as expected, numexpr evaluation on pandas DataFrames leads to wrong results.

Equal Indices on all Frames

This case reproduces yours. The only difference is that I create copies of the initial DataFrame instance, so nothing is shared: distinct objects (ids) are in use, to make sure that numexpr can deal with them.

import itertools
import numpy as np
import pandas as pd

def mul_df(level1_rownum, level2_rownum, col_num, data_ty='float32'):
    ''' create multilevel dataframe, for example: mul_df(4,2,6)'''

    index_name = ['STK_ID','RPT_Date']
    col_name = ['COL'+str(x).zfill(3) for x in range(col_num)]

    first_level_dt = [['A'+str(x).zfill(4)]*level2_rownum for x in range(level1_rownum)]
    first_level_dt = list(itertools.chain(*first_level_dt)) #flatten the list
    second_level_dt = ['B'+str(x).zfill(3) for x in range(level2_rownum)]*level1_rownum

    dt = pd.DataFrame(np.random.randn(level1_rownum*level2_rownum, col_num), columns=col_name, dtype = data_ty)
    dt[index_name[0]] = first_level_dt
    dt[index_name[1]] = second_level_dt

    rst = dt.set_index(index_name, drop=True, inplace=False)
    return rst
df1 = mul_df(5000,30,400)
df2, df3, df4 = df1.copy(), df1.copy(), df1.copy()

pd.options.compute.use_numexpr = False

%%timeit
df1 + df2 + df3 + df4
564 ms ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

pd.options.compute.use_numexpr = True

%%timeit
df1 + df2 + df3 + df4
152 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

import numexpr as ne

%%timeit
pd.DataFrame(ne.evaluate('df1 + df2 + df3 + df4'), columns=df1.columns, index=df1.index, dtype='float32')
66.4 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

(df1 + df2 + df3 + df4).equals(pd.DataFrame(ne.evaluate('df1 + df2 + df3 + df4'), columns=df1.columns, index=df1.index, dtype='float32'))
True

(Slightly) Different Indices on some Frames

Here I sort one of the DataFrames in descending order, thereby changing the index and reshuffling the rows in the DataFrame's internal numpy array.

# imports and mul_df exactly as in the previous section
df1 = mul_df(5000,30,400)
df2, df3, df4 = df1.copy(), df1.copy(), df1.copy().sort_index(ascending=False)

pd.options.compute.use_numexpr = False

%%timeit
df1 + df2 + df3 + df4
1.36 s ± 67.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

pd.options.compute.use_numexpr = True

%%timeit
df1 + df2 + df3 + df4
928 ms ± 39.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

import numexpr as ne

%%timeit
pd.DataFrame(ne.evaluate('df1 + df2 + df3 + df4'), columns=df1.columns, index=df1.index, dtype='float32')
68 ms ± 2.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

(df1 + df2 + df3 + df4).equals(pd.DataFrame(ne.evaluate('df1 + df2 + df3 + df4'), columns=df1.columns, index=df1.index, dtype='float32'))
False

Conclusions

By using numexpr:

  • Quite some speedup is gained when operating on identically indexed DataFrames.
  • The same is true for other expressions over a single DataFrame, such as 2 * df1.
  • This is not the case for operations between DataFrames with different indices.
  • It can even lead to completely wrong results if one evaluates expressions containing pandas DataFrames directly. By chance they can be right, but numexpr is made for optimizing expressions on numpy arrays, not on aligned pandas objects.
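The wrong-result case can be reproduced without numexpr at all: operating on the raw .values arrays, as ne.evaluate effectively does, adds rows positionally instead of by label. A small sketch:

```python
import pandas as pd

df_a = pd.DataFrame({'v': [1.0, 2.0, 3.0]}, index=['a', 'b', 'c'])
df_b = df_a.sort_index(ascending=False)   # same labels, rows reversed

aligned = df_a + df_b             # pandas matches labels: a+a, b+b, c+c
raw = df_a.values + df_b.values   # positional: 1+3, 2+2, 3+1

print(aligned['v'].tolist())   # [2.0, 4.0, 6.0]
print(raw.ravel().tolist())    # [4.0, 4.0, 4.0]
```

The raw result silently differs from the label-aligned one, which is exactly the failure mode shown by the equals(...) == False check above.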
thomas