Other Observations
- You cannot expect more speedup if you have only 2 cores on your machine. In the end numexpression relies on parallelisation and the performant usage of the cpu cache.
- What you do is to some extend wrong. Numexpressions on DataFrames are fast, but wrong. They do not return the right result, if the DataFrames are not equally indexed. Different sorting will already trouble you, which I show below.
- If you add DataFrames with different indexes, the whole stuff is not that performant anymore. Well, Pandas does quite a good job to add the proper rows for you by looking up the corrsponding index entries. This comes with the natural cost.
In the following my Observations:
- First, I reproduce your test case and come to other results. Using numexpression under the hood of Pandas increases performance significantly.
- Second, I sort one of the four DataFrames in descending order and rerun all cases. The performance breaks, and additionally, (as expected) numexpression evaluation on Pandas DataFrames leads to wrong results.
Equal Indices on all Frames
This case reproduces your case. The only difference is, that I create copies of the inital DataFrame instance. So there is nothing shared. There are different objects (ids) in use to make sure, that numexpression can deal with it.
import itertools
import numpy as np
import pandas as pd
def mul_df(level1_rownum, level2_rownum, col_num, data_ty='float32'):
''' create multilevel dataframe, for example: mul_df(4,2,6)'''
index_name = ['STK_ID','RPT_Date']
col_name = ['COL'+str(x).zfill(3) for x in range(col_num)]
first_level_dt = [['A'+str(x).zfill(4)]*level2_rownum for x in range(level1_rownum)]
first_level_dt = list(itertools.chain(*first_level_dt)) #flatten the list
second_level_dt = ['B'+str(x).zfill(3) for x in range(level2_rownum)]*level1_rownum
dt = pd.DataFrame(np.random.randn(level1_rownum*level2_rownum, col_num), columns=col_name, dtype = data_ty)
dt[index_name[0]] = first_level_dt
dt[index_name[1]] = second_level_dt
rst = dt.set_index(index_name, drop=True, inplace=False)
return rst
df1 = mul_df(5000,30,400)
df2, df3, df4 = df1.copy(), df1.copy(), df1.copy()
pd.options.compute.use_numexpr = False
%%timeit
df1 + df2 + df3 + df4
564 ms ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
pd.options.compute.use_numexpr = True
%%timeit
df1 + df2 + df3 + df4
152 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
import numexpr as ne
%%timeit
pd.DataFrame(ne.evaluate('df1 + df2 + df3 + df4'), columns=df1.columns, index=df1.index, dtype='float32')
66.4 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
(df1 + df2 + df3 + df4).equals(pd.DataFrame(ne.evaluate('df1 + df2 + df3 + df4'), columns=df1.columns, index=df1.index, dtype='float32'))
True
(Slightly) Different Indices on some Frames
Here I sort one of the DataFrames in descending order, therefore changing the index and reshuffling the rows in the dataframe internal numpy array.
import itertools
import numpy as np
import pandas as pd
def mul_df(level1_rownum, level2_rownum, col_num, data_ty='float32'):
''' create multilevel dataframe, for example: mul_df(4,2,6)'''
index_name = ['STK_ID','RPT_Date']
col_name = ['COL'+str(x).zfill(3) for x in range(col_num)]
first_level_dt = [['A'+str(x).zfill(4)]*level2_rownum for x in range(level1_rownum)]
first_level_dt = list(itertools.chain(*first_level_dt)) #flatten the list
second_level_dt = ['B'+str(x).zfill(3) for x in range(level2_rownum)]*level1_rownum
dt = pd.DataFrame(np.random.randn(level1_rownum*level2_rownum, col_num), columns=col_name, dtype = data_ty)
dt[index_name[0]] = first_level_dt
dt[index_name[1]] = second_level_dt
rst = dt.set_index(index_name, drop=True, inplace=False)
return rst
df1 = mul_df(5000,30,400)
df2, df3, df4 = df1.copy(), df1.copy(), df1.copy().sort_index(ascending=False)
pd.options.compute.use_numexpr = False
%%timeit
df1 + df2 + df3 + df4
1.36 s ± 67.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
pd.options.compute.use_numexpr = True
%%timeit
df1 + df2 + df3 + df4
928 ms ± 39.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
import numexpr as ne
%%timeit
pd.DataFrame(ne.evaluate('df1 + df2 + df3 + df4'), columns=df1.columns, index=df1.index, dtype='float32')
68 ms ± 2.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
(df1 + df2 + df3 + df4).equals(pd.DataFrame(ne.evaluate('df1 + df2 + df3 + df4'), columns=df1.columns, index=df1.index, dtype='float32'))
False
Conclusions
By using numexpr
- Quite some speedup is gained when operating on equally indexed DataFrames.
- The same is true if you have other expressions with a single dataframe, as
2 * df1
.
- This is not the case if operations between DataFrames with different indices are used.
- It leads even to completely wrong results if one evaluates expressions containing Pandas DataFrames. By chance they can be right. But numexpression is made for optimizing expressions on Numpy arrays.