I have a few dozen conditions (e.g., foo > bar) that I need to evaluate on ~1 MM rows of a DataFrame, and the most concise way of writing this is to store these conditions as a list of strings and create a DataFrame of boolean results (one row per record x one column per condition). (User input is not being evaluated.)
In the quest for premature optimization, I am trying to determine whether I should write these conditions for evaluation by pandas (e.g., df.eval("foo > bar")) or just leave them to Python, as in eval("df.foo > df.bar").
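To make the two styles concrete, here is a minimal sketch of what I mean (the column names and condition strings are placeholders; the real list has a few dozen entries):

import pandas as pd
import numpy as np

df = pd.DataFrame({'foo': np.random.uniform(size=5),
                   'bar': np.random.uniform(size=5)})

# Style 1: let DataFrame.eval() parse the condition strings
conditions = ["foo > bar", "foo * 2 > bar"]  # placeholder conditions
results_df_eval = pd.DataFrame({c: df.eval(c) for c in conditions})

# Style 2: plain Python eval() of the equivalent expressions
py_conditions = ["df.foo > df.bar", "df.foo * 2 > df.bar"]
results_py_eval = pd.DataFrame({c: eval(c) for c in py_conditions})

Either way I end up with one boolean column per condition and one row per record.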
According to the documentation on enhancing eval performance:
You should not use eval() for simple expressions or for expressions involving small DataFrames. In fact, eval() is many orders of magnitude slower for smaller expressions/objects than plain ol’ Python. A good rule of thumb is to only use eval() when you have a DataFrame with more than 10,000 rows.
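That overhead is easy to see on a small frame; this is roughly how I read the rule of thumb (timings omitted, since they will vary by machine):

import pandas as pd
import numpy as np

small = pd.DataFrame({'a': np.random.uniform(size=100),
                      'b': np.random.uniform(size=100)})

# On a frame this small, the parse/setup cost of eval() dominates,
# so the plain column comparison is expected to win by a wide margin.
%timeit small.eval("a > b")
%timeit small['a'] > small['b']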
It would be nice to be able to use the df.eval("foo > bar") syntax, because my list would be a little more readable, but I can't ever find a case where it's not slower to evaluate. The documentation shows examples where pandas.eval() is faster than Python eval() (which matches my experience) but none for DataFrame.eval() (which is listed as 'Experimental').
For example, DataFrame.eval() is still a clear loser on a not-simple expression over a large-ish DataFrame:
import pandas as pd
import numpy as np
import numexpr
import timeit
someDf = pd.DataFrame({'a':np.random.uniform(size=int(1e6)), 'b':np.random.uniform(size=int(1e6))})
%timeit -n100 someDf.eval("a**b - a*b > b**a - b/a") # DataFrame.eval() on notional expression
%timeit -n100 eval("someDf['a']**someDf['b'] - someDf['a']*someDf['b'] > someDf['b']**someDf['a'] - someDf['b']/someDf['a']")
%timeit -n100 pd.eval("someDf.a**someDf.b - someDf.a*someDf.b > someDf.b**someDf.a - someDf.b/someDf.a")
100 loops, best of 3: 29.9 ms per loop
100 loops, best of 3: 18.7 ms per loop
100 loops, best of 3: 15.4 ms per loop
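(One thing I have not isolated is how much of the gap is string-parsing overhead versus the numexpr backend; both DataFrame.eval() and pandas.eval() accept an engine argument, so a follow-up comparison like the sketch below might tease that apart, though I haven't included those timings here.)

# Hypothetical follow-up: same expressions, forcing the pure-Python engine
# to separate parsing cost from numexpr's evaluation cost.
%timeit -n100 someDf.eval("a**b - a*b > b**a - b/a", engine='python')
%timeit -n100 pd.eval("someDf.a**someDf.b - someDf.a*someDf.b > someDf.b**someDf.a - someDf.b/someDf.a", engine='python')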
So is the benefit of DataFrame.eval() merely in simplifying the input, or can we identify circumstances where using this method is actually faster? Are there any other guidelines for when to use which eval()? (I'm aware that pandas.eval() does not support the complete set of operations.)
pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US
pandas: 0.18.0
nose: 1.3.7
pip: 8.1.2
setuptools: 20.3
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.0
statsmodels: None
xarray: None
IPython: 4.1.2
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.5.3
pytz: 2016.2
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0