
I'm abstracting a lot of pandas functionality for an application I'm building. I have a scenario where I need to get the values from the target columns in the first row where an unknown number of conditions all evaluate to true. Something like this:

import pandas as pd

# Let's say this DataFrame already has data in it, with column names and a numeric row index.
df = pd.DataFrame()

targetCols: list = [targetColName1, targetColName2]
compCols: list = [compColName1, compColName2]
compVals: list = [compVals1, compVals2]

# Some function that concatenates compCols and compVals into a string
# that looks like:
#   (f"df[(df['{compCols[0]}'] == {compVals[0]}) & "
#    f"(df['{compCols[1]}'] == {compVals[1]})]")
strExeString = ConcatenateExecutionString(compCols, compVals)

dfView = pd.eval(strExeString)
dfView = dfView[targetCols]
return dfView.iloc[0]  # first matching row

However, the above causes me to scan through the data set repeatedly. I want to scan the DataFrame just once, stopping as soon as my conditions are met, and then extract the target columns. Can this be done with pandas? I can do it with native list objects, but I'd rather use pandas if possible, since it handles large data sets more efficiently.
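For illustration, here is a minimal sketch of the "scan until the conditions are met" idea. It is not a built-in pandas feature: as far as I know, a vectorized comparison always evaluates every row it is given, so the sketch walks the frame in row chunks and stops at the first chunk that contains a match. The function name `first_match`, the `chunk_size` parameter, and the sample data are all made up for this example, and it assumes at least one condition is passed.

```python
from functools import reduce
import operator

import pandas as pd


def first_match(df, comp_cols, comp_vals, target_cols, chunk_size=100_000):
    """Return target_cols from the first row where every comp_col equals its comp_val.

    Hypothetical helper: scans df in row chunks and stops at the first chunk
    that contains a match. Returns None if no row matches.
    """
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        # One boolean Series per condition, AND-ed together (no string eval).
        mask = reduce(
            operator.and_,
            (chunk[col] == val for col, val in zip(comp_cols, comp_vals)),
        ).to_numpy()
        if mask.any():
            # argmax on a boolean array is the position of the first True.
            return chunk[target_cols].iloc[int(mask.argmax())]
    return None


# Made-up sample data, including a column name with a space in it.
example = pd.DataFrame(
    {
        "Some Column": ["a", "b", "b", "b"],
        "n": [1, 2, 2, 3],
        "out1": [10, 20, 30, 40],
        "out2": ["w", "x", "y", "z"],
    }
)
row = first_match(example, ["Some Column", "n"], ["b", 2], ["out1", "out2"], chunk_size=2)
print(row)
```

Because the conditions are built from the column and value lists directly, column names with spaces are not a problem here. The chunk size is a trade-off: smaller chunks exit earlier but add per-chunk overhead.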

Jamie Marshall
  • Tip: use `dfView.query(strExeString)` to execute the query directly on the dataframe without having to first call `pd.eval()`. – Martijn Pieters Jul 28 '18 at 18:40
  • @MartijnPieters, `pd.query` isn't feasible as it can't handle columns with illegal characters like spaces. Also, according to discussions I've seen on the GitHub tickets for `pd.query`, it actually operates similarly to `.eval` anyway. – Jamie Marshall Jul 28 '18 at 19:20
  • @MartijnPieters also, the question you linked to as a duplicate is completely different from mine. It doesn't address the fact that the table and the conditions are unknown, and the answer provided results in table scans. Essentially the answer provided in the linked question does exactly what my example does already. – Jamie Marshall Jul 28 '18 at 19:27
  • `DataFrame.query()` is a wrapper around `pandas.eval()`, it hands the expression over to `pandas.eval()`; if `pandas.eval()` can handle it, then `DataFrame.query()` can. The point is that you can do the work in one step (`DataFrame.query()` passes the `pandas.eval()` result to `.loc[..]` then to `[...]` if the first one failed). – Martijn Pieters Jul 28 '18 at 21:17
  • Yes, the other question does exactly what yours does, *because that's the option you have*. Note that the [second answer there](https://stackoverflow.com/a/40660242/100297) uses `DataFrame.query()`. – Martijn Pieters Jul 28 '18 at 21:19
  • As stated above, `query` is not an option because it can't handle as many variations as `eval`; see [here](https://github.com/pandas-dev/pandas/issues/6508). Also, if there is no way to read DataFrames without scanning, then that's the answer. Either way, the answer in the question you referenced does not solve the problem I've stated here. – Jamie Marshall Jul 28 '18 at 22:35
  • That same issue ***applies to `pandas.eval()`***, because `DataFrame.query()` delegates directly to `pandas.eval()`. There is zero difference in the parsing engines, because there is no different engine. – Martijn Pieters Jul 28 '18 at 22:55
  • See the [`DataFrame.query()` implementation](https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L2937-L2955); note the `res = self.eval(expr, **kwargs)` line. – Martijn Pieters Jul 28 '18 at 22:57
  • I haven't looked into the code for the backing engine yet, but you can easily test that one works and one doesn't. `myDataframe.query("Some Column == 'someVal'")` does not work while `myDataframe.eval("myDataframe[(myDataframe['Some Column'] == 'someVal')]")` does. – Jamie Marshall Jul 28 '18 at 23:06
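
The column-name point in the comments above can be checked with a small experiment. The sketch below assumes pandas 0.25 or later, where `DataFrame.query()` accepts backtick-quoted column names; that release postdates this thread, so it does not contradict what was observed at the time. The frame and its values are made up for the example.

```python
import pandas as pd

# Made-up frame with a column name that contains a space.
df = pd.DataFrame({"Some Column": ["someVal", "other", "someVal"], "x": [1, 2, 3]})

# Backtick quoting lets query() refer to the column despite the space
# (assumes pandas >= 0.25).
via_query = df.query("`Some Column` == 'someVal'")

# The same filter written as ordinary boolean indexing, for comparison.
via_mask = df[df["Some Column"] == "someVal"]

print(via_query.equals(via_mask))
```

Both forms still evaluate the condition against every row, so neither one addresses the early-exit part of the question.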

0 Answers