
(2/19/2019): I opened up a report in the numexpr tracker: https://github.com/pydata/numexpr/issues/331

The pandas report is: https://github.com/pandas-dev/pandas/issues/25369

Unless I'm doing something I'm not supposed to, the new extension dtypes for nullable integers appear to have a bug with the query method on DataFrame (the problem seems to be in the numexpr package):

import pandas as pd
import numpy as np

df_test = pd.DataFrame(data=[4, 5, 6], columns=["col_test"])
df_test = df_test.astype(dtype={"col_test": pd.Int32Dtype()})
df_test.query("col_test != 6")

The last lines of the long error message are:

File "...\site_packages\numexpr\necompiler.py", line 822, in evaluate
    zip(names, arguments)]
File "...\site_packages\numexpr\necompiler.py", line 821, in
    signature = [(name, getType(arg)) for (name, arg) in
File "...\site_packages\numexpr\necompiler.py", line 703, in getType
    raise ValueError("unknown type %s" % a.dtype.name)
ValueError: unknown type object

The non-extension dtypes work fine:

df_test = df_test.astype(dtype={"col_test": np.int32})
df_test.query("col_test != 6")

(P.S. As an entirely separate issue, passing the dtype directly to the pd.DataFrame constructor doesn't work either -- that seems buggy too.)

Thanks.

techvslife

1 Answer


Extension dtypes were first introduced in pandas 0.24, and there are still a lot of kinks to iron out.

That said, this looks like a compatibility issue between numexpr and pandas. It is definitely a bug, and until it is fixed, you will have to fall back to the 'python' engine:

df_test.query('col_test != 6', engine='python')

   col_test
0         4
1         5

(More information on query/eval: Dynamic Expression Evaluation in pandas using pd.eval())

Note, however, that you could just do

df_test.loc[df_test['col_test'] != 6]

   col_test
0         4
1         5

This is likely to be a lot faster, since engine='python' offers no performance benefit over loc.
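For completeness, here is a minimal sketch (re-using the frame from the question) showing that plain boolean indexing with loc works fine with the nullable Int32 dtype, since it never routes through numexpr:

```python
import pandas as pd

# Rebuild the frame from the question with the nullable Int32 extension dtype
df_test = pd.DataFrame(data=[4, 5, 6], columns=["col_test"])
df_test = df_test.astype({"col_test": pd.Int32Dtype()})

# Boolean indexing with .loc bypasses numexpr entirely,
# so the extension dtype causes no trouble here
result = df_test.loc[df_test["col_test"] != 6]
print(result)
```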

cs95
  • Thanks, that's helpful (and such a quick response!) Unfortunately query is a lot easier to use in my real (non-reported) case, partly because I'm mixing one multi-index column name with regular columns in the query. As far as I know, there's no easy way to use loc with that (e.g. if a, b are columns only inside a multi-index, and c a normal non-indexed column, I can query: df.query("b == 6 and c == 4")). But for now, I'll use engine='python' with the query and lose the speed. – techvslife Feb 19 '19 at 06:57
  • @techvslife I have a post on slicing multiIndexes that I think will be _super_ helpful to you: https://stackoverflow.com/q/53927460/4909087 please consider upvoting if you find it useful. – cs95 Feb 19 '19 at 07:00
  • I upvoted, as it's a very good post, but I don't see any easy equivalent to the query using loc. How would you do this? – techvslife Feb 19 '19 at 07:11
  • @techvslife If you are filtering on index level "b", you can do something like `df.index.get_level_values('b') == 6` to get a mask, and then proceed as usual. – cs95 Feb 19 '19 at 07:16
  • I need to filter on a mix of one column of the multi-index with a regular column. Even simple cases get unwieldy (where b column is index level 1: df[df["c"]==5].loc[df.index.get_level_values(1) == 35]. Imagine when one column is being compared to another (e.g. a query for where c > b). versus df.query("c > b") or df.query("c==5 and b==35") – techvslife Feb 19 '19 at 07:18
  • You could do `df.loc[(df["c"]==5) & (df.index.get_level_values(1) == 35)]` but I agree that `query` is definitely a lot simpler. – cs95 Feb 19 '19 at 07:23
  • Thanks, that is simpler and works, but I think we also agree it gets quite ugly as the tests increase in number... – techvslife Feb 19 '19 at 07:31
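To make the trade-off discussed in the comments concrete, here is a small sketch (the frame, the level names "a" and "b", and the column "c" are invented for illustration) comparing query against the loc equivalent when mixing an index level with a regular column:

```python
import pandas as pd

# Hypothetical frame: MultiIndex levels "a" and "b", plus a regular column "c"
idx = pd.MultiIndex.from_tuples([(1, 35), (1, 36), (2, 35)], names=["a", "b"])
df = pd.DataFrame({"c": [5, 5, 7]}, index=idx)

# query resolves index level names alongside column names
via_query = df.query("c == 5 and b == 35")

# loc equivalent: build the mask by hand from get_level_values
via_loc = df.loc[(df["c"] == 5) & (df.index.get_level_values("b") == 35)]

# Both approaches select the same rows
assert via_query.equals(via_loc)
```

As the comments note, the loc form works but grows unwieldy as the number of conditions increases, which is exactly why query is attractive here.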