
Say I have a dataframe df with a column value holding some float values and some NaN. How can I get the part of the dataframe where we have NaN using the query syntax?

The following, for example, does not work:

df.query( '(value < 10) or (value == NaN)' )

I get name NaN is not defined (same for df.query('value ==NaN'))

Generally speaking, is there any way to use numpy names in query, such as inf, nan, pi, e, etc.?

Amelio Vazquez-Reina

7 Answers


According to this answer, you can use:

df.query('value < 10 | value.isnull()', engine='python')

I verified that it works.
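A minimal, self-contained check of this approach (the column name and data are illustrative, taken from the question's setup):

```python
import pandas as pd
import numpy as np

# engine='python' lets .query() call Series methods like .isnull()
df = pd.DataFrame({"value": [3.0, 4.0, 9.0, 10.0, 11.0, np.nan, 12.0]})
result = df.query("value < 10 | value.isnull()", engine="python")
print(result)
#    value
# 0    3.0
# 1    4.0
# 2    9.0
# 5    NaN
```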

Eric Ness
  • `engine='python'` is not even necessary in this case – Jakub Kukul Feb 28 '20 at 20:07
  • @JakubKukul it depends on the dtype of the column. For floats, which have native NaNs, it is not necessary, but if the column holds strings or ints, you do. – Davidmh Apr 28 '20 at 06:59
  • What is engine = 'python' doing in the script? – Liz Apr 13 '21 at 18:10
  • It interprets the expression as a Python expression instead of a pandas expression (numexpr). https://pandas.pydata.org/docs/reference/api/pandas.eval.html#pandas.eval – Eric Ness Apr 19 '21 at 00:20

In general, you could use @local_variable_name, so something like

>>> pi = np.pi; nan = np.nan
>>> df = pd.DataFrame({"value": [3,4,9,10,11,np.nan,12]})
>>> df.query("(value < 10) and (value > @pi)")
   value
1      4
2      9

would work, but NaN isn't equal to itself, so value == NaN will always be false. One way to hack around this is to use that very fact and treat value != value as an isnan check. We have

>>> df.query("(value < 10) or (value == @nan)")
   value
0      3
1      4
2      9

but

>>> df.query("(value < 10) or (value != value)")
   value
0      3
1      4
2      9
5    NaN
DSM

You can use the isna and notna Series methods, which is concise and readable.

import pandas as pd
import numpy as np

df = pd.DataFrame({'value': [3, 4, 9, 10, 11, np.nan, 12]})
available = df.query("value.notna()")
print(available)

#    value
# 0    3.0
# 1    4.0
# 2    9.0
# 3   10.0
# 4   11.0
# 6   12.0

not_available = df.query("value.isna()")
print(not_available)

#    value
# 5    NaN

In case you have numexpr installed, you need to pass engine="python" to make it work with .query. numexpr is recommended by pandas to speed up the performance of .query on larger datasets.

available = df.query("value.notna()", engine="python")
print(available)

Alternatively, you can use the top-level pd.isna function, by referencing it as a local variable. Again, passing engine="python" is required when numexpr is present.

import pandas as pd
import numpy as np


df = pd.DataFrame({'value': [3, 4, 9, 10, 11, np.nan, 12]})
df.query("@pd.isna(value)", engine="python")

#    value
# 5    NaN
Jarno

For rows where value is not null

df.query("value == value")

For rows where value is null

df.query("value != value")
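A short sketch of the self-comparison trick on a toy column (the data is illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"value": [3, 4, 9, 10, 11, np.nan, 12]})

# NaN != NaN, so these two queries split the frame into
# non-null and null rows.
not_null = df.query("value == value")  # rows where value is not NaN
null = df.query("value != value")      # rows where value is NaN

print(len(not_null), len(null))  # 6 1
```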
as - if
  • Nice! I believe this is what the post author wanted. – Veiga Dec 06 '18 at 14:56
  • Any explanation of how this is working? Kinda curious. – Dipanwita Mallick Dec 01 '20 at 00:02
  • @DipanwitaMallick in pandas/numpy NaN != NaN, so NaN is not equal to itself. – Vega Jul 29 '21 at 12:13
  • @DipanwitaMallick my comment is maybe a bit too short. In pandas/numpy NaN != NaN, so NaN is not equal to itself. To check whether a cell holds a NaN, you can check cell_value != cell_value: that is only true for NaNs (3 != 3 is False but NaN != NaN is True), and the query returns only the rows where the condition is True, i.e. the NaNs. To check that a cell is not NaN, you check cell_value == cell_value, which is only true for non-NaNs (3 == 3 is True but NaN == NaN is False). – Vega Jul 30 '21 at 11:06

Pandas fills empty cells in a DataFrame with NumPy's nan values. As it turns out, this has some funny properties. For one, nothing is equal to this kind of null, even itself. As a result, you can't search for it by checking for any particular equality.

In : 'nan' == np.nan
Out: False

In : None == np.nan
Out: False

In : np.nan == np.nan
Out: False

However, because a cell containing a np.nan value will not be equal to anything, including another np.nan value, we can check to see if it is unequal to itself.

In : np.nan != np.nan
Out: True

You can take advantage of this with the Pandas query method by simply searching for cells where the value in a particular column is unequal to itself.

df.query('a != a')

or

df[df['a'] != df['a']]
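Both forms select the same rows; a quick check with an illustrative column a:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1.0, np.nan, 3.0]})

# The query form and the boolean-mask form pick out the same NaN rows.
via_query = df.query("a != a")
via_mask = df[df["a"] != df["a"]]
print(via_query.equals(via_mask))  # True
```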
AreToo

This should also work: df.query("value == 'NaN'")

foadgr
  • Isn't that a string comparison? What if `value` is a float? It is not working for me for numerical columns. – Giorgio May 12 '22 at 21:15

I think other answers will normally be better. In one case, my query had to go through eval (use eval very carefully) and the syntax below was useful. Requiring a number to be both less than and greater than or equal to excludes all numbers, leaving only null-like values.

df = pd.DataFrame({'value':[3,4,9,10,11,np.nan, 12]})

df.query("value < 10 or (~(value < 10) and ~(value >= 10))")
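To sanity-check the complement trick: a NaN fails both value < 10 and value >= 10, so negating both conditions still keeps it (data as in the answer above):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"value": [3, 4, 9, 10, 11, np.nan, 12]})

# A NaN is neither < 10 nor >= 10, so the parenthesized clause
# is True exactly for the NaN rows.
result = df.query("value < 10 or (~(value < 10) and ~(value >= 10))")
print(result)
#    value
# 0    3.0
# 1    4.0
# 2    9.0
# 5    NaN
```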
James Page
  • Code-only answers are considered low quality: make sure to provide an explanation what your code does and how it solves the problem. It will help the asker and future readers both if you can add more information in your post. See [Explaining entirely code-based answers](https://meta.stackexchange.com/questions/114762/) – Calos Mar 23 '20 at 02:34