
Say I have a dataframe df with a column value holding some float values and some NaN. How can I get the part of the dataframe where we have NaN using the query syntax?

The following, for example, does not work:

df.query( '(value < 10) or (value == NaN)' )

I get name NaN is not defined (same for df.query('value ==NaN'))

Generally speaking, is there any way to use numpy names in query, such as inf, nan, pi, e, etc.?

Amelio Vazquez-Reina

7 Answers


According to this answer, you can use:

df.query('value < 10 | value.isnull()', engine='python')

I verified that it works.
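A minimal, self-contained check of this approach (the column name and data are illustrative, taken from the question's setup):

```python
import pandas as pd
import numpy as np

# engine='python' lets .query() call Series methods like .isnull()
df = pd.DataFrame({"value": [3.0, 4.0, 9.0, 10.0, 11.0, np.nan, 12.0]})
result = df.query("value < 10 | value.isnull()", engine="python")
print(result)
#    value
# 0    3.0
# 1    4.0
# 2    9.0
# 5    NaN
```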

Eric Ness
  • `engine='python'` is not even necessary in this case – Jakub Kukul Feb 28 '20 at 20:07
  • @JakubKukul it depends on the dtype of the column. For floats, which have native NaNs, it is not necessary, but if the column holds strings or ints, you do. – Davidmh Apr 28 '20 at 06:59
  • What is engine = 'python' doing in the script? – Liz Apr 13 '21 at 18:10
  • It interprets the expression as a Python expression instead of a pandas expression (numexpr). https://pandas.pydata.org/docs/reference/api/pandas.eval.html#pandas.eval – Eric Ness Apr 19 '21 at 00:20

In general, you could use @local_variable_name, so something like

>>> pi = np.pi; nan = np.nan
>>> df = pd.DataFrame({"value": [3,4,9,10,11,np.nan,12]})
>>> df.query("(value < 10) and (value > @pi)")
   value
1      4
2      9

would work, but NaN isn't equal to itself, so value == NaN will always be false. One way to hack around this is to use that very fact and treat value != value as an isnan check. We have

>>> df.query("(value < 10) or (value == @nan)")
   value
0      3
1      4
2      9

but

>>> df.query("(value < 10) or (value != value)")
   value
0      3
1      4
2      9
5    NaN
DSM

You can use the isna and notna Series methods, which is concise and readable.

import pandas as pd
import numpy as np

df = pd.DataFrame({'value': [3, 4, 9, 10, 11, np.nan, 12]})
available = df.query("value.notna()")
print(available)

#    value
# 0    3.0
# 1    4.0
# 2    9.0
# 3   10.0
# 4   11.0
# 6   12.0

not_available = df.query("value.isna()")
print(not_available)

#    value
# 5    NaN

In case you have numexpr installed, you need to pass engine="python" to make it work with .query. numexpr is recommended by pandas to speed up the performance of .query on larger datasets.

available = df.query("value.notna()", engine="python")
print(available)

Alternatively, you can use the top-level pd.isna function, by referencing it as a local variable. Again, passing engine="python" is required when numexpr is present.

import pandas as pd
import numpy as np


df = pd.DataFrame({'value': [3, 4, 9, 10, 11, np.nan, 12]})
df.query("@pd.isna(value)", engine="python")

#    value
# 5    NaN
Jarno

For rows where value is not null

df.query("value == value")

For rows where value is null

df.query("value != value")
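A short sketch of the self-comparison trick on a toy column (the data is illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"value": [3, 4, 9, 10, 11, np.nan, 12]})

# NaN != NaN, so these two queries split the frame into
# non-null and null rows.
not_null = df.query("value == value")  # rows where value is not NaN
null = df.query("value != value")      # rows where value is NaN

print(len(not_null), len(null))  # 6 1
```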
as - if
  • Nice! I believe this is what the post author wanted. – Veiga Dec 06 '18 at 14:56
  • Any explanation of how this is working? Kinda curious. – Dipanwita Mallick Dec 01 '20 at 00:02
  • @DipanwitaMallick in pandas/numpy NaN != NaN, so NaN is not equal to itself. – Vega Jul 29 '21 at 12:13
  • @DipanwitaMallick my comment is maybe a bit too short. In pandas/numpy NaN != NaN, so NaN is not equal to itself. To check whether a cell holds a NaN, you can check cell_value != cell_value: that is only true for NaNs (3 != 3 is False but NaN != NaN is True), and the query returns only the rows where the condition is True, i.e. the NaNs. To check that a cell is not NaN, you check cell_value == cell_value, which is only true for non-NaNs (3 == 3 is True but NaN == NaN is False). – Vega Jul 30 '21 at 11:06

Pandas fills empty cells in a DataFrame with NumPy's nan values. As it turns out, this has some funny properties. For one, nothing is equal to this kind of null, even itself. As a result, you can't search for it by checking for any particular equality.

In : 'nan' == np.nan
Out: False

In : None == np.nan
Out: False

In : np.nan == np.nan
Out: False

However, because a cell containing a np.nan value will not be equal to anything, including another np.nan value, we can check to see if it is unequal to itself.

In : np.nan != np.nan
Out: True

You can take advantage of this with the Pandas query method by simply searching for cells where the value in a particular column is unequal to itself.

df.query('a != a')

or

df[df['a'] != df['a']]
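Both forms select the same rows; a quick check with an illustrative column a:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1.0, np.nan, 3.0]})

# The query form and the boolean-mask form pick out the same NaN rows.
via_query = df.query("a != a")
via_mask = df[df["a"] != df["a"]]
print(via_query.equals(via_mask))  # True
```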
AreToo

This should also work: df.query("value == 'NaN'")

foadgr
  • Isn't that a string comparison? What if `value` is a float? It is not working for me for numerical columns. – Giorgio May 12 '22 at 21:15

I think other answers will normally be better. In one case, my query had to go through eval (use eval very carefully) and the syntax below was useful. Requiring a number to be both less than and greater than or equal to excludes all numbers, leaving only null-like values.

df = pd.DataFrame({'value':[3,4,9,10,11,np.nan, 12]})

df.query("value < 10 or (~(value < 10) and ~(value >= 10))")
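To sanity-check the complement trick: a NaN fails both value < 10 and value >= 10, so negating both conditions still keeps it (data as in the answer above):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"value": [3, 4, 9, 10, 11, np.nan, 12]})

# A NaN is neither < 10 nor >= 10, so the parenthesized clause
# is True exactly for the NaN rows.
result = df.query("value < 10 or (~(value < 10) and ~(value >= 10))")
print(result)
#    value
# 0    3.0
# 1    4.0
# 2    9.0
# 5    NaN
```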
James Page
  • Code-only answers are considered low quality: make sure to provide an explanation what your code does and how it solves the problem. It will help the asker and future readers both if you can add more information in your post. See [Explaining entirely code-based answers](https://meta.stackexchange.com/questions/114762/) – Calos Mar 23 '20 at 02:34