
Here is how I encountered the warning:

df.loc[a_list][df.a_col.isnull()]

a_list is an Int64Index holding row labels, all of which belong to df.

The df.a_col.isnull() part is a condition I need for filtering.

If I execute the following commands individually, I do not get any warnings:

df.loc[a_list]
df[df.a_col.isnull()]

But if I chain them as df.loc[a_list][df.a_col.isnull()], I get the following warning (although a result is still returned):

Boolean Series key will be reindexed to match DataFrame index

What does this warning mean? Does it affect the result that is returned?

cottontail
Cheng

3 Answers


Your approach will work despite the warning, but it's best not to rely on implicit, unclear behavior.

Solution 1, make the selection of indices in a_list a boolean mask:

df[df.index.isin(a_list) & df.a_col.isnull()]

Solution 2, do it in two steps:

df2 = df.loc[a_list]
df2[df2.a_col.isnull()]

Solution 3, if you want a one-liner, use the fact that NaN is not equal to itself inside query():

df.loc[a_list].query('a_col != a_col')
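To check that the three solutions agree, here is a minimal sketch on made-up data. The frame, its labels, and the contents of a_list are assumptions chosen for illustration; only the three expressions come from the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: labels 10..14, with NaN at labels 11 and 13
df = pd.DataFrame({"a_col": [1.0, np.nan, 3.0, np.nan, 5.0]},
                  index=[10, 11, 12, 13, 14])
a_list = pd.Index([11, 12, 14])  # assumed subset of df's row labels

# Solution 1: a single boolean mask built on the full dataframe
sol1 = df[df.index.isin(a_list) & df.a_col.isnull()]

# Solution 2: select the rows first, then build the mask from the subset
df2 = df.loc[a_list]
sol2 = df2[df2.a_col.isnull()]

# Solution 3: one-liner relying on NaN != NaN
sol3 = df.loc[a_list].query("a_col != a_col")

print(sol1.equals(sol2) and sol2.equals(sol3))
```

Note that Solution 1 returns rows in the order of df, while Solutions 2 and 3 return them in the order of a_list, so the results can differ in row order when a_list is not sorted like df.index.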

The warning arises because the boolean vector df.a_col.isnull() has the length of df, while df.loc[a_list] has the length of a_list, i.e. it is shorter. Therefore, some index labels in df.a_col.isnull() are not in df.loc[a_list].

What pandas does is reindex the boolean Series on the index of the calling dataframe. In effect, it takes from df.a_col.isnull() only the values whose labels are in a_list. This works, but the behavior is implicit and could change in the future, which is what the warning is about.
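The implicit reindexing can be made explicit with Series.reindex. The sketch below uses made-up data (the frame and a_list are assumptions) and compares the chained expression from the question with a version where the mask is aligned by hand.

```python
import warnings

import numpy as np
import pandas as pd

# Hypothetical data: labels 10..14, with NaN at labels 11 and 13
df = pd.DataFrame({"a_col": [1.0, np.nan, 3.0, np.nan, 5.0]},
                  index=[10, 11, 12, 13, 14])
a_list = pd.Index([11, 12, 13])  # assumed subset of df's row labels

mask = df["a_col"].isnull()   # one entry per row of df (5 entries)
subset = df.loc[a_list]       # only 3 rows

# Chained form from the question: the mask is longer than the subset,
# so pandas aligns it to subset.index and (typically) emits the warning
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    implicit = subset[mask]

# The same alignment done by hand: the mask now has exactly subset's index
explicit = subset[mask.reindex(subset.index)]

print([str(w.message) for w in caught])
print(implicit.equals(explicit))
```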

IanS

If you get this warning, using .loc[] instead of [] avoids it (see footnote 1).

df.loc[boolean_mask]           # <--------- OK
df[boolean_mask]               # <--------- warning

For the particular case in the OP, you can chain .loc[] indexers:

df.loc[a_list].loc[df['a_col'].isna()]

or chain all conditions using and inside query():

# if a_list is a list of indices of df
df.query("index in @a_list and a_col != a_col")

# if a_list is a list of values in some other column such as b_col
df.query("b_col in @a_list and a_col != a_col")

or chain all conditions using & inside [] (as in @IanS's post).
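As a quick sanity check, the three variants can be compared on a small frame. This is only a sketch: the data and the contents of a_list are assumptions, and a_list is taken to be a plain list of row labels of df.

```python
import numpy as np
import pandas as pd

# Hypothetical data: labels 10..14, with NaN at labels 11 and 13
df = pd.DataFrame({"a_col": [1.0, np.nan, 3.0, np.nan, 5.0]},
                  index=[10, 11, 12, 13, 14])
a_list = [11, 12, 14]  # assumed row labels of df

# Chained .loc[] indexers: the second .loc aligns the longer mask silently
chained = df.loc[a_list].loc[df["a_col"].isna()]

# All conditions inside query(); `index` refers to the (unnamed) index of df
queried = df.query("index in @a_list and a_col != a_col")

# All conditions combined with & inside []
masked = df[df.index.isin(a_list) & df["a_col"].isna()]

print(chained.index.tolist(), queried.index.tolist(), masked.index.tolist())
```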


This warning occurs if

  • the index of the boolean mask is not in the same order as the index of the dataframe it is filtering.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'a_col':[1, 2, np.nan]}, index=[0, 1, 2])
    m1 = pd.Series([True, False, True], index=[2, 1, 0])
    df.loc[m1]       # <--------- OK
    df[m1]           # <--------- warning
    
  • the index of the boolean mask is a superset of the index of the dataframe it is filtering. For example:

    m2 = pd.Series([True, False, True, True], np.r_[df.index, 10])
    df.loc[m2]       # <--------- OK
    df[m2]           # <--------- warning
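
In both cases, another way to keep using [] without the warning is to align the mask to the dataframe's index first. The following is a sketch under that assumption, reusing the df, m1 and m2 from the examples above; turning warnings into errors is only there to show that none is raised.

```python
import warnings

import numpy as np
import pandas as pd

df = pd.DataFrame({"a_col": [1, 2, np.nan]}, index=[0, 1, 2])
m1 = pd.Series([True, False, True], index=[2, 1, 0])            # different order
m2 = pd.Series([True, False, True, True], index=[0, 1, 2, 10])  # superset

with warnings.catch_warnings():
    warnings.simplefilter("error")  # any warning would raise here
    # After reindexing, each mask has exactly df's index, in df's order
    r1 = df[m1.reindex(df.index)]
    r2 = df[m2.reindex(df.index)]

print(r1.index.tolist(), r2.index.tolist())
```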
    

1: Looking at the source code of [] and loc[], the only difference when the index of the boolean mask is a (weak) superset of the index of the dataframe is that [] emits this warning (via the _getitem_bool_array method) and loc[] does not.

cottontail

I came across this page after getting the same warning: I had built a boolean mask from the full dataframe but applied it to a subset of the data.

Here a subset is created and stored in the variable sub_df, then filtered again:

sub_df = df[df['a'] == 1]
sub_df = sub_df[df['b'] == 1] # Note "df" hiding here

Solution:

Be sure to use the same dataframe each time (in my case, only sub_df):

# Last line should instead be:
sub_df = sub_df[sub_df['b'] == 1]
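A small end-to-end sketch of the fix, on made-up data (the values in columns a and b are assumptions): the mask is built from sub_df itself, so its index matches exactly, and the result agrees with combining both conditions on the full dataframe.

```python
import pandas as pd

# Hypothetical data with the two columns used above
df = pd.DataFrame({"a": [1, 1, 2, 1], "b": [1, 2, 1, 1]})

sub_df = df[df["a"] == 1]            # rows 0, 1, 3
fixed = sub_df[sub_df["b"] == 1]     # mask built from sub_df: same index

# Equivalent single step on the full dataframe
combined = df[(df["a"] == 1) & (df["b"] == 1)]

print(fixed.index.tolist())
print(fixed.equals(combined))
```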

KJ Price