0

Yes this question has been asked many times! No, I have still not been able to figure out how to run this boolean filter without generating the Pandas SettingWithCopyWarning warning.

for x in range(len(df_A)):
    df_C = df_A.loc[(df_A['age'] >= df_B['age_limits'].iloc[x][0]) &
                    (df_A['age'] <= df_B['age_limits'].iloc[x][1])]

    df_D['count'].iloc[x] = len(df_C) # triggers warning

I've tried:

  • Copying df_A and df_B in every possible place
  • Using a mask
  • Using a query

I know I can suppress the warning, but I don't want to do that.

What am I missing? I know it's probably something obvious.

Many thanks!

cs_stackX
  • 1,407
  • 2
  • 19
  • 27
  • 1
    Not sure I follow, but it doesn't make sense to me that you do `df_A.loc` based on a different data frame, i.e. `db_B`. – Quang Hoang Apr 05 '19 at 16:37
  • IIUC, the problem is to select the column of `df_D` and then use `iloc`. I would say if before the loop, you do `pos_col_D = df_D.columns.get_loc['count'] ` and then in the loop do `df_D.iloc[x,pos_col_D ] = len(df_C)` should remove the warning. – Ben.T Apr 05 '19 at 17:35
  • @Ben.T that worked. If you want to put it as an answer I will accept. Would be awesome if you can explain why it works too :) – cs_stackX Apr 08 '19 at 16:16

1 Answers1

1

For more details on why you got SettingWithCopyWarning, I would suggest you to read this answer. It is mostly because selecting the columns df_D['count'] and then using iloc[x] does a "chained assignment" that is flagged this way.

To prevent it, you can get the position of the column you want in df_D and then use iloc for both the row and the column in the loop for:

pos_col_D = df_D.columns.get_loc['count']
for x in range(len(df_A)):
    df_C = df_A.loc[(df_A['age'] >= df_B['age_limits'].iloc[x][0]) &
                    (df_A['age'] <= df_B['age_limits'].iloc[x][1])]

    df_D.iloc[x,pos_col_D ] = len(df_C) #no more warning

Also, because you compare all the values of df_A.age with the bounds of df_B.age_limits, I think you could improve the speed of your code using numpy.ufunc.outer, with ufunc being greater_equal and less_egal, and then sum over the axis=0.

#Setup
import numpy as np
import pandas as pd
df_A = pd.DataFrame({'age': [12,25,32]})
df_B = pd.DataFrame({'age_limits':[[3,99], [20,45], [15,30]]})

#your result
for x in range(len(df_A)):
    df_C = df_A.loc[(df_A['age'] >= df_B['age_limits'].iloc[x][0]) &
                    (df_A['age'] <= df_B['age_limits'].iloc[x][1])]
    print (len(df_C))
3
2
1

#with numpy
print ( ( np.greater_equal.outer(df_A.age, df_B.age_limits.str[0])
         & np.less_equal.outer(df_A.age, df_B.age_limits.str[1]))
        .sum(0) )
array([3, 2, 1])

so you can assign the previous line of code directly in df_D['count'] without loop for. Hope this work for you

Ben.T
  • 29,160
  • 6
  • 32
  • 54