I'm trying to break a DataFrame into four parts and impute rounded mean values for each part using fillna(). I want to filter on two columns, main_campus and degree_type, which have two unique values each. So between them I should be able to filter the DataFrame into four groups.

I first did this with a triple for loop (see below), which seems to work, but when I tried to do it in a more elegant way, I got a SettingWithCopy warning that I couldn't fix by using .loc or .copy(), and the missing values wouldn't be filled even when inplace was set to True. Here's the code for the latter method:

# Imputing mean values for main campus BA students
df[(df.main_campus == 1) &
            (df.degree_type == 'BA')] = df[(df.main_campus == 1) &
            (df.degree_type == 'BA')].fillna(
                df[(df.main_campus == 1) &
                            (df.degree_type == 'BA')
                            ].mean(),
                     inplace=True)
            
# Imputing mean values for main campus BS students
df[(df.main_campus == 1) &
            (df.degree_type == 'BS')] = df[(df.main_campus == 1) &
            (df.degree_type == 'BS')].fillna(
                df[(df.main_campus == 1) &
                            (df.degree_type == 'BS')
                            ].mean(),
                     inplace=True)
            
# Imputing mean values for downtown campus BA students
df[(df.main_campus == 0) &
            (df.degree_type == 'BA')] = df[(df.main_campus == 0) &
            (df.degree_type == 'BA')].fillna(
                df[(df.main_campus == 0) &
                            (df.degree_type == 'BA')
                            ].mean(),
                     inplace=True)

# Imputing mean values for downtown campus BS students          
df[(df.main_campus == 0) &
            (df.degree_type == 'BS')] = df[(df.main_campus == 0) &
            (df.degree_type == 'BS')].fillna(
                df[(df.main_campus == 0) &
                            (df.degree_type == 'BS')
                            ].mean(),
                     inplace=True)      

I should mention the previous code went through several iterations, trying it without setting it back to the slice, with and without inplace, etc.

Here's the code with the triple for loop that works:

imputation_cols = [...]  # all the columns I want to impute

for col in imputation_cols:

  for i in [1, 0]:

    for path in ['BA', 'BS']:

      group = df.loc[((df.main_campus == i) &
                               (df.degree_type == path)), :]
      
      group = group.fillna(value=round(group.mean()))

      df.loc[((df.main_campus == i) &
                               (df.degree_type == path)), :] = group

It's worth mentioning that I think the use of the group variable in the triple for loop code is also to help the filled NaN values actually get set back to the DataFrame, but I would need to double check this.

Does anyone have an idea for what's going on here?

Henry Ecker
semblable

1 Answer

A good way to approach such a problem is to simplify your code, which makes it easier to find the source of the warning:

group1 = (df.main_campus == 1) & (df.degree_type == 'BA')
group2 = (df.main_campus == 1) & (df.degree_type == 'BS')
group3 = (df.main_campus == 0) & (df.degree_type == 'BA')
group4 = (df.main_campus == 0) & (df.degree_type == 'BS')

# Imputing mean values for main campus BA students
df.loc[group1, :] = df.loc[group1, :].fillna(df.loc[group1, :].mean())  # repeat for other groups

Now you can see the problem more clearly. You compute the mean from a slice of df and write the filled values back to that same slice in one statement. Pandas issues a warning because the slice you use to compute the mean could be inconsistent with the changed dataframe. In your case it produces the correct result, but the consistency of your dataframe is at risk.

You could solve this by computing the mean beforehand:

group1_mean = df.loc[group1, :].mean()
df.loc[group1, :] = df.loc[group1, :].fillna(group1_mean)

In my opinion, computing the mean first makes the code clearer. But you still have four groups (group1, group2, ...), which is a clear sign that you should use a loop:

from itertools import product

for campus, degree in product([1, 0], ['BS', 'BA']):
    group = (df.main_campus == campus) & (df.degree_type == degree)
    group_mean = df.loc[group, :].mean()
    df.loc[group, :] = df.loc[group, :].fillna(group_mean)

I have used product from itertools to get rid of the ugly nested loop. It is quite similar to your "inelegant" first solution. So you were almost there the first time.
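Since the loop above was not tested against real data, here is a minimal runnable sketch of the same idea on a made-up frame. The columns gpa and credits, the sample values, and the contents of imputation_cols are assumptions for illustration only. Two details differ from the loop above: the mean is rounded, as the question asks, and only the numeric columns are selected, because in recent pandas versions calling .mean() on rows that still include a string column such as degree_type can raise a TypeError.

```python
from itertools import product

import numpy as np
import pandas as pd

# Hypothetical sample data; "gpa" and "credits" are made-up column names.
df = pd.DataFrame({
    "main_campus": [1, 1, 1, 1, 0, 0, 0, 0],
    "degree_type": ["BA", "BA", "BS", "BS", "BA", "BA", "BS", "BS"],
    "gpa": [3.0, np.nan, 2.6, np.nan, 4.0, np.nan, 1.0, np.nan],
    "credits": [10.0, 20.0, np.nan, 31.0, np.nan, 15.0, 12.0, np.nan],
})
imputation_cols = ["gpa", "credits"]  # assumed list of columns to impute

# product yields every (campus, degree) pair, replacing the nested loops.
print(list(product([1, 0], ["BS", "BA"])))
# [(1, 'BS'), (1, 'BA'), (0, 'BS'), (0, 'BA')]

for campus, degree in product([1, 0], ["BS", "BA"]):
    group = (df.main_campus == campus) & (df.degree_type == degree)
    # Compute the rounded per-column mean first, using numeric columns only.
    group_mean = df.loc[group, imputation_cols].mean().round()
    # Then fill the missing values of this group and write them back.
    df.loc[group, imputation_cols] = df.loc[group, imputation_cols].fillna(group_mean)

print(df)
```

In this toy frame the missing gpa of the main campus BS group is filled with 3.0, the rounded value of the only observed gpa (2.6) in that group.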

We ended up with four lines of code and a loop. I am sure with some pandas magic you could convert it to one line. However, you will still understand these four lines in a week or a month or a year from now. Also, other people reading your code will understand it easily. Readability counts.
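For reference, the "pandas magic" version could look roughly like the sketch below. It is not part of the original answer: it swaps the explicit loop for groupby(...).transform("mean"), which returns the group mean for every row, and then passes that frame to fillna. The sample frame and the column names gpa and credits are again made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; "gpa" and "credits" are made-up column names.
df = pd.DataFrame({
    "main_campus": [1, 1, 1, 1, 0, 0, 0, 0],
    "degree_type": ["BA", "BA", "BS", "BS", "BA", "BA", "BS", "BS"],
    "gpa": [3.0, np.nan, 2.6, np.nan, 4.0, np.nan, 1.0, np.nan],
    "credits": [10.0, 20.0, np.nan, 31.0, np.nan, 15.0, 12.0, np.nan],
})
imputation_cols = ["gpa", "credits"]  # assumed list of columns to impute

# One row per original row, holding the rounded mean of that row's group.
group_means = (
    df.groupby(["main_campus", "degree_type"])[imputation_cols]
    .transform("mean")
    .round()
)

# fillna aligns on index and columns, so each NaN gets its own group's mean.
df[imputation_cols] = df[imputation_cols].fillna(group_means)

print(df)
```

Whether this is easier to read than the loop is a matter of taste. It avoids listing the group values by hand, but it hides the per-group logic inside transform.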


Disclaimer: I could not test the code since you did not provide a sample dataframe. So my code may throw an error because of a typo. A minimal reproducible example makes it so much easier to answer questions. Please consider this the next time you post a question on SO.

above_c_level
  • Thanks so much for this answer! Does writing the mean back to the df in this case produce a correct result because it's the _mean_ specifically, and adding values that are the mean mathematically won't change the mean for the column as a whole? – semblable Jul 14 '20 at 20:27
  • Yes, I think writing back the mean should produce the correct result, because it is the mean. For other operations (e.g. sum) the expected behavior is not clear. With regard to the df: Sharing a dataframe is quite simple. Just use print(df) and copy it in the question. For more information [see this SO post](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – above_c_level Jul 15 '20 at 07:44