I am trying to calculate the mean for a new column.

data['english_combined'] = data['english'] + data['intake_english'] + data['language test scores formatted']

So the english_combined column is the sum of the other columns. Instead, I want to take the mean based on which grades are entered. For example, if only 'english' and 'intake_english' have a grade, I want the mean of those 2. If all 3 tests are taken, I want the mean of the 3 tests combined.

I tried something like this, with no success:

[np.mean(i,j,k) for i,j,k in zip(data['english'], data['intake_english'], data['language test scores formatted'])]

Any suggestions that would work?

pepijn

1 Answer

df.mean(axis='columns') does what you want. By default, it ignores NaNs (that is, it won't count them for the total when computing the average).

A simple example:

>>> df = pd.DataFrame({'a': [7, 8.5, pd.NA, 6], 
                       'b': [5, 6, 6, 7], 
                       'c': [7, pd.NA, pd.NA, 5]})
>>> df
      a  b     c
0     7  5     7
1   8.5  6  <NA>
2  <NA>  6  <NA>
3     6  7     5
>>> df.mean(axis='columns')
0    6.333333
1    7.250000
2    6.000000
3    6.000000
dtype: float64

Note how row 2 has 6 as its mean, not 2 (which is what 6 divided by all 3 columns would give). The same holds for row 1: its mean is (8.5 + 6) / 2 = 7.25.
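
To make the "ignore the missing values" rule concrete, here is a minimal plain-Python sketch of the arithmetic. The helper name mean_ignoring_missing is made up for illustration, and this is not how pandas implements it internally; it only mirrors the result for rows 1 and 2 above.

```python
import math


def mean_ignoring_missing(values):
    """Average only the values that are present; return NaN if none are."""
    # Treat both None and float NaN as "missing".
    present = [v for v in values if v is not None and not math.isnan(v)]
    if not present:
        return math.nan
    # Divide by the number of values present, not by len(values).
    return sum(present) / len(present)


row_1 = [8.5, 6, None]     # two grades entered
row_2 = [None, 6, None]    # one grade entered

print(mean_ignoring_missing(row_1))  # 7.25
print(mean_ignoring_missing(row_2))  # 6.0
```

The key point is the divisor: it is the count of grades actually entered, so a row with one grade keeps that grade as its mean.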

For your case, it would be something like:

data['english_combined'] = data[
            ['english', 'intake_english', 
             'language test scores formatted']].mean(axis='columns')
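
As a hedged, self-contained sketch of the same idea: the grades below are invented for illustration, and I am assuming the three columns are numeric with missing tests stored as NaN. The tests_taken column is an extra I added (not part of the question) to show how many grades went into each mean.

```python
import numpy as np
import pandas as pd

# Made-up grades; np.nan marks a test that was not taken.
data = pd.DataFrame({
    "english": [7.0, 8.0, np.nan, np.nan],
    "intake_english": [5.0, 6.0, 6.0, np.nan],
    "language test scores formatted": [9.0, np.nan, np.nan, np.nan],
})

cols = ["english", "intake_english", "language test scores formatted"]

# Row-wise mean; NaNs are skipped because skipna=True is the default.
data["english_combined"] = data[cols].mean(axis="columns")

# Optional: how many tests were actually taken in each row.
data["tests_taken"] = data[cols].notna().sum(axis="columns")

print(data[["english_combined", "tests_taken"]])
```

A row with no grades at all (the last one here) gets NaN as its mean, since there is nothing to average.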
9769953