taking mean over multiple columns

Question

I am trying to calculate the mean for a new column.

data['english_combined'] = data['english'] + data['intake_english'] + data['language test scores formatted']

So the english_combined column is the sum of the other columns. Instead, I want to take the mean based on which grades are entered. For example, if only 'english' and 'intake_english' have a grade, I want the mean of those 2; if all 3 tests are taken, I want the mean of the 3 tests combined.

I tried something like this, with no success:

[np.mean(i,j,k) for i,j,k in zip(data['english'], data['intake_english'], data['language test scores formatted'])]

Any suggestions that would work?

Thanks, that makes it clearer. If no grade is entered for a record, what is its value? NaN? Because the mean function has the option to ignore nan values, and adjust accordingly (i.e., average over two cells instead of three). — 9769953, Apr 21 '22 at 11:54
Use `df.mean(axis='columns')`: `skipna` is `True` by default. — 9769953, Apr 21 '22 at 11:56

score 0 · Accepted Answer · answered Apr 21 '22 at 11:57

`df.mean(axis='columns')` does what you want. By default (`skipna=True`) it ignores NaNs: missing values are left out of both the sum and the count, so each row is averaged over only the grades that are present.

A simple example:

>>> import pandas as pd
>>> df = pd.DataFrame({'a': [7, 8.5, pd.NA, 6],
                       'b': [5, 6, 6, 7],
                       'c': [7, pd.NA, pd.NA, 5]})
>>> df
      a  b     c
0     7  5     7
1   8.5  6  <NA>
2  <NA>  6  <NA>
3     6  7     5
>>> df.mean(axis='columns')
0    6.333333
1    7.250000
2    6.000000
3    6.000000
dtype: float64

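Note how row 2 has 6 as its mean, not 2 (which is what 6 divided by 3 would give). Similarly, row 1 is averaged over its two available values (7.25), not over three.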

For your case, it would be something like

data['english_combined'] = data[
            ['english', 'intake_english', 
             'language test scores formatted']].mean(axis='columns')
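If you want to check this on your own column names, here is a minimal sketch. The sample grades are made up, and it assumes that a missing grade is stored as NaN (or as something `pd.to_numeric` can coerce to NaN); the `tests_taken` column is a hypothetical extra, added only to show how many grades went into each mean.

```python
import numpy as np
import pandas as pd

# Made-up sample data; NaN marks a test that was not taken (assumption).
data = pd.DataFrame({
    'english': [7.0, 8.0, np.nan],
    'intake_english': [5.0, 6.0, 6.0],
    'language test scores formatted': [6.0, np.nan, np.nan],
})

cols = ['english', 'intake_english', 'language test scores formatted']

# Coerce each column to numeric; anything non-numeric becomes NaN.
grades = data[cols].apply(pd.to_numeric, errors='coerce')

# Row-wise mean over the grades that are present (skipna=True is the default).
data['english_combined'] = grades.mean(axis='columns')

# Hypothetical helper column: number of non-missing grades per row.
data['tests_taken'] = grades.count(axis='columns')

print(data[['english_combined', 'tests_taken']])
```

The `pd.to_numeric` step is only needed if the columns might contain strings or other placeholders; if they are already numeric with NaN for missing grades, the one-liner above is enough.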
