I am trying to compare some columns on a numpy dataset to another set of columns that contain categorical data:
Gender | Grade | Score 1 | Score 2 | Score 3
M | A | 12 | 8 | 0
M | A | 8 | 11 | 4
M | B | 10 | 8 | 9
F | B | 12 | 2 | 2
F | B | 11 | 8 | 1
F | B | 1 | 4 | 2
What I need to do is compare each score column to Gender and Grade. Then calculate The average score for each category. So for example, mean of values in Score 1 who have gender M and mean for gender F, and mean for grade A and mean for grade B. I cannot define the categories (or Gender and Grade column names) explicitly in code as my actual datasets have a variable distribution of categorical values, some of additional columns with categorical data. I am able to calculate all the categorical values with np.unique()
which gives a full list. However, I am stuck as to how I would create a matrix using ufuncs, broadcasting, masking etc that would allow me to compare each of these columns to the categorical values without loops.
The output should be ideally a matrix which contains all the categories available in one column, with the associated means from other columns.
A B M F
Score1 mean(S1,A) mean(S1,B) mean(S1, M) mean(S1, F)
Score2 mean(S2,A) etc etc etc
Score3 mean(S3,A) etc etc etc