2

I am trying to compare some columns on a numpy dataset to another set of columns that contain categorical data:

    Gender | Grade | Score 1 | Score 2 | Score 3
    M      | A     | 12      | 8       | 0
    M      | A     | 8       | 11      | 4
    M      | B     | 10      | 8       | 9
    F      | B     | 12      | 2       | 2
    F      | B     | 11      | 8       | 1
    F      | B     | 1       | 4       | 2

What I need to do is compare each score column to Gender and Grade. Then calculate The average score for each category. So for example, mean of values in Score 1 who have gender M and mean for gender F, and mean for grade A and mean for grade B. I cannot define the categories (or Gender and Grade column names) explicitly in code as my actual datasets have a variable distribution of categorical values, some of additional columns with categorical data. I am able to calculate all the categorical values with np.unique() which gives a full list. However, I am stuck as to how I would create a matrix using ufuncs, broadcasting, masking etc that would allow me to compare each of these columns to the categorical values without loops.

The output should be ideally a matrix which contains all the categories available in one column, with the associated means from other columns.

            A           B           M           F
Score1      mean(S1,A)  mean(S1,B)  mean(S1, M) mean(S1, F)
Score2      mean(S2,A)  etc         etc         etc
Score3      mean(S3,A)  etc         etc         etc
RebeccaKennedy
  • 121
  • 1
  • 10

1 Answers1

2

Pivot and concat i.e

one = df.pivot_table(columns=['Gender'],aggfunc='mean')
two = df.pivot_table(columns=['Grade'],aggfunc='mean')
main  = pd.concat([one,two],1)

                  F            M         A        B     
Score 1     8.000000    10.000000     10.0      8.5
Score 2     4.666667     9.000000      9.5      5.5
Score 3     1.666667     4.333333      2.0      3.5

If you want a one line solution then:

main = pd.concat([df.pivot_table(columns=i) for i in ['Grade','Gender']],1) # By default `aggfunc` is mean

Well if you want only the scores to be index and the rest as columns :

cols = df.columns[~df.columns.str.contains('Score')]
# Index(['Gender', 'Grade'], dtype='object')
ndf = pd.concat([df.pivot_table(columns=i) for i in cols],1)
Bharath M Shetty
  • 30,075
  • 6
  • 57
  • 108
  • Thanks! Is there anyway to write this without explicitly using Gender and Grade? – RebeccaKennedy Jan 02 '18 at 13:42
  • How do you mean without using Gender and Grade explicitly? Using their indices ? This is what you want right. – Bharath M Shetty Jan 02 '18 at 13:44
  • Its exactly on point! Lets say however that my dataset contains x more columns each with more categorical variables. And the X is varying between different datasets so I cannot define Gender or Grade explicitly (as in a different dataset it may be colour, blood type and some other categorical variable). I can only extract them with np.unique() and then loop through them with the above code. Is there a way to input them as well without loops? I hope thats clear. – RebeccaKennedy Jan 02 '18 at 13:50
  • @jazz090 check out the last edit. Hope that makes sense – Bharath M Shetty Jan 02 '18 at 13:53
  • Thanks! I know I am pushing it a bit but anyway to do it without using loops as you have above? I'll tick the above anyway but I am just very curious on eliminating loops. – RebeccaKennedy Jan 02 '18 at 13:55
  • I would say you cant since you need to be specific about each column. You can directly use pivot on both columns but that wont lead you to the expected output you are asking. Let me ask you how many columns do to have in real data that are other than scores? – Bharath M Shetty Jan 02 '18 at 13:55
  • The Gender and Grade columns are examples. In reality the end user would be entering a dataset for this thing to analyse it. They may enter 2 columns, they may enter 10. Who knows. The code just needs to account for the fact that these two columns can vary by the number of columns and the number of categorical data inside them. I was just wondering if there was a elegant way of doing it without loops. What do you mean by a vote? – RebeccaKennedy Jan 02 '18 at 14:04
  • @jazz090 just clicking the up arrow above the check mark :). I cant help unless I see your real data. – Bharath M Shetty Jan 02 '18 at 14:06
  • I already did! Its more of an abstract question really, I saw this vid and it blew my mind https://www.youtube.com/watch?v=EEUXKG97YRw. I was just trying to apply it to my work. No worries tough, I will have a look at the pivot table post you linked to. Thanks for your help. – RebeccaKennedy Jan 02 '18 at 14:08
  • @jazz090 I love vectorizing the solution, you can check one here https://stackoverflow.com/questions/48061793/alternatives-to-pandas-apply-due-to-memoryerror/48061972#48061972 apply is the builtin for loop of pandas. If i could vectorize it I definitely would have done that. – Bharath M Shetty Jan 02 '18 at 14:12