0

I have data that is in 3 columns (name, question, response) that resulted from judging a student research symposium. There are 2 possible choices for level (graduate/undergraduate), and 5 for college (science, education, etc.)

What I need to do is take the average of the numerical responses for a given name, and sort by average numerical score for that person to output a table containing:

College  Level    Rank
science  graduate  1st
science  graduate  2nd
science  graduate  3rd

science undergrad 1st
...

education graduate 1st
...
education undergrad 3rd

Here's a sample data table:

name          question  response
Smith, John     Q1      10
Smith, John     Q2       7
Smith, John     Q3      10
Smith, John     Q4      8
Smith, John     Q5      10
Smith, John     Q8      8
Smith, John     level   graduate
Smith, John     colleg  science
Smith, John     Q9      4
Jones, Mary     Q3      10
Jones, Mary     Q2      10
Jones, Mary     Q1      10
Jones, Mary     Q4      10
Jones, Mary     level   undergraduate
Jones, Mary     college education
Jones, Mary     Q6      10
Jones, Mary     Q7      10
Jones, Mary     Q9      10

A talented student did this for us in excel using pivot tables, but I'm sure this can be done using pandas, and I'm very curious how to do it (I'm quite new to pandas). The tricky part is all the 'useful' information is in that 3rd column.

Matt Wood
  • 111
  • 1
  • 1
  • 7
  • 1
    It is really unclear to me .., maybe asking help from the student again ? – BENY Apr 06 '18 at 14:48
  • 1
    See [this post](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and try to provide a [mcve] that shows sample input and the desired output. Your question, as it's written, is unclear. – pault Apr 06 '18 at 14:53
  • The questions without answer, should be considered as 0 points for that question or the result should be the mean excluding this questions? – OriolAbril Apr 06 '18 at 15:55

2 Answers2

1

convert response column to numeric, the string will be na then groupby and aggregate

import pandas as pd
import numpy as np
import StringIO


data = '''name;question;response
Smith, John;Q1;10
Smith, John;Q2;7
Smith, John;Q3;10
Smith, John;Q4;8
Smith, John;Q5;10
Smith, John;Q8;8
Smith, John;level;graduate
Smith, John;colleg;science
Smith, John;Q9;4
Jones, Mary;Q3;10
Jones, Mary;Q2;10
Jones, Mary;Q1;10
Jones, Mary;Q4;10
Jones, Mary;level;undergraduate
Jones, Mary;college;education
Jones, Mary;Q6;10
Jones, Mary;Q7;10
Jones, Mary;Q9;10'''

df = pd.read_csv(StringIO.StringIO(data), delimiter=';')
df['response'] = pd.to_numeric(df['response'], errors='coerce')
df.groupby('name').agg(np.mean).reset_index().sort_values(by='response')

output

    name    response
1   Smith, John 8.142857
0   Jones, Mary 10.000000
galaxyan
  • 5,944
  • 2
  • 19
  • 43
  • This is close - I also need to have 'graduate / undergraduate' in there for each college, but maybe I can take it from here. Apologies my sample data didn't have 4 students like it should have (one grad one undergrad from each of the 2 colleges). – Matt Wood Apr 06 '18 at 16:22
  • @MattWood I think in that case, you have to concat numeric data with each level of education, then do the groupby using name and education level – galaxyan Apr 06 '18 at 16:26
0

You can start pivoting the dataframe in order to obtain the info for the college and level and the question marks in different dataframes:

pivoted = df.pivot_table(index='name',
                         columns='question',
                         values='response',
                         fill_value=np.nan, # Changing this value to 0 would consider 
                                            # unanswered questions as 0 score
                         aggfunc=lambda x: x)
categories = pivoted[['college','level']]
questions = pivoted.drop(['college','level'],axis=1)

And set the question mark average to each student in the categories dataframe:

categories['points'] = questions.astype(float).mean(axis=1,skipna=True)

The skipna=True combined with the fill_value=np.nan makes that unanswered questions will not compute in the average, thus, a student with only one answer which is 10 will have mean 10. As commented, fill_value=0 modifies this behaviour.

Eventually, the values can be sorted using sort_values in order to have a ranking for each category:

categories.sort_values(['college','level','points'],ascending=False)
OriolAbril
  • 7,315
  • 4
  • 29
  • 40
  • @MattWood does this return the desired output? If it doesn't, could you comment why it does not? I'd be glad to edit the question so it properly works – OriolAbril May 14 '18 at 14:52