3

I am having a student examination dataset such as follows,

userid      grade   examid  subject     numberofcorrectanswers  numberofwronganswers
4           5       8       Synonyms            NULL                    NULL
4           5       8       Sentence            NULL                    NULL
4           5       8       Whole Numbers       6                       15
4           5       8       Decimals            4                       10
5           5       9       Synonyms            NULL                    NULL
5           5       9       Sentence            NULL                    NULL
5           5       9       Whole Numbers       5                       12
5           5       9       Decimals            3                       1

I want to convert this long format to wide format, where I can have the data as,

userid      grade   examid      Synonyms_numberofcorrectanswers         Synonyms_numberofwronganswers       Sentence_numberofcorrectanswers         Sentence_numberofwronganswers       Whole_numbers_numberofcorrectanswers        Whole_numbers_numberofwronganswers              Decimals_numberofcorrectanswers         Decimals_numberofwronganswers
4           5           8               NULL                                    NULL                                NULL                                    NULL                                6                                           15                                          4                                           10
5           5           9               NULL                                    NULL                                NULL                                    NULL                                5                                           12                                          3                                            1

The following are my trying,

data_subset.set_index(['userid', 'grade','examid','subject']).unstack('subject').reset_index()

But this is not coming in a single flat dataframe. There are several hierarchies inside it. Can anybody help me in making it a single flat dataframe?

Thanks

haimen
  • 1,985
  • 7
  • 30
  • 53

2 Answers2

2

Something like this?

>>> df.groupby(['userid', 'grade','examid','subject']).sum().unstack('subject')
                    numberofcorrectanswers                                 numberofwronganswers                                
subject                           Decimals Sentence Synonyms Whole Numbers             Decimals Sentence Synonyms Whole Numbers
userid grade examid                                                                                                            
4      5     8                           4      NaN      NaN             6                   10      NaN      NaN            15
5      5     9                           3      NaN      NaN             5                    1      NaN      NaN            12
Alexander
  • 105,104
  • 32
  • 201
  • 196
  • If I try this, there is a hierarchy in number of correct answers and number of wronganswers column and inside which everything is placed. I am looking like a output which would give a single level of hierarchy output as I have shown in the sample output. Can you please help me in doing that? – haimen Apr 17 '16 at 01:10
  • But those values are null. Do you just want to sum each of Decimals, Sentence, Synonyms and Whole Numbers, e.g. 6 + nan + nan + 4 =10 for upper left cell? – Alexander Apr 17 '16 at 03:45
1

I'll expand Alexander's answer. Say we have

df2 = df.groupby(['userid', 'grade','examid','subject']).sum().unstack('subject')

We get the names of the two-level column index as a list of 2-tuples with df2.columns.get_values(). To flatten it and combine the names:

new_col_names = ['_'.join((b,a)) for a,b in df2.columns.get_values()]
df2.columns = new_col_names

If needed:

  • to sort columns: for example df2.reindex(columns = sorted(df2.columns))

  • to set userid etc. as columns rather than a multi-index: df2.reset_index()

ptrj
  • 5,152
  • 18
  • 31