5

Using pandas 0.11 on python 2.7.3 I am trying to pivot a simple dataframe with the following values:

    StudentID QuestionID Answer DateRecorded
0        1234        bar      a   2012/01/21
1        1234        foo      c   2012/01/22
2        4321        bop      a   2012/01/22
3        5678        bar      a   2012/01/24
4        8765        baz      b   2012/02/13
5        4321        baz      b   2012/02/15
6        8765        bop      b   2012/02/16
7        5678        bop      c   2012/03/15
8        5678        foo      a   2012/04/01
9        1234        baz      b   2012/04/11
10       8765        bar      a   2012/05/03
11       4321        bar      a   2012/05/04
12       5678        baz      c   2012/06/01
13       1234        bar      b   2012/11/01

I am using the following command:

 df.pivot(index='StudentID', columns='QuestionID')

But I am getting the following error:

ReshapeError: Index contains duplicate entries, cannot reshape

Note that the same dataframe without the last line

13       1234        bar      b   2012/11/01

The pivot results successfully in following:

           Answer               DateRecorded                                    
QuestionID    bar baz  bop  foo          bar         baz         bop         foo
StudentID                                                                       
1234            a   b  NaN    c   2012/01/21  2012/04/11         NaN  2012/01/22
4321            a   b    a  NaN   2012/05/04  2012/02/15  2012/01/22         NaN
5678            a   c    c    a   2012/01/24  2012/06/01  2012/03/15  2012/04/01
8765            a   b    b  NaN   2012/05/03  2012/02/13  2012/02/16         NaN

I am new to pivoting and would like to know why having duplicate StudentID, QuestionID pair causing this problem? And, how can I fix this using the df.pivot() function?

thank you.

iqbalnaved
  • 85
  • 1
  • 8
  • I've already tried these ones : http://stackoverflow.com/questions/13216087/pandas-dataframe-pivoting-issue, http://stackoverflow.com/questions/11232275/pandas-pivot-warning-about-repeated-entries-on-index and http://stackoverflow.com/questions/13216087/pandas-dataframe-pivoting-issue. But the problem is still there. Also, I don't want a overcomplicated MultiIndexed dataframe in the result. – iqbalnaved Jul 31 '13 at 18:33

2 Answers2

5

What do you expect your pivot table to look like with the duplicate entries? I'm not sure it would make sense to have multiple elements for (1234, bar) in the pivot table. Your data looks like it's naturally indexed by (questionID, studentID, dateRecorded).

If you go with the Hierarchical Index approach (they're really not that complicated!) I'd try:

In [104]: df2 = df.set_index(['StudentID', 'QuestionID', 'DateRecorded'])

In [105]: df2
Out[105]: 
                                  Answer
StudentID QuestionID DateRecorded       
1234      bar        2012/01/21        a
          foo        2012/01/22        c
4321      bop        2012/01/22        a
5678      bar        2012/01/24        a
8765      baz        2012/02/13        b
4321      baz        2012/02/15        b
8765      bop        2012/02/16        b
5678      bop        2012/03/15        c
          foo        2012/04/01        a
1234      baz        2012/04/11        b
8765      bar        2012/05/03        a
4321      bar        2012/05/04        a
5678      baz        2012/06/01        c
1234      bar        2012/11/01        b

In [106]: df2.unstack('QuestionID')
Out[106]: 
                       Answer               
QuestionID                bar  baz  bop  foo
StudentID DateRecorded                      
1234      2012/01/21        a  NaN  NaN  NaN
          2012/01/22      NaN  NaN  NaN    c
          2012/04/11      NaN    b  NaN  NaN
          2012/11/01        b  NaN  NaN  NaN
4321      2012/01/22      NaN  NaN    a  NaN
          2012/02/15      NaN    b  NaN  NaN
          2012/05/04        a  NaN  NaN  NaN
5678      2012/01/24        a  NaN  NaN  NaN
          2012/03/15      NaN  NaN    c  NaN
          2012/04/01      NaN  NaN  NaN    a
          2012/06/01      NaN    c  NaN  NaN
8765      2012/02/13      NaN    b  NaN  NaN
          2012/02/16      NaN  NaN    b  NaN
          2012/05/03        a  NaN  NaN  NaN

Otherwise you can come up with some rule to determine which of the multiple entries to take for the pivot table, and avoid the Hierarchical index.

TomAugspurger
  • 28,234
  • 8
  • 86
  • 69
  • Thank you. Your solution is what I was looking for. Only thing that I need is a single instance per StudentID. Therefore, I think the only way I can get is aggregate on the DateRecorded column by max(). – iqbalnaved Aug 01 '13 at 14:47
0

Instead of relying on Pandas (which is better of course) you could also aggregate your data manually.

def heatmap_seaborn():
    na_lr_measures = [50, 50, 50, 49, 49, 49, 48, 47, 47, 47, 46, 46, 46, 46, 45, 45, 45, 45, 45, 45, 45, 45, 45, 43, 43, 43, 43, 42, 42, 42, 41, 41, 41, 41, 41, 41, 41, 40, 40, 40, 40, 40, 40, 40, 40, 39, 39, 37, 37, 36, 36, 36, 36, 35, 35, 35, 35, 35, 34, 34, 34, 33, 33, 33, 32, 32, 31, 30, 30, 30, 29, 29]
    na_lr_labels = ('bi2e', 'bi21', 'bi22', 'si21', 'si22', 'si2e', 'si11', 'bi11', 'bi1e', 'si1e', 'bx21', 'ti22', 'bx2e', 'si12', 'ti1e', 'sx22', 'ti21', 'bx22', 'sx2e', 'bi12', 'ti11', 'sx21', 'ti2e', 'ti12', 'sx11', 'sx1e', 'bxx2', 'bx1e', 'bx11', 'tx2e', 'tx22', 'tx21', 'sx12', 'six1', 'six2', 'sixe', 'sixx', 'tx11', 'bx12', 'bix2', 'bix1', 'tx1e', 'bixe', 'bixx', 'bxxe', 'sxx2', 'tx12', 'tixe', 'tix1', 'sxxe', 'sxx1', 'si1x', 'tixx', 'bxx1', 'tix2', 'bi2x', 'sxxx', 'si2x', 'txx1', 'bxxx', 'txxe', 'ti2x', 'sx2x', 'bx2x', 'txxx', 'bi1x', 'tx1x', 'sx1x', 'tx2x', 'txx2', 'bx1x', 'ti1x')
    na_lr_labelcategories = ["TF", "IDF", "Normalisation", "Regularisation", "Acc@161"]


    measures = na_lr_measures
    labels = na_lr_labels
    cats = na_lr_labelcategories


    new_measures = defaultdict(list)
    new_labels = []
    #cats = ["TF", "Normalisation", "Acc@161"]
    for i,c in enumerate(labels):
        c=c[0]+c[2]
        new_labels.append(c)
        m = measures[i]
        new_measures[c].append(m)
    labels = list(set(new_labels))
    measures = []
    for l in labels:
        m = np.mean(new_measures[l])
        measures.append(m)





    df = pd.DataFrame(
                  {cats[0]:pd.Categorical([a[0] for a in labels]), 
                   #cats[1]:pd.Categorical([a[1] for a in labels]), 
                   cats[2]:pd.Categorical([a[1] for a in labels]), 
                   #cats[3]:pd.Categorical([a[3] for a in labels]), 
                   cats[4]:measures})
    print df


    df = df.pivot(cats[0], cats[2], cats[4])
    sns.set_context("paper",font_scale=2.7)
    fig, ax = plt.subplots()
    ax = sns.heatmap(df)
    plt.show()

as you can see in the example a pandas dataframe is built from some arrays then the table is manually aggregated. I did this because I didn't have the time to learn more pandas.

Ash
  • 3,428
  • 1
  • 34
  • 44