Pivot a dataframe with duplicate values in Index

Question

I have a pandas dataframe like this

    snapDate     instance   waitEvent                   AvgWaitInMs
0   2015-Jul-03  XX         gc cr block 3-way               1
1   2015-Jun-29  YY         gc current block 3-way          2
2   2015-Jul-03  YY         gc current block 3-way          1
3   2015-Jun-29  XX         gc current block 3-way          2
4   2015-Jul-01  XX         gc current block 3-way          2
5   2015-Jul-01  YY         gc current block 3-way          2
6   2015-Jul-03  XX         gc current block 3-way          2
7   2015-Jul-03  YY         log file sync                   9
8   2015-Jun-29  XX         log file sync                   8
9   2015-Jul-03  XX         log file sync                   8
10  2015-Jul-01  XX         log file sync                   8
11  2015-Jul-01  YY         log file sync                   9
12  2015-Jun-29  YY         log file sync                   8

I need to transform this to

snapDate        instance    gc cr block 3-way    gc current block 3-way  log file sync  
2015-Jul-03       XX              1                      Na                  8
2015-Jun-29       YY              Na                     2                   8 
2015-Jul-03       YY              Na                     1                   9
...

I tried pivot but it returns an error dfWaits.pivot(index = 'snapDate', columns = 'waitEvent', values = 'AvgWaitInMs') Index contains duplicate entries, cannot reshape

The result should be another dataFrame

Colonel Beauvel · Accepted Answer · 2015-07-15T09:02:40.060

You can also use pivot_table:

df.pivot_table(index=['snapDate','instance'], columns='waitEvent', values='AvgWaitInMs')

Out[64]:
waitEvent             gc cr block 3-way  gc current block 3-way  log file sync
snapDate    instance
2015-Jul-01 XX                      NaN                       2              8
            YY                      NaN                       2              9
2015-Jul-03 XX                        1                       2              8
            YY                      NaN                       1              9
2015-Jun-29 XX                      NaN                       2              8
            YY                      NaN                       2              8

Data:

I used the following txt file as input (with read_csv from pandas to get the data.frame):

snapDate;instance;waitEvent;AvgWaitInMs
0;2015-Jul-03;XX;gc cr block 3-way;1
1;2015-Jun-29;YY;gc current block 3-way;2
2;2015-Jul-03;YY;gc current block 3-way;1
3;2015-Jun-29;XX;gc current block 3-way;2
4;2015-Jul-01;XX;gc current block 3-way;2
5;2015-Jul-01;YY;gc current block 3-way;2
6;2015-Jul-03;XX;gc current block 3-way;2
7;2015-Jul-03;YY;log file sync;9
8;2015-Jun-29;XX;log file sync;8
9;2015-Jul-03;XX;log file sync;8
10;2015-Jul-01;XX;log file sync;8
11;2015-Jul-01;YY;log file sync;9
12;2015-Jun-29;YY;log file sync;8

it is certainly due to your data...what is the type of your column `AvgWaitInMs`? — Colonel Beauvel, Jul 15 '15 at 08:53
snapDate object instance object waitEvent object AvgWaitInMs object dtype: object — jhon.smith, Jul 15 '15 at 08:55
you can check the data, I put them on my side. Actually the OP should provide a reproducible example ;) (like a dict to copy/paste) — Colonel Beauvel, Jul 15 '15 at 09:03
Hi I went ahead and changed the datatype of AvgWaitInMs to int and the pivot worked — jhon.smith, Jul 15 '15 at 10:47

Jianxun Li · Answer 2 · 2015-07-15T08:37:36.777

Here is one way to reshape the dataframe to something similar to what you want. Let me know if you have any additional specific requirements on the resulting dataframe.

import pandas as pd

# your data
# ====================================
print(df)

       snapDate instance               waitEvent  AvgWaitInMs
0                                                            
0   2015-Jul-03       XX       gc cr block 3-way            1
1   2015-Jun-29       YY  gc current block 3-way            2
2   2015-Jul-03       YY  gc current block 3-way            1
3   2015-Jun-29       XX  gc current block 3-way            2
4   2015-Jul-01       XX  gc current block 3-way            2
5   2015-Jul-01       YY  gc current block 3-way            2
6   2015-Jul-03       XX  gc current block 3-way            2
7   2015-Jul-03       YY           log file sync            9
8   2015-Jun-29       XX           log file sync            8
9   2015-Jul-03       XX           log file sync            8
10  2015-Jul-01       XX           log file sync            8
11  2015-Jul-01       YY           log file sync            9
12  2015-Jun-29       YY           log file sync            8

# processing
# ====================================
df_temp = df.set_index(['snapDate', 'instance', 'waitEvent']).unstack().fillna(0)

df_temp.columns = df_temp.columns.get_level_values(1).values

df_temp = df_temp.reset_index('instance')

print(df_temp)

            instance  gc cr block 3-way  gc current block 3-way  log file sync
snapDate                                                                      
2015-Jul-01       XX                  0                       2              8
2015-Jul-01       YY                  0                       2              9
2015-Jul-03       XX                  1                       2              8
2015-Jul-03       YY                  0                       1              9
2015-Jun-29       XX                  0                       2              8
2015-Jun-29       YY                  0                       2              8

Almost there is there a way to repeat the snapDate instead of the blanks. — jhon.smith, Jul 15 '15 at 08:30
@jhon.smith sure, just add `.reset_index()`. I've updated the code. Please have a look. — Jianxun Li, Jul 15 '15 at 08:32
Hmm My dataframe had 12 rows but when i tried the unstack operation the resulting dataframe has only 6 rows not exactly what i want.My resulting dataframe should also have 12 rows — jhon.smith, Jul 15 '15 at 08:41
@jhon.smith I don't think you can keep the same number of rows as your original df because some rows have to mov to columns. For example, say `Jul-03` data, row `0, 6, 9` are all records about the same `snapDate` with instance `XX`. So doing a pivot would reshape these 3 rows to only one row because those data have been moved to columns. — Jianxun Li, Jul 15 '15 at 08:44

Pivot a dataframe with duplicate values in Index

2 Answers2

Linked