6

I am trying to reshape a pandas dataframe, by turning one of the columns in the data, into rows (by pivoting or unstacking).

I am new to this, so likely that I'm missing something obvious. I've searched extensively, but have not been able to successfully apply any solutions that I've come across.

df
    Location    Month       Metric       Value
0   Texas       January     Temperature  10
1   New York    January     Temperature  20
2   California  January     Temperature  30
3   Alaska      January     Temperature  40
4   Texas       January     Color        Red
5   New York    January     Color        Blue
6   California  January     Color        Green
7   Alaska      January     Color        Yellow
8   Texas       February    Temperature  15
9   New York    February    Temperature  25
10  California  February    Temperature  35
11  Alaska      February    Temperature  NaN
12  Texas       February    Color        NaN
13  New York    February    Color        Purple
14  California  February    Color        Orange
15  Alaska      February    Color        Brown

I am trying to "pivot" the Metric values into columns. End goal is a result like this:

Location    Month     Temperature   Color
Texas       January   10            Red
New York    January   20            Blue
California  January   30            Green
Alaska      January   40            Yellow
Texas       February  15    
New York    February  25            Purple
California  February  35            Orange
Alaska      February                Brown

I have tried using pivot, pivot_table, as well as unstack methods, but I'm sure I'm missing something. Many of the complications seem to come because I am mixing strings with numbers, and have some missing values in the data as well.

This is the closest I have been able to get so far, but I don't want extra rows for the month column, resulting in more blank values:

df.set_index(['Location','Month','Metric'], append=True, inplace=True)
df.unstack()

    Value
    Metric              Color   Temperature
    Location    Month       
0   Texas       January None    10
1   New York    January None    20
2   California  January None    30
3   Alaska      January None    40
4   Texas       January Red     None
5   New York    January Blue    None
6   California  January Green   None
7   Alaska      January Yellow  None

Any help here would be greatly appreciated. This seems like something that most likely has a simple solution available.

brendxn
  • 73
  • 1
  • 5

2 Answers2

5

A pivot solution to what you need. The output is semantics to what you want -

Metric                Color Temperature
Location   Month                       
Alaska     February   Brown         NaN
           January   Yellow          40
California February  Orange          35
           January    Green          30
New York   February  Purple          25
           January     Blue          20
Texas      February     NaN          15
           January      Red          10

Code -

df_p = df.pivot_table(index=['Location', 'Month'], columns=['Metric'], values='Value', aggfunc=np.sum)
Vivek Kalyanarangan
  • 8,951
  • 1
  • 23
  • 42
  • Thanks a lot! To me (beginner) this seems like the most straightforward solution. Interested to hear if there are any downsides to this approach though, compared to using unstack, as someone else suggested. Also, I added the following, to make the output in the format I need: df_p.reset_index().rename_axis(None, axis=1) – brendxn Feb 28 '18 at 18:55
4

First for correct ordering convert column Month to ordered categorical and then reshape by set_index with unstack:

#add another months
cats = ['January','February']
df['Month'] = pd.Categorical(df['Month'], categories=cats, ordered=True)

Or if months are ordered in original column, thanks @asongtoruin:

df['Month'] = pd.Categorical(df['Month'], categories=df['Month'].unique(), ordered=True)

df = (df.set_index(['Location','Month','Metric'])['Value']
       .unstack()
       .reset_index()
       .rename_axis(None, axis=1)
       .sort_values('Month'))
print (df)
     Location     Month   Color Temperature
0      Alaska   January  Yellow          40
2  California   January   Green          30
4    New York   January    Blue          20
6       Texas   January     Red          10
1      Alaska  February   Brown         NaN
3  California  February  Orange          35
5    New York  February  Purple          25
7       Texas  February     NaN          15
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • 2
    how about `pd.Categorical(df['Month'], categories=df['Month'].unique(), ordered=True)` for increased flexibility? – asongtoruin Feb 28 '18 at 11:14
  • @asongtoruin - good idea, thanks. – jezrael Feb 28 '18 at 11:15
  • @asongtoruin, if `Month` contains strings as in this question, that would impose lexicographic (alphabetical) order on month names, which is usually not desired. You'd have to hand-code proper month ordering, or `import calendar` and use `calendar.month_name`. **EDIT:** never mind, I mistakenly thought of `pd.Categorical(df['month'].unique(), ordered=True)`. – Peter Leimbigler Feb 28 '18 at 12:16
  • @PeterLeimbigler yes you are correct - using `.unique()` ensures that you always capture every month available but doesn't necessarily retain calendar order. It depends on what is more important to OP, I guess. – asongtoruin Feb 28 '18 at 12:18
  • Thanks. To me, the pivot_table method seems more simple. What would be the advantages of using unstack here? Or is it just preference? – brendxn Feb 28 '18 at 18:57
  • @brendx - I think pivot_table is used only if need aggregate function. So here is faster and better use set_index + unstack. If duplicates, then is better groupby + aggregate function what is faster alternative of pivot_table. But it is up to you. Good luck! – jezrael Feb 28 '18 at 19:09