How to create new columns from subcategories in pandas?

Question

I'm trying to take subcategories and place them into columns so I can compute values for each column.

For example, what I have now is:

   c1   c2     c3
0  123  Orange 12
1  123  Car    15
2  123  Blue   14
3  123  Bike   13
4  234  Red    09
5  234  Bus    04
6  234  Train  19
7  234  Purple 17

What I want instead is:

   c1   c3 (colour)  c3 (transport)
0  123  26           28
1  234  26           23

Obviously with about ~20,000 rows and 50 columns.

I've tried using groupby and some other functions but my brain's a bit frazzled and I'm struggling to figure out how to get it to work in the way I want to.

Right now I'm using something to the effect of:

for x in df.c1.unique:
    df['c3 (colour)'] = df[(df[c1] == x) & (df[c2] == colour)].sum()

I'm aware that's a horrendously inefficient way to do it, but right now it's the only way I'm getting the table format that I want. I'm still trying to work out how to use pandas correctly so I'm a little unfamiliar with certain commands that could make it work quicker.

Thanks for explaining this so clearly! Can definitely empathise with frazzled brain syndrome when trying to do this kind of thing. I feel like `groupby` is definitely the way to go. Could you show us what you're code was when you were trying that? And have you had a look at this question: https://stackoverflow.com/questions/39922986/pandas-group-by-and-sum I feel like one of those answers might be close to what you want, but perhaps with a bit that you need slightly different? Also, are you bothered about the other ~50 columns, or are these the only important ones for this? — Tim, Sep 02 '19 at 21:59

score 0 · Answer 1 · answered Sep 02 '19 at 21:58

Something like this will do the trick for you:

First if you need breakdown per Colour/Transport - you need to classify it accordingly, so:

>>> df
    c1      c2  c3         c4
0  123  Orange  12     Colour
1  123     Car  15  Transport
2  123    Blue  14     Colour
3  123    Bike  13  Transport
4  234     Red   9     Colour
5  234     Bus   4  Transport
6  234   Train  19  Transport
7  234  Purple  17     Colour

Then in order to get exactly what you want (so kind of aggregation with "sumif"):

>>> df.assign(c3_Colour=df["c3"][df["c4"]=="Colour"], c3_Transport=df["c3"][df["c4"]=="Transport"]).fillna(0).groupby(c1).agg({"c3_Colour":sum, "c3_Transport": sum})
     c3_Colour  c3_Transport
123       26.0          28.0
234       26.0          23.0

How to create new columns from subcategories in pandas?

1 Answers1

Linked