0

I'm trying to take subcategories and place them into columns so I can compute values for each column.

For example, what I have now is:

   c1   c2     c3
0  123  Orange 12
1  123  Car    15
2  123  Blue   14
3  123  Bike   13
4  234  Red    09
5  234  Bus    04
6  234  Train  19
7  234  Purple 17

What I want instead is:

   c1   c3 (colour)  c3 (transport)
0  123  26           28
1  234  26           23

Obviously with about ~20,000 rows and 50 columns.

I've tried using groupby and some other functions but my brain's a bit frazzled and I'm struggling to figure out how to get it to work in the way I want to.

Right now I'm using something to the effect of:

for x in df.c1.unique:
    df['c3 (colour)'] = df[(df[c1] == x) & (df[c2] == colour)].sum()

I'm aware that's a horrendously inefficient way to do it, but right now it's the only way I'm getting the table format that I want. I'm still trying to work out how to use pandas correctly so I'm a little unfamiliar with certain commands that could make it work quicker.

  • Thanks for explaining this so clearly! Can definitely empathise with frazzled brain syndrome when trying to do this kind of thing. I feel like `groupby` is definitely the way to go. Could you show us what you're code was when you were trying that? And have you had a look at this question: https://stackoverflow.com/questions/39922986/pandas-group-by-and-sum I feel like one of those answers might be close to what you want, but perhaps with a bit that you need slightly different? Also, are you bothered about the other ~50 columns, or are these the only important ones for this? – Tim Sep 02 '19 at 21:59

1 Answers1

0

Something like this will do the trick for you:

First if you need breakdown per Colour/Transport - you need to classify it accordingly, so:

>>> df
    c1      c2  c3         c4
0  123  Orange  12     Colour
1  123     Car  15  Transport
2  123    Blue  14     Colour
3  123    Bike  13  Transport
4  234     Red   9     Colour
5  234     Bus   4  Transport
6  234   Train  19  Transport
7  234  Purple  17     Colour

Then in order to get exactly what you want (so kind of aggregation with "sumif"):

>>> df.assign(c3_Colour=df["c3"][df["c4"]=="Colour"], c3_Transport=df["c3"][df["c4"]=="Transport"]).fillna(0).groupby(c1).agg({"c3_Colour":sum, "c3_Transport": sum})
     c3_Colour  c3_Transport
123       26.0          28.0
234       26.0          23.0
Grzegorz Skibinski
  • 12,624
  • 2
  • 11
  • 34