2

So I have this pandas DataFrame, with 5 columns and like 100000 rows. Here's example:

    V1      V2          V3          V4  V5
0   2014    Alfa Romeo  159         1   157
1   2014    Alfa Romeo  GIULIETTA   1   119
2   2014    Alfa Romeo  GIULIETTA   3   119

What I want to do is to sum values in column V4 IF values in V1, V2, V3 and V5 are exactly the same.

    V1      V2          V3          V4  V5
0   2014    Alfa Romeo  159         1   157
1   2014    Alfa Romeo  GIULIETTA   4   119

At first I thought that groupby would do the job, but when I did

df.groupby(['V1', V2','V3', 'V5' ]).sum()

I lost some of the information in column V3, for example, I should have 10 different types of colors but now I have only 3. How to solve that?

MikolajM
  • 300
  • 1
  • 13
  • `lost some of the information in column V3` Can you elaborate? – cs95 Dec 21 '17 at 16:19
  • Does `df.groupby(['V1', 'V2', 'V3', 'V5']).V4.sum().reset_index().reindex(columns=df.columns)` work for you? – cs95 Dec 21 '17 at 16:19
  • lost some of the information in column V3 Can you elaborate? – cᴏʟᴅsᴘᴇᴇᴅ Sure. In my database, I have like 10 types for Volvo (V3 column): XC60 V60 V40 XC70 S60 S80 V70 V40 CROSS COUNTRY 13 XC90 10 V60 HYBRID 1 and after I make df.groupby(['V1', 'V2', 'V3','V5' ]).agg([ "sum"]) It shows that there's only V70, XC60, XC70, XC90. Is that a bug in pandas? What do you think? – MikolajM Dec 21 '17 at 16:29
  • df.groupby(['V1', 'V2', 'V3','V5' ],as_index=False).agg([ "sum"]) – BENY Dec 21 '17 at 16:42

1 Answers1

3

I suspect you are dealing with missing data in 'V1','V2','V3' or 'V5' columns and this data is omitted from groupby's. See this SO Post

A workaround is to use fillna():

df.fillna("Missing").groupby(['V1','V2','V3','V5']).V4.sum()

Example:

print(df)

       V1          V2         V3  V4   V5
0  2014.0  Alfa Romeo        159   1  157
1  2014.0  Alfa Romeo  GIULIETTA   1  119
2  2014.0  Alfa Romeo  GIULIETTA   3  119
3     NaN  Alfa Romeo      Black   4  119

df.groupby(['V1','V2','V3','V5']).V4.sum()

Missing 'Black':

V1      V2          V3         V5 
2014.0  Alfa Romeo  159        157    1
                    GIULIETTA  119    4
Name: V4, dtype: int64

Using fillna():

df.fillna("Missing").groupby(['V1','V2','V3','V5']).V4.sum()

V1       V2          V3         V5 
2014.0   Alfa Romeo  159        157    1
                     GIULIETTA  119    4
Missing  Alfa Romeo  Black      119    4
Name: V4, dtype: int64
Scott Boston
  • 147,308
  • 15
  • 139
  • 187