2

How do I create a new row with sum based on condition using pandas?

Initial table -

Product   Date  CAT    Value
Product A Apr   F31    100
Product A Apr   F32    200
Product A Apr   F45    300
Product A Apr   F46    400
Product A May   F31    200
Product A May   F32    300
Product A May   F45    400
Product A May   F46    500
Product B Apr   F31    200
Product B Apr   F32    300
Product B Apr   F45    400
Product B Apr   F46    500
Product B May   F31    600
Product B May   F32    700
Product B May   F45    800
Product B May   F46    900

I wanted to create this by combining F31 and F32 into F3 and it should give me the sum in value.

Product   Date  CAT    Value
Product A Apr   F3     300
Product A Apr   F45    300
Product A Apr   F46    400
Product A May   F3     500
Product A May   F45    400
Product A May   F46    500
Product B Apr   F3     500
Product B Apr   F45    400
Product B Apr   F46    500
Product B May   F3    1300
Product B May   F45    800
Product B May   F46    900

Can you please help me?

Santoo
  • 355
  • 2
  • 10
  • is `31` a constant that you need to replace with `3` ? just want to make sure as your example below you said you had two different text fields take your time updating your sample i'll check again in a while – Umar.H Sep 28 '20 at 15:52
  • @Manakin I want to combine F31 and F32 to F3. There are also other categories like 'Dep Single' and 'D. Recurrent', 'D. Combined' which has to be combined to 'Dep All'. If the code is dynamic would work. – Santoo Sep 28 '20 at 15:58
  • sorry for late reply got pulled into a meeting, this is difficult as I feel you have many requirements that can't be put into one line of code.. you'll need to create a helper columns to group by – Umar.H Sep 28 '20 at 17:17
  • @Manakin Thanks for getting back. Appreciate it. Yes I have taken the helper column route for now. – Santoo Sep 28 '20 at 17:20
  • Perhaps the critical part of your question is `series.replace`, but here's a [Q&A: Aggregation in pandas](https://stackoverflow.com/q/53781634/6692898) that covers `groupby/sum` among other use cases – RichieV Sep 29 '20 at 03:26
  • And here is a nice post for [replace vs map](https://stackoverflow.com/q/20250771/6692898) – RichieV Sep 29 '20 at 03:35
  • @RichieV Thank you. – Santoo Sep 29 '20 at 20:25

2 Answers2

2

For Creating the DataFrame as mentioned above, we need to apply two operations.

  1. String .replace

  2. .groupby with Aggregation i.e. .sum() and .reset_index()

  3. String Replace method is used to replace the characters in strings, by which we can convert the string i.e. F31, F32 to F3.

    .replace('F31','F3')

    .replace('F32','F3')

  4. .groupby method is used to separate the data, based on the Constraint Given. This Method Separates the Data into Blocks where operations can be performed. Here, we need to Group the Data based on multiple Columns.

Then we need to Aggregate the Sum of Value Column based on 'CAT' Column i.e. these are grouped first by 'Product', then by 'Date' and then by 'CAT'. So, for this we use .sum().

The result after this will be a grouped object, but as to convert it into DataFrame, we need to use .reset_index(), this method allows to fill the columns with corresponding values based on the input Conditioned Column.

Data.groupby(['Product','Date','CAT'])['Value'].sum().reset_index(name='Value')

The Code Snippet is show below:

Product = ['Product A','Product A','Product A','Product A','Product A','Product A','Product A','Product A','Product B','Product B','Product B','Product B','Product B','Product B','Product B','Product B']

Date = ['Apr','Apr','Apr','Apr','May','May','May','May','Apr','Apr','Apr','Apr','May','May','May','May']

CAT = ['F31','F32','F45','F46','F31','F32','F45','F46','F31','F32','F45','F46','F31','F32','F45','F46']

Value = [100, 200,300,400,200,300,400,500,200,300,400,500,600,700,800,900]

# Creating Data Frame

Data = pd.DataFrame({'Product':Product,'Date':Date,'CAT':CAT,'Value':Value})

# String Replace

Data['CAT'] = Data['CAT'].replace('F31','F3')

Data['CAT'] = Data['CAT'].replace('F32','F3')

# Group By Operation

DataG = pd.DataFrame(Data.groupby(['Product','Date','CAT'])['Value'].sum().reset_index(name='Value'))

Data Before applying above operations.

Data Before applying above operations

Data After applying above operations.

Data After applying above operations

Kranthi
  • 452
  • 4
  • 3
1

First let's do a targeted regex: replace only replacing a number that comes after CAT A

So CAT A5 --> CAT A

df['CAT'] = df['CAT'].str.replace('(CAT A)(\d+)',r'\1')

then do your group by.

df.groupby(['Product','Date','CAT'])['Value'].sum()
Product    Date  CAT  
Product A  Apr   CAT A     300
                 CAT B     300
                 CAT C     400
           Jul   CAT C     500
           Jun   CAT B     400
           May   CAT A     500
Product B  Apr   CAT A     500
                 CAT B     400
                 CAT C     500
           May   CAT A    1300
                 CAT B     800
                 CAT C     900
Name: Value, dtype: int64

Add .reset_index() if you want a dataframe returned.

halfer
  • 19,824
  • 17
  • 99
  • 186
Umar.H
  • 22,559
  • 7
  • 39
  • 74
  • Thank you. If it is not 'CAT' and named as something else, for e.g. 'type 1' and 'segment 1' instead of 'CAT A1' and 'CAT A2', how should the code work? – Santoo Sep 28 '20 at 15:33
  • @Santoo `df['CAT'].str.replace('(type|segment)(\s+)(\d+)',r'\1')` should work but you should update your sample to reflect your issue. – Umar.H Sep 28 '20 at 15:37