Summing Row Values Based On Substring in Column

Question

I have a Dataframe containing transactions from an account and I need to categorize them and sum each category. I have 8 categories in total and can have 10+ keywords in each. I have found posts on this site that accomplish something similar but they have their caveats.

df['Contains string'] = (df[columns].apply(lambda x: x.str.contains('myString'))).sum(axis=1)

This is the closest I can find but it doesn't allow for multiple keywords or categories.

Examples of categories:

Phone payment - contains string "ATT Payment"
Supplies - contains string "EBAY" or "BIZCARD" or "AMAZON" or etc.

The DataFrame can have unlimited rows but a sample looks like:

                                         TransactionID    Debit   Credit
0                                   ATT Payment ID 001     0.00    21.29
1                                      BIZCARD PAYMENT     0.00   124.93
2                                          Check #1867  8755.50     0.00
3                                          Check #1874   110.66     0.00
5                                          EBAY ID 544     0.00    36.00
6                                        AMAZONPAY 788     0.00   373.20

The credit column would be the only that needs values to be summed.

Desired Output:

Phone Payment = 21.29  
Supplies = 534.13
Etc. = 00

New to Python. Pardon the lack or surplus of information. Thanks.

does the desired output is for a particular category ? or you want to get "EBAY" or "BIZCARD" or "AMAZON" only ? — dper, Jun 20 '20 at 20:48

Trenton McKinney · Accepted Answer · 2020-06-20T21:01:58.163

Use pandas.Series.str.contains

.contains() is then used for Indexing and selecting data
Create a mask for each set of desired constraints
The masks can then be used with a ~, which is not, for the etc data.

import pandas as pd

# create dataframe
data = {'TransactionID': ['ATT Payment ID 001', 'BIZCARD PAYMENT', 'Check #1867', 'Check #1874', 'EBAY ID 544', 'AMAZONPAY 788'],
        'Debit': [0.0, 0.0, 8755.5, 110.66, 0.0, 0.0],
        'Credit': [21.29, 124.93, 0.0, 0.0, 36.0, 373.2]}

df = pd.DataFrame(data)

# Get rows based on criteria of contains
phone_payment_mask = df.TransactionID.str.contains('ATT')
phone_payment = df.Credit[phone_payment_mask]
print(phone_payment.sum())
21.29

supplies_mask = df.TransactionID.str.contains('EBAY|BIZCARD|AMAZON')
supplies = df.Credit[supplies_mask]
print(supplies.sum())
534.13

etc = df.Credit[~phone_payment_mask & ~supplies_mask]
print(etc.sum())
0.0

score 0 · Answer 2 · answered Jun 20 '20 at 20:48

0

You can start by adding a category column as described in this answer (use .str.contains instead of .isin)

Then you can do the sum using a groupby :

df.groupby('Category_column').Credit.sum()

answered Jun 20 '20 at 20:48

Hugolmn

1,530
1
7
20

Summing Row Values Based On Substring in Column

2 Answers2

Use pandas.Series.str.contains