1

I have a Dataframe containing transactions from an account and I need to categorize them and sum each category. I have 8 categories in total and can have 10+ keywords in each. I have found posts on this site that accomplish something similar but they have their caveats.

df['Contains string'] = (df[columns].apply(lambda x: x.str.contains('myString'))).sum(axis=1)

This is the closest I can find but it doesn't allow for multiple keywords or categories.

Examples of categories:

Phone payment - contains string "ATT Payment"
Supplies - contains string "EBAY" or "BIZCARD" or "AMAZON" or etc.

The DataFrame can have unlimited rows but a sample looks like:

                                         TransactionID    Debit   Credit
0                                   ATT Payment ID 001     0.00    21.29
1                                      BIZCARD PAYMENT     0.00   124.93
2                                          Check #1867  8755.50     0.00
3                                          Check #1874   110.66     0.00
5                                          EBAY ID 544     0.00    36.00
6                                        AMAZONPAY 788     0.00   373.20

The credit column would be the only that needs values to be summed.

Desired Output:

Phone Payment = 21.29  
Supplies = 534.13
Etc. = 00

New to Python. Pardon the lack or surplus of information. Thanks.

Trenton McKinney
  • 56,955
  • 33
  • 144
  • 158
Jacob Baum
  • 27
  • 5
  • does the desired output is for a particular category ? or you want to get "EBAY" or "BIZCARD" or "AMAZON" only ? – dper Jun 20 '20 at 20:48

2 Answers2

3

Use pandas.Series.str.contains

  • .contains() is then used for Indexing and selecting data
  • Create a mask for each set of desired constraints
  • The masks can then be used with a ~, which is not, for the etc data.
import pandas as pd

# create dataframe
data = {'TransactionID': ['ATT Payment ID 001', 'BIZCARD PAYMENT', 'Check #1867', 'Check #1874', 'EBAY ID 544', 'AMAZONPAY 788'],
        'Debit': [0.0, 0.0, 8755.5, 110.66, 0.0, 0.0],
        'Credit': [21.29, 124.93, 0.0, 0.0, 36.0, 373.2]}

df = pd.DataFrame(data)

# Get rows based on criteria of contains
phone_payment_mask = df.TransactionID.str.contains('ATT')
phone_payment = df.Credit[phone_payment_mask]
print(phone_payment.sum())
21.29

supplies_mask = df.TransactionID.str.contains('EBAY|BIZCARD|AMAZON')
supplies = df.Credit[supplies_mask]
print(supplies.sum())
534.13

etc = df.Credit[~phone_payment_mask & ~supplies_mask]
print(etc.sum())
0.0
Trenton McKinney
  • 56,955
  • 33
  • 144
  • 158
0

You can start by adding a category column as described in this answer (use .str.contains instead of .isin)

Then you can do the sum using a groupby :

df.groupby('Category_column').Credit.sum()
Hugolmn
  • 1,530
  • 1
  • 7
  • 20