This is a question about theory as much as it is about syntax. I'm writing a program to classify the transactions in my bank statements, and I'm stuck on the best way to go about it.
I have a series of 'annual statement' dataframes, each containing a year's worth of transactions. They include the columns:
| description | type |
| --- | --- |
| sainsbury's | |
| js sainsbury | |
| amzn marketplace | |
| mbna | |
| amazon | |
| netflix | |
| gap | |
| vue | |
Where `description` is the name of a vendor that took part in a transaction and `type` is an empty column that I want to fill.
I also have a 'classifications' dataframe, which looks like:
| type | search term |
| --- | --- |
| groceries | sainsbury |
| amazon | amzn marketplace |
| amazon | amazon |
| subscriptions | netflix |
| clothes | gap |
| luxuries | vue |
Where `type` is a set of categories that I've decided provide a nice summary of what I spend my money on, and `search term` is a term that can be used to identify which transactions relate to which classifications.
Basically, I need to match each search term in the classifications dataframe against the `description` column of an annual statement dataframe, then write the corresponding entry from the classifications dataframe's `type` column into the statement's `type` column. Note that the match needs to be a partial one: the search term 'sainsbury' should match both "sainsbury's" and 'js sainsbury'. In other words, I'd end up with this (where there is no matching search term, as with 'mbna', the `type` column is left blank):
| description | type |
| --- | --- |
| sainsbury's | groceries |
| js sainsbury | groceries |
| amzn marketplace | amazon |
| mbna | |
| amazon | amazon |
| netflix | subscriptions |
| gap | clothes |
| vue | luxuries |
As for scale: there are thousands of rows in each annual statement dataframe, around ten different types, and probably fewer than a hundred search terms.
I've had a few thoughts about how to do this, but I'm a beginner with Pandas so I'd appreciate some expert help!
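For what it's worth, the rough sketch I keep coming back to is simply looping over the classifications and using `str.contains` to do a case-insensitive substring match of each search term against the descriptions, something like this (with toy dataframes standing in for my real ones):

```python
import pandas as pd

# Toy versions of the two dataframes from the tables above
statement = pd.DataFrame({
    "description": ["sainsbury's", "js sainsbury", "amzn marketplace",
                    "mbna", "amazon", "netflix", "gap", "vue"],
    "type": "",  # empty column to be filled in
})

classifications = pd.DataFrame({
    "type": ["groceries", "amazon", "amazon",
             "subscriptions", "clothes", "luxuries"],
    "search term": ["sainsbury", "amzn marketplace", "amazon",
                    "netflix", "gap", "vue"],
})

# For each search term, flag every description that contains it as a
# substring, then write that term's type into the statement's type column.
for _, row in classifications.iterrows():
    matches = statement["description"].str.contains(
        row["search term"], case=False, regex=False
    )
    statement.loc[matches, "type"] = row["type"]

print(statement)
```

This seems to produce the output above, but it feels very un-pandas-like, and I suspect the row order in the classifications dataframe would start to matter if two search terms ever matched the same description. Is there a more idiomatic (and efficient) way to do this?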