How to efficiently match strings across two dataframes

Question

This is a question about theory as much as it's one about syntax. I'm writing a program to classify transactions in my bank statements, and I'm stuck on the best way on doing it.

I have a series of 'annual statement' dataframes that each contain a year's worth of transactions. These include the columns:

| description      | type           |
| sainsbury's      |                |
| js sainsbury     |                |
| amzn marketplace |                |
| mbna             |                |
| amazon           |                |
| netflix          |                |
| gap              |                |
| vue              |                |

Where description is the name of a vendor that took part in a transaction and type is an empty column that I want to fill.

I also have a 'classifications' dataframe, which looks like:

| type           | search term      |
| groceries      | sainsbury        |
| amazon         | amzn marketplace |
| amazon         | amazon           |
| subscriptions  | netflix          |
| clothes        | gap              |
| luxuries       | vue              |

Where type is a set of things I've decided provide a nice summary of what I spend my money on, and search term is a term that can be used to identify which transactions relate to which classifications.

Basically, I need to match the search terms in the classifications dataframe to the description column in the annual statement dataframe, then write the corresponding entry from the type column in the classifications dataframe to the type column in the annual statements dataframe. In other words, I'd end up with this (note that where there was no type for 'mbna' in the classifications dataframe, the type column is left blank):

| description      | type           |
| sainsbury's      | groceries      |
| js sainsbury     | groceries      |
| amzn marketplace | amazon         |
| mbna             |                |
| amazon           | amazon         |
| netflix          | subscriptions  |
| gap              | clothes        |
| vue              | luxuries       |

As for scale, there's thousands of rows in each annual statement dataframe, around ten different types, and probably less than a hundred search terms.

I've had a few thoughts about how to do this, but I'm a beginner with Pandas so I'd appreciate some expert help!

Daniel Geffen · Accepted Answer · 2020-04-13T17:02:30.923

1

This seems like a pretty simple merge. You can specify different columns to match in the different dataframes. Something like this should work:

result_df = annual_statemenets_df.merge(classifications_df, left_on="description", right_on="search term")

If you are concerned about efficiency you could set the matching columns as indexes and use join, which is faster (as stated here).

If you want to match parts of strings with your search terms you should use replace, which supports regex. First you need to convert your search terms to regexes that select the whole phrase when finding the term in it, like this:

classifications_df["search term"] = ".*" + classifications_df["search term"] + ".*"

Then replace will replace the whole phrase:

to_replace = classifications_df["search term"].tolist()
replace_with = classifications_df["type"].tolist()

annual_statemenets_df["type"] = annual_statemenets_df["description"].replace(to_replace, replace_with, regex=True)

edited Apr 13 '20 at 17:02

answered Apr 11 '20 at 20:24

Daniel Geffen

1,777
1
11
16

This would work when you want exact matches, but I think for this purpose you would want something a little more fexible. For example: both "sainsbury" strings would be lacking a type eventhough they can be classified. – Est Apr 11 '20 at 20:36
I updated my answer to give an option for flexible searching. Is this what you are looking for? – Daniel Geffen Apr 11 '20 at 23:44
I like the merge solution. Those databasey operations look really powerful, but I've kinda ignored them because they're new to me. However, like Esteban said, I do need flexible searching, so the replace suggestion looks really good. I wouldn't of thought of using replace like that. I'll give it a shot. Thanks! – user4896331 Apr 13 '20 at 08:29
I tried out the replace method and it doesn't appear to quite work. For example, the row with "sainsbury's" in the description is given the type "groceries's" rather than just "groceries". Similarly, the row with "js sainsbury" is given the type "js groceries". The replace operation is correctly identifying the rows for the replace operation. Thanks though! I'll see if I can get something working with the replace function. – user4896331 Apr 13 '20 at 16:30
You're right. I added a line of code to the answer that should solve this. – Daniel Geffen Apr 13 '20 at 17:03
Yes! That works! One thing that's not quite right (my fault, I should have specified) is that if there is no 'type' in the classifications dataframe that fits a particular description, then the type column in the annual statement dataframe should be left blank. I edited the question about to be clear. Is this possible? – user4896331 Apr 13 '20 at 19:08
I would just add a row to `classifications_df` with an empty string in both columns that will catch all the descriptions that don't fit. – Daniel Geffen Apr 13 '20 at 22:21

How to efficiently match strings across two dataframes

1 Answers1