
I have two dataframes of size roughly 1,000,000 rows each. Both share a common 'Address' column which I am using to join the dataframes. Using this join, I wish to move information, which I shall call 'details', from dataframe1 to dataframe2.

df2['Details'] = df2['Address'].map(dict(zip(df1['Address'], df1['Details'])))

However, the address columns do not match exactly. I tried cleaning them as best I could, but can still only move roughly 40% of the data across. Is there a way to modify the code above to allow for partial matches? I'm totally stumped on this one.

The data is as described above. Fabricated sample data below:

df1 
Address                                    Details
Apt 15 A, Long Street, Fake town, US       A   


df2
Address                                    Details
15A, Long Street, Fake town, U.S.              
Ciaran O Brien
  • Can you give an example of the data? – Jack Moody Apr 01 '19 at 21:24
  • @JackMoody There is some fabricated sample data added. Not at all dissimilar from the actual data which I cannot share. As you can see, there are discrepancies between the addresses in the dataframes on which I need to join. 1m plus rows like this. – Ciaran O Brien Apr 01 '19 at 21:30

1 Answer


First, I would recommend performing the join and identifying the rows in each dataframe that lack an exact match. Once you have identified those rows, set the exact matches aside and apply the following suggestions to the remainder:
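As a minimal sketch of that first step, using the question's fabricated sample data: a left merge with `indicator=True` tags each row of `df2` according to whether it found an exact match in `df1`, so the unmatched rows can be split off for further treatment.

```python
import pandas as pd

# Fabricated data mirroring the question's example.
df1 = pd.DataFrame({"Address": ["Apt 15 A, Long Street, Fake town, US"],
                    "Details": ["A"]})
df2 = pd.DataFrame({"Address": ["15A, Long Street, Fake town, U.S."]})

# indicator=True adds a '_merge' column: 'both' for exact matches,
# 'left_only' for df2 rows with no exact match in df1.
merged = df2.merge(df1, on="Address", how="left", indicator=True)

matched = merged[merged["_merge"] == "both"]
unmatched = merged[merged["_merge"] == "left_only"]
```

Here the sample row lands in `unmatched`, since the two address spellings differ.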

  • One approach is to parse the addresses and attempt to standardize them. You might try using the usaddress module to standardize your addresses.

  • You could also try the approaches recommended in answers to this question, though they may take some tweaking for your case. It's hard to say without multiple examples of the partial string matches.

  • Another approach would be to use the Google Maps API (or Bing or MapQuest) for address standardization, though with over a million rows per dataframe you will far outstrip the free API calls per day and would need to pay for the service.

  • A final suggestion is to use the fuzzywuzzy module for fuzzy (approximate) string matching.

Nathaniel
  • Thanks for getting back to me on that. Certainly an API is out of the question. I will however take a look at fuzzywuzzy and see can I integrate it into my existing code. Essentially I would like to map across the data should there be perhaps an 80% match. Right now, my code only takes those with exact matches. I will continue to try standardize too which is my current approach. Thanks for sharing the other post too. – Ciaran O Brien Apr 01 '19 at 21:50