Merge Dataframe by regular expression or fuzzy match

Question

I have d1 and d2 and I want to merge the two by ID column. However, the ID and ID2 are not exactly match. Instead, ID is the first 8 digit of ID2 (sometimes it can be the first 6 digit, or sometimes it can be one or two digit different).

I understand that I can pre-process ID2 to keep only the first 8 digit. However, I cannot handle all the situations.

I wonder is there an advanced way to merge through regular expression for fuzzy match? say, if the first 6 digits match, then merge?

d1=pd.DataFrame({'ID':['00846U10','01381710'],
                      'count':[100,200]})
d2=pd.DataFrame({'ID2':['00846U101','013817101','02376R102'],
                'value':[1,5,6]})

pre-process ID and ID2 to keep only 6 characters, wont that work!! — Gaurav Dhama, Aug 03 '16 at 20:07
@GauravDhama it won't handle all of my situations. updated the question — Lisa, Aug 03 '16 at 20:09
So, you want to slice both `ID` and `ID2` to contain 6 digits and if there's a match, then merge the two dataframes? — Nickil Maveli, Aug 03 '16 at 20:29

score 4 · Accepted Answer · answered Aug 03 '16 at 20:08

4

dude,

I have had the same problem and the only solution is to use other python packages. Have a look at fuzzywuzzy for instance. Its very good.

The general idea is that, for every row in d1, you will look for the row in d2 that has the highest fuzzy matching score.

answered Aug 03 '16 at 20:08

ℕʘʘḆḽḘ

18,566
34
128
235

Thanks @Noobie would you please post a sample code? – Lisa Aug 03 '16 at 20:11
sure look here http://stackoverflow.com/questions/37979167/how-to-parallelize-many-fuzzy-string-comparisons-using-apply-in-pandas. if you are satisfied you can accept. thanks! – ℕʘʘḆḽḘ Aug 03 '16 at 20:35

Merge Dataframe by regular expression or fuzzy match

1 Answers1