3

I have d1 and d2 and I want to merge the two by ID column. However, the ID and ID2 are not exactly match. Instead, ID is the first 8 digit of ID2 (sometimes it can be the first 6 digit, or sometimes it can be one or two digit different).

I understand that I can pre-process ID2 to keep only the first 8 digit. However, I cannot handle all the situations.

I wonder is there an advanced way to merge through regular expression for fuzzy match? say, if the first 6 digits match, then merge?

d1=pd.DataFrame({'ID':['00846U10','01381710'],
                      'count':[100,200]})
d2=pd.DataFrame({'ID2':['00846U101','013817101','02376R102'],
                'value':[1,5,6]})
Lisa
  • 4,126
  • 12
  • 42
  • 71

1 Answers1

4

dude,

I have had the same problem and the only solution is to use other python packages. Have a look at fuzzywuzzy for instance. Its very good.

The general idea is that, for every row in d1, you will look for the row in d2 that has the highest fuzzy matching score.

ℕʘʘḆḽḘ
  • 18,566
  • 34
  • 128
  • 235
  • Thanks @Noobie would you please post a sample code? – Lisa Aug 03 '16 at 20:11
  • sure look here http://stackoverflow.com/questions/37979167/how-to-parallelize-many-fuzzy-string-comparisons-using-apply-in-pandas. if you are satisfied you can accept. thanks! – ℕʘʘḆḽḘ Aug 03 '16 at 20:35