2

I have 2 dataframes and I want to join by name, but names are not written exactly the same:

Df1:

ID    Name       Age
1     Jose        13
2     M. Jose     12
3     Laura        8
4     Karol P     32

Df2:
Name         Surname
José           Hall
María José     Perez
Laura          Alza
Karol          Smith


I need to join and get this:

ID    Name       Age   Surname
1     Jose        13     Hall
2     M. Jose     12     Perez
3     Laura        8     Alza
4     Karol P     32     Smith

How to consider that the names are not exactly the same before to join?

Alexa
  • 99
  • 1
  • 6

2 Answers2

2

You can get close to your result using stringdist_left_join from fuzzyjoin

library(fuzzyjoin)
stringdist_left_join(df1, df2, by = "Name")

#  ID  Name.x Age Name.y Surname
#1  1    Jose  13   José    Hall
#2  2 M. Jose  12   <NA>    <NA>
#3  3   Laura   8  Laura    Alza
#4  4 Karol P  32  Karol   Smith

For the example shared it does not work for 1 entry since it is difficult to match Maria with M.. You can get the result for it by adjusting the max_dist argument to a higher value (default is 2) however, this will screw up other results and would give unwanted matches. If you have minimal NA entries (like the example shared) after this join you could just match them by "hand".

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
0

I would clean the database before (for example deleting those ´, in excel is easy doing those replace) and then use

 new_df <- merge(df1, df2, by="name")

or you could try to assign an ID for df2 that coincide with df2 if it is possible.