2

I have some problems with pandas dataframes join. This is frustrating because it is full of the misspellings and errors. There are two dataframes: the customer dataframe (cust_df) and the credit cards dataframe (cc_df). Both dataframes contains : its id column, name, date of birth, national id number and mobile number. I want to get CUST_ID (from custdf) for every CC_ID (from ccdf). Usually we can do pd.join then stated the key columns; however in this case the unique keys are not accurate (some of them are wrong) thats why multiple columns are needed. Below are the example:
Sample data for cust_df:

CUST_ID , NAME  ,  DOB, NTLNO   , MOBILE  
231412 ,W CUMI  , 17/02/1989 00:00    , 952547890   ,85320818335  
123433 ,S KAMU W  , 28/04/1989 00:00    , 444466213  ,81217354673  
324123 ,K D SUMI    , 05/01/1983 00:00    , 677676999   ,82938489231  
324123 ,M SAMI    , 17/02/1989 00:00    , 657567577  , 81217354674  
231432 ,MOSE N  , 17/02/1989 00:00    , 123423355   ,83320818575  

Sample data for cc_df:

CC_ID , FNAME,MNAME,LNAME,DOB,NTLNO,MOBILE  
1234    , MOSE,    NAMU,        ,17/02/1989 00:00 ,   123423355,   83320818575  
2345    , MAWON,       ,SAMI,    17/02/1989 00:00,    657567577,   81217354674  
3243    , KOME,    DODO,    SUMI,    05/01/1983 00:00,    76999,   82938489331  
4365    , SAPI,    KAMU,    WEWE ,   28/04/1989 00:00 ,   444466213,   81217354673  
4534    , WAKU,    CUMI,        ,17/02/1989 00:00 ,   52547890,    85320818335  

The problems are:
1. Name, In cc_df, the names are separated while in cust_df names are concatenated, sometimes in its short form. Example: Sumi Cumi -> S Cumi.
2. NTLNO & MOBILE NTLNO & MOBILE are unique identifiers but because of the wrong input some of it are wrong.

What is the most accurate way to get the CUST_ID for every CC_ID ?

  • Welcome to SO. Can you please [edit](https://stackoverflow.com/posts/50211473/edit) your question, including data as test rather than links / images. This will let us copy-paste and test solutions. See also [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – jpp May 07 '18 at 09:59

1 Answers1

0

You can use merge_asof:

cust_df.sort_values(by='MOBILE', inplace=True)
cc_df.sort_values(by='MOBILE', inplace=True)

pandas.merge_asof(cc_df, cust_df, on='MOBILE', direction='nearest')

For this method to work keys need to be sorted.

zipa
  • 27,316
  • 6
  • 40
  • 58