I have some problems with pandas dataframes join. This is frustrating because it is full of the misspellings and errors. There are two dataframes: the customer dataframe (cust_df) and the credit cards dataframe (cc_df). Both dataframes contains : its id column, name, date of birth, national id number and mobile number. I want to get CUST_ID (from custdf) for every CC_ID (from ccdf). Usually we can do pd.join then stated the key columns; however in this case the unique keys are not accurate (some of them are wrong) thats why multiple columns are needed. Below are the example:
Sample data for cust_df:
CUST_ID , NAME , DOB, NTLNO , MOBILE
231412 ,W CUMI , 17/02/1989 00:00 , 952547890 ,85320818335
123433 ,S KAMU W , 28/04/1989 00:00 , 444466213 ,81217354673
324123 ,K D SUMI , 05/01/1983 00:00 , 677676999 ,82938489231
324123 ,M SAMI , 17/02/1989 00:00 , 657567577 , 81217354674
231432 ,MOSE N , 17/02/1989 00:00 , 123423355 ,83320818575
Sample data for cc_df:
CC_ID , FNAME,MNAME,LNAME,DOB,NTLNO,MOBILE
1234 , MOSE, NAMU, ,17/02/1989 00:00 , 123423355, 83320818575
2345 , MAWON, ,SAMI, 17/02/1989 00:00, 657567577, 81217354674
3243 , KOME, DODO, SUMI, 05/01/1983 00:00, 76999, 82938489331
4365 , SAPI, KAMU, WEWE , 28/04/1989 00:00 , 444466213, 81217354673
4534 , WAKU, CUMI, ,17/02/1989 00:00 , 52547890, 85320818335
The problems are:
1. Name, In cc_df, the names are separated while in cust_df names are concatenated, sometimes in its short form. Example: Sumi Cumi -> S Cumi.
2. NTLNO & MOBILE
NTLNO & MOBILE are unique identifiers but because of the wrong input some of it are wrong.
What is the most accurate way to get the CUST_ID for every CC_ID ?