I am in the process of matching a number of data sets. These are passenger arrivals from a number of different systems. I need to match these as best as possible. 2% unique in each set, the rest common.
I am not trying to merge, deduplicate, or standardise the data as is normally the case with fuzzy look up. I am trying to find the quality, value and location of the closest match. Other then the common fields the data sets have a whole bunch of unique fileds. Essentially am trying to find a link between these so that I can create reports with the different data sets, each of which has information I need. These have over 100k rows.
I have made the common fields into a sting to simplify the calculations. The fields are arrival date (in excel number format), DOB, Passport and full name. i.e. "44250 | 15-JAN-80 | UK1234567 | JOHN AMITH"
Essentially starting with Table1, I want to add 3 columns; the nearest match in text, the ID associated with this value in the second table or the row number so I can index/match the data and finally the percentage similarity as per example.
I have found functions that find the nearest match, but not the location, or associated ID. Any ideas how the below would work or any other ideas.
MADEUP VALUES
TABLE 1 REF TABLE 1 ID 44054 | 29-Aug-1960 | CL-F2944458 | JOHN THOMSON ID1-010739 44054 | 09-Dec-1989 | LM389990 | EDWARD SMITH ID1-010737 44054 | 09-Dec-1991 | LL556699 | RICHARD FREEMAN ID1-010738 44054 | 06-May-1960 | LK9915782 | JEAN HAMILTON ID1-010740 44054 | 05-Nov-1954 | US 9910505 | BEN JONES ID1-010753
TABLE 2 REF TABLE 2 ID 44054 | 05-Nov-1954 | US 9910505 | BENJAMIN JONES ID2-0001 44059 | 19-Aug-1960 | CL-F2944458 | JOHN THOMSON ID2-0002 44054 | 09-Dec-1991 | LL556666 | RICHARD FREEMAN ID2-0003 44054 | 06-May-1960 | LK9915782 | JEAN HAMILTON ID2-0004 44054 | 09-Nov-1989 | AU-LM389990 | EDWARD SMTH ID2-0005