0

I am in the process of matching a number of data sets. These are passenger arrivals from a number of different systems. I need to match these as best as possible. 2% unique in each set, the rest common.

I am not trying to merge, deduplicate, or standardise the data as is normally the case with fuzzy look up. I am trying to find the quality, value and location of the closest match. Other then the common fields the data sets have a whole bunch of unique fileds. Essentially am trying to find a link between these so that I can create reports with the different data sets, each of which has information I need. These have over 100k rows.

I have made the common fields into a sting to simplify the calculations. The fields are arrival date (in excel number format), DOB, Passport and full name. i.e. "44250 | 15-JAN-80 | UK1234567 | JOHN AMITH"

Essentially starting with Table1, I want to add 3 columns; the nearest match in text, the ID associated with this value in the second table or the row number so I can index/match the data and finally the percentage similarity as per example.

I have found functions that find the nearest match, but not the location, or associated ID. Any ideas how the below would work or any other ideas.

example

MADEUP VALUES

TABLE 1 REF TABLE 1 ID 44054 | 29-Aug-1960 | CL-F2944458 | JOHN THOMSON ID1-010739 44054 | 09-Dec-1989 | LM389990 | EDWARD SMITH ID1-010737 44054 | 09-Dec-1991 | LL556699 | RICHARD FREEMAN ID1-010738 44054 | 06-May-1960 | LK9915782 | JEAN HAMILTON ID1-010740 44054 | 05-Nov-1954 | US 9910505 | BEN JONES ID1-010753

TABLE 2 REF TABLE 2 ID 44054 | 05-Nov-1954 | US 9910505 | BENJAMIN JONES ID2-0001 44059 | 19-Aug-1960 | CL-F2944458 | JOHN THOMSON ID2-0002 44054 | 09-Dec-1991 | LL556666 | RICHARD FREEMAN ID2-0003 44054 | 06-May-1960 | LK9915782 | JEAN HAMILTON ID2-0004 44054 | 09-Nov-1989 | AU-LM389990 | EDWARD SMTH ID2-0005

Levenshtein Distance in VBA

Fuzzy matching Mr Excel

github Fuzzy

Scottyp
  • 111
  • 1
  • 10
  • You can do fuzzy matching in Power Query. – Ron Rosenfeld Jun 04 '21 at 18:24
  • No not really fuzzy will do a basic match but it won't return the percent, or where it sits. It also takes fricking ages so is not an efficient way to do it. – Scottyp Jun 04 '21 at 22:24
  • I've never run a fuzzyJoin on a large data set so was not aware about the time issue. But what do you mean about it not really being a fuzzy match? Let me look into the issue of returning the similarity value. – Ron Rosenfeld Jun 05 '21 at 00:23
  • Apparently only the online Power Query version has the option for returning the similarity value for the fuzzy match. So I guess you'd have to do multiple matches with different thresholds to do it in other versions. By the way, PQ is said to use the Jaccard similarity algorithm for accomplishing fuzzy matches. – Ron Rosenfeld Jun 05 '21 at 00:48
  • At work I am stuck with excel 2016 also, which doesn't have it at all. It's really more getting the location of the match so can create combined reports. – Scottyp Jun 06 '21 at 01:58

1 Answers1

0

I needed to use fuzzy matching in Excel for work, and I also needed to know string similarity, be able to partition sentences, etc.

I created a VBA module for doing just this, I think it may help you: https://github.com/kyledeer-32/vba_fuzzymatching.git

Basically, importing it into your workbook will give you access to several UDFs, e.g., fuzzy match, string similarity, etc.

Note: it won't return the cell index of a matched value, but you could configure the scripts to do this fairly easy, e.g., with just a few changes, you could configure the "=fuzzy_match" function to return the array position of the best match instead of the value itself.

Hope this helps!

Kyle Deer
  • 91
  • 8