0

I have a large spreadsheet (upwards of 119K rows) of mismatched data. Column A contains a list of names in full (and occasionally a Trustee or company name), and Column B contains initialized first/middle names with full names (and occasionally Trustee or company names).

I do not currently have a way to compare them short of doing so manually as there are many variable, and am looking for some assistance.

So far I have tried using a VBA script from (How do I fuzzy match just adjacent cells?) to see if it can output the difference (which would allow me to eliminate the cells in Column 2 that had no matching data), but this did not function as intended.

I have also tried various LEFT/RIGHT to trim the names from Column A and then match this to Column B, but this has also not worked due to variance in text in Column A.

Here are some examples of the cells. Note that the names in Column A are not always in alphabetical order, but Column B is:

Example (complete match):

Column A:                                             Column B:        
Smith Marcus John                                     J M Smith          
Page Binder Book, Quoth Nevermore Raven               B B Page, R N Quoth
Orange Apple Banana, Orange Pear Plum                 A B & P P Orange
Koala Bear, Koala Marsupial Pouch, Koala Gum Tree     B, P M & T G Koala
S & P Limited                                         S & P Limited
S & P Limited                                         A D Cumin (S & P Limited)

Example (partial):

Column A:                                                            Column B:        
Page Binder Book, Quoth Nevermore Raven                              B B Page
Orange Apple Banana, Orange Pear Plum                                A B & P P Orange (Fruit 2019 Limited)
Koala Bear, Koala Marsupial Pouch, Goanna Gumtree, Koala Gum Tree    B, P M & T G Koala

Example (no match):

Column A:                                             Column B:        
Smith Marcus John                                     H J Hyde          
Sheppard Garrus Thane                                 B B Page, R N Quoth

What I am hoping to do:

Firstly, I am hoping to correctly mark each cell in Column B as complete/partial/no match with a fill (green/yellow/red). Secondly, for partial matches (whether Column A has extra information, or Column B is missing information) I want to output in Column C the missing information, like so:

 Column A:                                        Column B:                  
 Page Binder Book, Quoth Nevermore Raven           B B Page
 Orange Apple Banana, Orange Pear Plum             A B & P P Orange (Fruit 2019 Limited)

 Column C:
 Quoth Nevermore Raven
 (Fruit 2019 Limited)

Is this kind of thing even possible, or are there just too many variations in the way the data is presented in each column?

Very new to both this site and excel functions in general, this is my first task!

Thank you for your assistance/knowledge/time.

  • I believe you need a Levenshtein calcualation. But your matches look too fuzzy even for that. Maybe you can set a limit based on the lenght of the strings – Andreas Aug 21 '19 at 05:33
  • I don't want to close the question as a duplicate so I just post the link here for you. https://stackoverflow.com/questions/4243036/levenshtein-distance-in-vba . I believe you need to split the words in the cells on space and compare each word that starts with "a" with the other columns "a" since the words is not always in the correct order. – Andreas Aug 21 '19 at 05:34
  • @Andreas, can you provide an example of what you mean by splitting the words in the cells on space? I do not understand. I've had a look at that link and it seems like my data is indeed too fuzzy... Edit: I am also looking to output the missing data, is this possible? – ScarredAnvil Aug 21 '19 at 06:14
  • It is possible. But it's a massive job. It's far too big of a question for SatckOverflow. To split a string use the VBA function `Split(" ", string)` – Andreas Aug 21 '19 at 06:26

1 Answers1

0

Importing and using this VBA module: https://github.com/kyledeer-32/vba_fuzzymatching

Which contains several User Defined Functions (UDFs) will get you a near optimal solution (you will still have to review matches), but you can easily fuzzy match, then calculate the similarity between strings, then a simple "=IF" function can rank them. Using this VBA module I recommended, I got the following results:

enter image description here

I noted that "Koala Bear..." in Column A matched to "S & P..." in Column B. I expected the value in Column B with "...Koala" to match. I checked the script and the Levenshtein Edit distance was actually equal for both. This scarce occurrence will require you to review your matches, but you can do this quickly by ranking your results based on string similarity. Here is a formula view of what I did:

enter image description here

To import the VBA module linked in the beginning of this answer - here is a guide: https://www.excelcampus.com/vba/copy-import-vba-code/

Note: after importing this module, you will need to enable the "Microsoft Scripting Runtime" library in the Visual Basic Editor Window it to run. Steps to do this (takes less than a minute):

From Excel Workbook:

  • Select Developer tab on ribbon
  • Select Visual Basic
  • Select Tools on the Toolbar
  • Select References
  • Scroll down until you see Microsoft Scripting Runtime, then check the box
  • Press OK

Then your all set! You can use the UDFs (just like in my second image - above) just as you would use normal excel functions! Hope this helps!

Kyle Deer
  • 91
  • 8