0

I have one table with a static list of "template" strings, and another sheet that updates daily with new "populated" strings. I want to be able to identify which "template" each string is based on by finding the string with the lowest distance, and then displaying the distance between the template and the new string as a percentile.

Is it possible to do a Levenshtein distance in Excel without having to resort to Macros?

I found this function for calculating the distance, but I'm a novice with this kind of code, and I'm not sure how to include the step for locating the closest match before displaying the distance. It also seems like this might not work for long strings (200+ characters) like I'm dealing with.

I'm a sheets guy, not normally a code guy, and I'm up against a brick wall here in terms of my understanding of what to do next. Thank you for your help!

  • Welcome to [Stack Overflow](https://stackoverflow.com/tour). Your post does not ask a question. See [How do I ask a good question?](https://stackoverflow.com/help/how-to-ask) – doubleunary Nov 14 '22 at 15:22

1 Answers1

0

It is possible to calculate the Levenshtein distance between two text strings using a named function such as ztiaa's LEVDIST.

But it may be more practical to use a custom function like the one you link to, or the following implementation that fuzzy matches values to a corpus using fuzzyset.js which is a JavaScript port of the Python fuzzyset.

/**
* Finds the closest matches to words using the FuzzySet library.
* 
* @param {A2:A} words_to_find The text strings to find matches for.
* @param {Dictionary!W2:W} dictionary A list of words to match words_to_find against.
* @param {0.5} match_score_min Optional. A threshold value for how good the match must be. 1 means an exact match. The default is 0.33.
* @return {Array} The closest match to words_to_find found in the dictionary, or a blank value if there is no match.
* @customfunction
*/
function FindFuzzyMatch(words_to_find, dictionary, match_score_min) {
  'use strict';
  // version 1.2, written by --Hyde, 14 September 2022
  // - hat tip to Andreas Muller
  // - uses fuzzyset.js by Glen Chiacchieri, a port of fuzzyset by Mike Axiak
  if (arguments.length < 2 || arguments.length > 3) {
    throw `Wrong number of arguments to FindFuzzyMatch. Expected 2 or 3 arguments, but got ${arguments.length} arguments.`;
  }
  const words = Array.isArray(words_to_find) ? words_to_find : [[words_to_find]];
  const dict = Array.isArray(dictionary) ? dictionary.flat() : [dictionary];
  const fuzzyDictionary = FuzzySet(dict);
  return words.map(row => row.map(word => {
    if (!word) {
      return null;
    }
    const fuzzyMatches = fuzzyDictionary.get(word, null, match_score_min);
    return fuzzyMatches ? fuzzyMatches[0][1] : null; // get just the first result, ignoring its score
  }));
}

// include https://github.com/Glench/fuzzyset.js/blob/master/lib/fuzzyset.js below,
// and remove the final export {} block
doubleunary
  • 13,842
  • 3
  • 18
  • 51
  • Thanks for that direction! This is almost what I need, but now I'm wondering if I was starting down the wrong tree altogether with Levenshtein - Will this work as-is for comparing paragraph-length strings, or how do I set up the dictionary? (Sorry if this is a dummy-question) – Max Singer Nov 14 '22 at 17:38
  • Your "template" strings are the dictionary, and the "populated" strings are the words that you are fuzzy matching against that dictionary. The solution I gave you will work with paragraph-length strings. It uses Levenshtein distance to order the top 50 matches. Test it carefully to ensure its performance is good enough for you. – doubleunary Nov 14 '22 at 17:46