0

I've been using this VBA solution by smirkingman from another similar question for calculating Levenshtein distance between strings. I have a need to translate this to an M code function in Excel Power Query, but don't have the know-how to do so.

Hoping someone can help me out. The 3 basic transformations between strings used in Levenshtein distance are below. Each counts as 1 step. More steps = greater distance between strings.

  • Insertion
  • Deletion
  • Substitution

I thought I could "cheat" and not use a For loop-type structure as shown in the VBA example, but the test results below show that I need a more robust solution.

let
    result = (s1 as text, s2 as text) as number => 
        List.Max({Text.Length(s1),Text.Length(s2)}) - List.Count(List.Intersect({Text.ToList(s1), Text.ToList(s2)}))
in
    result

Test Results

s1 s2 result explanation
pale pole 1 substitution
dole sale 2 substitution (x2)
pool spool 1 insertion
two one 2 (incorrect) substitution and/or insert/delete (3 steps min) EXPECTED: 3
dholt
  • 33
  • 7
  • If Python is an option, go with this. https://stackoverflow.com/questions/66266748/how-can-we-add-a-field-in-a-data-frame-to-find-the-distance-between-latitude-l I've used it in the past. It works fantastic!!! – ASH Dec 10 '21 at 04:44
  • If your goal is just match strings that are similar, I've found the various "fuzzy" routines of power query to be effective. – Ron Rosenfeld Dec 10 '21 at 12:02
  • @RonRosenfeld I have only seen fuzzy matching give strings a "Similarity" score, which I'm not sure uses Levenshtein distance. I not only need to determine distance between strings, I need to count the number of strings with certain distances (e.g. 1, 2, 3 steps from s2). – dholt Dec 10 '21 at 13:35
  • It may use it as part of the fuzzy algorithm, but it won't serve your purposes, I think. You can use Python scripts in Power BI desktop /power query but I don't know about using it in Excel/power query. Or you can use `List.Generate` to loop. – Ron Rosenfeld Dec 10 '21 at 14:12
  • @RonRosenfeld can you offer a solution or some help with `List.Generate` ? I've researched and I'm having difficulty finding other examples of nested `List.Generate` functions. – dholt Jul 26 '22 at 23:15
  • @RonRosenfeld link to my PQ function since it's too long for a comment: [link](https://file.io/bLgyqOOMhdJD) – dholt Jul 26 '22 at 23:27

1 Answers1

0

If you're keen to use PowerQuery to achieve this, then you should check the Fuzzy Matching functionality. See details here and here.

However, to me, this is a 'black box' algorithm, and I am not convinced on how accurate/efficient it works... Plus, Microsoft have not published the code behind this function, so the Open Source community cannot investigate it.

The PowerBI Community seem to think that it is implementing the Jaccard similarity algorithm; which is a little bit different to the Levenshtein algorithm which you are familiar with. See more info here.

If you're keen to take a deep-dive in to the internal workings of the string-distance matrix, I implemented this in R some years ago. You can read about this in my Blog and my Repo on this topic.

I would strongly recommend you to not use PowerQuery or VBA for this. There are much much much better libraries in both R and Python for implementing this methodology.

chrimaho
  • 580
  • 4
  • 22