5

The story behind

I am creating a voice controlled application using x-webkit-speech which is surprisingly good (the feature, not my app), but sometimes the user (me) mumbles a bit. It would be nice to accept the command if some reasonable part of the word matches some reasonable part of some reasonable command. So I search for the holy grail called Algorithm of the Greatest Intersect of Word in Set of Words. Could some fresh bright mind drive me out of the cave of despair?

Example

"rotation" in ["notable","tattoo","onclick","statistically"]

should match tattoo because it has the longest intersect with rotation (tat_o). statistically is the second best (tati intersect), because longer part of the word needs to be ignored (but this is bonus condition, it would be acceptable without it).

Notes

  • I use Czech language where the pronunciation is very close to its written form
  • javascript is the preffered language, but any pseudocode is acceptable
  • the minimal length of the intersect should be a parameter of the algorithm

What have I tried?

Well, it is pretty embarassing....

for(var i=10; i>=4; --i) // reasonable substring
for(var word in words) // for all words in the set
for(var j=0; j<word.length-i; ++j) // search for any i substring
// aaargh... three levels of abstraction is too much for me
Jan Turoň
  • 31,451
  • 23
  • 125
  • 169
  • 1
    Is this the same as the [edit distance](http://en.wikipedia.org/wiki/Edit_distance) but with deletions weighted at 0? – Mike Samuel May 30 '13 at 18:09
  • If by distance you mean the number of different letters in chunk, then yes. – Jan Turoň May 30 '13 at 18:11
  • @JanTuron, by "edit distance", I mean the widely-used algorithm with that name. – Mike Samuel May 30 '13 at 18:12
  • Just wikipedied it... I guess the edit distance could be helpful, but I still have a headache. – Jan Turoň May 30 '13 at 18:17
  • 2
    The Levenshtein distance algorithm is what you're looking for. [Here](http://en.wikipedia.org/wiki/Levenshtein_distance) it is in Wikipedia with a couple of sample versions in C. – Jason Nichols May 30 '13 at 18:18
  • 1
    also see: http://en.wikipedia.org/wiki/Fuzzy_string_searching To be clear, are you concerned with the speed of your algorithm or the accuracy? – Andrew W May 30 '13 at 18:20
  • @AndrewW the brute force (O(n^3)) would be enough, I have less than 100 commands. However, I still have troubles putting it together. Maybe I'm just tired right now, I guess I can use Jason's link later. – Jan Turoň May 30 '13 at 18:24
  • @JanTuroň, I just want you to know, you've obliterated productivity for me for like the next week. I just did a tuple match scoring algorithm, combined it with Levenshtein distance, and am now modifying to account for phonetic similarities. I have absolutely no use for this code, but I'm having way too much fun with it. :) – Jason Nichols May 31 '13 at 15:26
  • @JasonNichols I've just made voice controlled program menu: in lucky Czech there's no difference between phonetic and literal similarities, so it would be handy to have english version of the algorithm. So thanks for your passionate effort, I hope you'll eventually use it, too. – Jan Turoň May 31 '13 at 16:00
  • Google cloud api provides alternative matches with probabilities on streaming audio, see here https://cloud.google.com/speech-to-text/docs/word-confidence – mosh Aug 29 '18 at 22:04

3 Answers3

2

This is an algorithm that seems to work. I have no idea how good it performs compared to other already established algorithms (I suspect it perform worse) but maybe it gives you an idea how you could do it:

FIDDLE

var minInt = 3;
var arr = ["notable","tattoo","onclick","statistically"];
var word = "rotation";

var res = [];
if (word.length >= minInt) {
    for (var i = 0; i < arr.length; i++) {
        var comp = arr[i];
        var m = 0;
        if (comp.length >= minInt) {
            for (var l = 0; l < comp.length - minInt + word.length - minInt + 1; l++) {
                var subcomp = l > word.length - minInt ? comp.substring(l - word.length + minInt) : comp;
                var subword = l < word.length - minInt ? word.substring(word.length - minInt - l) : word;
                var minL = Math.min(subcomp.length, subword.length);
                var matches = 0;
                for (var k = 0; k < minL; k++) {
                    if (subcomp[k] === subword[k]) {
                        matches++;
                    }
                }
                if (matches > m) {
                    m = matches;
                }
            }
        }
        res[i] = m >= minInt ? m : null;
    }
}

console.log(res);

What happens is, that it compares the two strings by "moving" on against the other and calculates the matching letters in each position. Here you see the compared "sub"words for rotation vs. notable:

ion / notable --> one match on index 1
tion / notable --> no match
ation / notable --> no match
tation / notable --> one match on index 2
otation / notable --> no match
rotation / notable --> three matches on index 1,2,3
rotation / otable --> no match
rotation / table --> no match
rotation / able --> no match
rotation / ble  --> no match

As you see, the maximum number of matches is 3 and that is what it would return.

basilikum
  • 10,378
  • 5
  • 45
  • 58
1

Here's an implementation of a Levenshtein Distance Calculator in Javascript.

It returns an object containing the matching command and distance.

var commandArr = ["cat", "dog", "fish", "copy", "delete"]
var testCommand = "bopy";

function closestMatch(str, arr)
{
    //console.log("match called");
    var matchDist = [];
    var min, pos;
    for(var i=0; i<arr.length; i++)
    {
        matchDist[i]=calcLevDist(str, arr[i]);
        console.log("Testing "+ str + " against " + arr[i]);
    }
    //http://stackoverflow.com/questions/5442109/how-to-get-the-min-elements-inside-an-array-in-javascript
    min = Math.min.apply(null,matchDist);
    pos = matchDist.indexOf(min);

    var output = { match : arr[pos],
                  distance : matchDist[pos]
                 };

    return output;
}

function calcLevDist (str1, str2)
{
    //console.log("calc running");

    var cost = 0 , len1, len2;
    var x = 1;
    while(x > 0)
    {
        len1 = str1.length;
        console.log("Length of String 1 = " + len1);
        len2 = str2.length;
        console.log("Length of String 2 = " + len2);

        if(len1 == 0)
        {
            cost+= len2;
            return cost;
        }
        if(len2 == 0)
        {    
            cost+= len1;
            return cost;
        }
        x = Math.min(len1,len2);

        if(str1.charAt(len1 -1) != str2.charAt(len2 -1))
        {
            cost++;
        }
        else
            console.log(str1.charAt(len1-1) + " matches " + str2.charAt(len2-1));

        str1 = str1.substring(0, len1 -1 );
        str2 = str2.substring(0, len2 -1 );


        console.log("Current Cost = " + cost);
   }


}

var matchObj = closestMatch(testCommand, commandArr);
var match = matchObj["match"];
var dist = matchObj["distance"];

$("#result").html("Closest match to " + testCommand + " = " + match + " with a Lev Distance of " + dist + "." )

You can mess around with the fiddle here.

Jason Nichols
  • 3,739
  • 3
  • 29
  • 49
0

Thank you basilikum and JasonNichols and also Mike and Andrew for the comments, it really helped me to finish the algorithm. I come up with my own brute force O(n^3) solution in case someone runs into this question with the same problem.

Anyone is invited to play with the fiddle to improve it.

The algorithm

/**
 * Fuzzy match for word in array of strings with given accurancy
 * @param string needle word to search
 * @param int accurancy minimum matching characters
 * @param array haystack array of strings to examine
 * @return string matching word or undefined if none is found
 */
function fuzzyMatch(needle,accurancy,haystack) {
  function strcmpshift(a,b,shift) {
    var match=0, len=Math.min(a.length,b.length);
    for(var i in a) if(a[i]==b[+i+shift]) ++match;
    return match;
  }
  function strcmp(a,b) {
    for(var i=0,max=0,now; i<b.length; ++i) {
      now = strcmpshift(a,b,i);
      if(now>max) max = now;
    }
    return max;
  }

  var word,best=accurancy-1,step,item;
  for(var i in haystack) {
    item = haystack[i];
    step = Math.max(strcmp(item,needle),strcmp(needle,item));
    if(step<=best) continue;
    best=step, word=item;
  };
  return word;
}

Example

var word = "rotation";
var commands = ["notable","tattoo","onclick","statistically"];
// find the closest command with at least 3 matching characters
var command = fuzzyMatch(word,3,commands);
alert(command); // tattoo
Jan Turoň
  • 31,451
  • 23
  • 125
  • 169