How can I read this Radix Tree structure to determine next-string probability?

Question

In JavaScript, I am attempting to take a given user input and guess the 3 most likely words that might complete the user's currently (incomplete) typed word. The guess is based on the user's past inputs. I'm working on this here, in this JSFiddle.

The structure I build to record the user's past inputs is a modified Radix Tree (AKA Patricia Trie):

Input: "hey"

{
    "h": {
        "value": "h",
        "count": 1,
        "followables": {
            "e": {
                "value": "e",
                "count": 1,
                "followables": {
                    "y": {
                        "value": "y",
                        "count": 1,
                        "followables": {}
                    }
                }
            }
        }
    }
}

This data structure is built perfectly, and I think it's the best structure to achieve the goal described. My problem is the function for reading the Radix Tree data to define the 3 most likely words for a given input. In the above data for example, if the user inputs "h", the guessing function should return an object like this:

guess : {
   1 : "hey",
   2 : "",
   3 : ""
}

So here's my code/progress:

Learn - take completed input string and organize the combination into the Radix Tree (brain):

function learn(message, brain) {
    if (message.length == 0) return {}; // or do something else
    var ch = message[0]; // get the first character
    if (!brain[ch]) { // create new node when not exists
        brain[ch] = {
            value: ch,
            count: 1,
            followables: {}
        };
    } else { // increment count when exist
        brain[ch].count += 1;
    }
    var substr = message.substring(1); // remove first character
    if (substr) { // do it for the remaining substring
        brain[ch].followables = learn(substr, brain[ch].followables);
    } else {
        renderData();
    }
    return brain;
}

That's all done right. Unfortunately, the next code, meant to read the data and guess the word that the user is typing, is not good. I'm having trouble with what's to me a very complex function. I've divided it into small functions, as I've learned is the best practice, but I'm afraid I've made a mess that could probably be much simpler:

Guess - take the "learned" string data and make 3 guesses at which word the user might be typing:

function guess(progress, brain) {
    console.log("Guessing based on: " + progress);
    var guesses = {
        0: "",
        1: "",
        2: ""
    }
    var firstChar = progress[0]; 
    if (brain[firstChar]) {
        var step = brain[firstChar];
        for (var i = 0; i < progress.length; i++) {
            var char = progress[i];
            if (step.followables[char]) {
                step = step.followables[char];
                if (i == progress.length) {
                    var guesses = nextStrings(step.followables);
                    renderGuesses(guesses);
                }
            } else {
                renderGuesses(guesses);
            }
        }
    } else {
        renderGuesses(guesses);
    }
}

function renderGuesses(guesses) {
    console.log(guesses);
    $('#guess-1').text(guesses[0]);
    $('#guess-2').text(guesses[1]);
    $('#guess-3').text(guesses[2]);
}

function nextStrings(followables) {
    console.log('Searching for next string...');
    var results;
    if (followables.length > 0) {
        results = chooseRoutes(followables);
    } else {
        results = {
            0: "",
            1: "",
            2: ""
        }
    }
    console.log(result);
    return result;
}

function chooseRoutes(followables) {
    var results = {
        0: {
            value: "",
            count: 0
        },
        1: {
            value: "",
            count: 0
        },
        2: {
            value: "",
            count: 0
        }
    };
    for (var i = 0; i < followables.length; i++) {
        var count = followables[i].count;
        if (count > results[0].count) {
            results[0].value = followStr(followables[i], "");
        } else if (count > results[1].count) {
            results[1].value = followStr(followables[i], "");
        } else if (count > results[2].count) {
            results[2].value = followStr(followables[i], "");
        }
    }
    console.log(results);
    return results;
}

function followStr(followables, str) {
    var guess = {
        value: "",
        count: 0
    };
    for (var i = 0; i < followables.length; i++) {
        if (followables[i].count > guess.count) {
            guess = followables[i];
        }
    }
    followables = guess.followables;
    if (guess.value != " ") {
        str += guess;
        followStr(followables, str);
    } else {
        console.log(str);
        return str;
    }
}

Sidenote - While a fuzzy string search made on a dictionary is a more common approach to this, the learning method is a great way to tailor guesses to the user's writing/messaging style and supporting the user's non-standard vocabulary ("heyy", "sup", ":P", "lol") - the results of these guesses can be combined with (and take priority over) standard dictionary results.

As a subjective aside, I'm actually a bit frustrated at my inability to build this `guess` function. Is the function / overall goal actually advanced/difficult, or do I just need more practice? I've been studying JS for my past 2-3 years of highschool and I'm curious to know whether at this point I should be able to breeze through something like this... — , Feb 16 '15 at 04:30
This is not a particularly easy algorithm to implement, but take a look at http://stackoverflow.com/questions/2901831/algorithm-for-autocomplete You'd probably find it much easier though if you had a database of words to match though. — Qantas 94 Heavy, Feb 16 '15 at 04:36
@Qantas94Heavy sure, I would likely want to combine this with a typical fuzzy string search based on a dictionary in a production implementation. My intent here is to focus on the user's input, which is good at guessing common-but-incorrect strings like "u", ":P", "<3", "sup", "heyy", "bro", etc. Dictionary searches would be used in production to fill in where no combinations exist for a guess. — , Feb 16 '15 at 04:47
Um, your example tree is not an actual trie, it's not compressed? And what exactly does that `.count` represent? — Bergi, Feb 19 '15 at 00:21
@Bergi Perhaps not - I haven't looked much at the fundamental idea behind a trie, but it is structured like one, and follows the same idea, minus, perhaps, the compression you're talking about. And `count` represents the number of times that the combination has been recorded, thus recording probability. — , Feb 19 '15 at 05:25
@MediaWebDev: But then the count should only be stored in the leaves? The tree you've shown represents `1x h, 1x he, 1x hey` — Bergi, Feb 19 '15 at 09:47
@Bergi nope, you missed one factor in this: How does one check the probability of the different possibilities without going down the entire tree top check all word probabilities. By storing the probabilities for all combinations, when the user enters his first letter`h`, we can immediately follow the tree to know his second letter is most likely `e`, without going down the entire branch and checking all `h` words to finally find out that the string `hello` is most likely. — , Feb 19 '15 at 13:55
@Bergi see? when `h` is typed, of 100 different combinations that might come after it, we can immediately know to follow the `e` `a` `o` branches for our 3 guesses only, because those probabilities are stored at each level. — , Feb 19 '15 at 13:58
Um, so you only want to get a guess for the next letter or a guess for the whole word? — Bergi, Feb 19 '15 at 17:52
@Bergi based on the current letters typed, I want to guess the full intended "word" (again, it might not be a real word). Based on your suggestion of storing the count only at the full words, in order to check the probability of those words, we must search the entire tree up to the point of the first space character in each branch. That just doesn't make sense to me. Perhaps you could suggest your approach in an answer, I may misunderstand your intention. — , Feb 19 '15 at 19:43
@MediaWebDev: OK; so storing the maxima of the subtree on its root is fine. Still I'd expect some extra count value that is only on the (leaf) nodes, so that you can distinguish the counts of "hi", "high" and "hifi". — Bergi, Feb 19 '15 at 19:48

skobaljic · Accepted Answer · 2015-02-19T15:24:30.837

The structure you used for dictionary is not correct, it should contain array(s) of objects. For example, after you enter these words:

hi
hi
hi
hi
hi
hey
hello
hella

the structure should be:

history: [{
    letter: "h",
    count: 8,
    followables: [{
        letter: "e",
        count: 3,
        followables: [{
            letter: "y",
            count: 1,
            followables: []
        }, {
            letter: "l",
            count: 2,
            followables: [{
                letter: "l",
                count: 2,
                followables: [{
                    letter: "o",
                    count: 1,
                    followables: []
                }, {
                    letter: "a",
                    count: 1,
                    followables: []
                }]
            }]
        }]
    }, {
        letter: "i",
        count: 5,
        followables: []
    }]
}]

The way you create and store the history (I would use localStorage) was not interesting to me. Focus was on recursive functions that dig deep inside the tree to get suggestions. This one gets final followables for given word:

findChildren: function (node, depth) {
    /* Traverse history for current word, there is only one path */
    for (k in node) {
        if (node[k].letter == app.progress[depth]) {
            if (depth + 1 == app.progress.length) {
                /* Found it, hope it has followables */
                return node[k].followables;
            } else {
                /* Must go deeper... */
                return app.findChildren(node[k].followables, depth + 1);
            };
        };
    };
    /* No results */
    return false;
}

and second one (getWord) creates suggestions:

countWordsFromNode: function (node) {
    for (i in node) {
        if (node[i].followables.length) {
            app.countWordsFromNode(node[i].followables);
        } else {
            app.totalInNode++;
        };
    };
},
getWord: function (node, limit) {
    /* First sort by count */
    var sorted = node.sort(function (n1, n2) {
        return n2.count - n1.count;
    });
    for (k in sorted) {
        app.guesses[app.totalFound].word += sorted[k].letter;
        if (sorted[k].followables.length) {
            app.totalInNode = 0;
            app.countWordsFromNode(sorted[k].followables);
            for (m = 1; m < app.totalInNode; m++) {
                if ((app.totalFound + m) < limit) {
                    app.guesses[app.totalFound + m].word += sorted[k].letter;
                };
            };
            app.getWord(sorted[k].followables, limit);
        } else {
            /* End of word */
            app.totalFound++;
        };
        if (app.totalFound >= limit) {
            /* Found all suggestions */
            break;
        };
    };
}

You can see details in this Fiddle, I had to remove some code of yours. It will be easy to integrate it anywhere, for example you can set other suggestion fields, current are:

guesses: [
    {
        element: $('#guess-1'),
        word: ''
    },
    {
        element: $('#guess-2'),
        word: ''
    },
    {
        element: $('#guess-3'),
        word: ''
    }
]

Edit: Fixed bug with adding letters to the right.

Can you explain why arrays are used rather than objects? It seems to me that you must 'brute force' search the arrays now instead of taking advantage of the fact that objects are organized into hash maps. I just don't understand why, when by storing all items as object properties which are accessed instantly, the lookups would seem to be exponentially more eficient. This data structure could be potentially massive. Perhaps I'm wrong, I've only been at this for a couple of years, so feel free to correct me. — , Feb 19 '15 at 00:22
If you compare the new structure [here](http://jsfiddle.net/c38vsegs/59/) of same data you had, you will see it is better because it is not redundant and it is more intuitive: after one letter you expect an array of letters. Further more, the javascript `sort` is crucial to get most frequent letter in that array - without it you would have to create new one (till now I haven't seen sorting objects without arrays). — skobaljic, Feb 19 '15 at 00:50
[I decided to ask a question about this on Code Review.](http://codereview.stackexchange.com/questions/81883/which-of-these-two-implementations-of-building-this-modified-radix-tree-is-more) — , Feb 19 '15 at 02:05
Oh and by the way, thank you so much for your effort - your method might be perfect - it certainly *works* (+1). I'm simply interested in learning whether or not this solution is efficient. — , Feb 19 '15 at 05:32
I can find/create an error in my algorithm, will have to fix it. — skobaljic, Feb 19 '15 at 12:56
I fixed the only bug I could find. Hope someone can optimize this code (and add a storage function), it may be very useful. — skobaljic, Feb 19 '15 at 15:26
@skobaljic: I'm a little fuzzy on what you mean by "not correct." — Robert Harvey, Feb 20 '15 at 02:32
It is incorrect in sense that it does not provide efficiency of data structure. You cannot observe the structure itself only, because it is not independent and we need to create methods and get relevant data from it. I used arrays because I knew I had to sort it, but if the `history` were sorted first place, than we could use some other form of data structure. — skobaljic, Feb 20 '15 at 09:10

How can I read this Radix Tree structure to determine next-string probability?

1 Answers1