Find the most frequent prefix of length n in Trie

Question

I am working with Trie data structure and am trying to find:

the most frequent prefix of length n
the most frequent prefix of length n or more

There is a similar post but no code had been provided. Possible solution which was suggested in the mentioned post, that I am unable to implement correctly:

Adding a counter for each trie node and incrementing the value when the node is used, then scanning the nodes to examine where the most frequent prefix is.

I am able to count the usage of each node, but I don't know how to examine which prefix is the most frequent one and then how to build this prefix out of the nodes. Question: How do I find the most frequent prefix of length n (or >= n) based on the counter and how to print such a prefix?

My trie has thousands of words but for demo purpose we can stick to a set of words:

Trie t = new Trie();

t.insert("test");
t.insert("testify");
t.insert("television");
t.insert("car");
t.insert("cat");
t.insert("cabin");
t.insert("cab");

findMostFrequentPrefix(t.getRoot(), 2);

// expected output for provided data set: 
// 1) returns "ca" for exact length of 2 // "ca" appears 4 times and "te" appears 3 times
// 2) (not exactly sure about this) returns "ca" (appears 4 times, the second most frequent "cab" appears only 2 times) for length 2 or more, however if there was an additional most frequent prefix e.g. "cad" (5 times appearance), the function should return this one, instead of "ca"

Code for Trie, Trie node:

public class TrieNode {
    TrieNode[] children;
    boolean isEnd;
    char value;
    int counter;

    public TrieNode() {
        this.children = new TrieNode[26];
    }

    public TrieNode getNode(char ch) {
        if (this.children.length == 0) {
            return null;
        }
        for (TrieNode child : this.children) {
            if (child != null)
                if (child.value == ch)
                    return child;
        }
        return null;
    }

    public TrieNode[] getChildren() {
        return children;
    }

    public char getValue() {
        return value;
    }
}

public class Trie {
    private TrieNode root;

    public Trie() {
        root = new TrieNode();
    }

    public TrieNode getRoot() {
        return this.root;
    }

// Inserts a word into the trie.
    public void insert(String word) {
    TrieNode p = root;
    for (int i = 0; i < word.length(); i++) {
        char c = word.charAt(i);        
        int index = c - 'a';
        // p.counter++ incrementing the value which later will be used to examine the frequency
        if (p.children[index] == null) {
            TrieNode temp = new TrieNode();
            p.children[index] = temp;
            p.children[index].value = c;
            p.counter++;
            System.out.println("Counter: " + p.counter + " - for " + p.value);
            p = temp;
        } else {
            p.counter++;
            System.out.println("Counter: " + p.counter + " - for " + p.value);
            p = p.children[index];
        }
    }
    p.isEnd = true;
}

// Returns if the word is in the trie.
    public boolean search(String word) {
        TrieNode p = searchNode(word);
        if (p == null) {
            return false;
        } else {
            if (p.isEnd)
                return true;
        }
        return false;
    }

// Returns if there is any word in the trie
// that starts with the given prefix.
    public boolean startsWith(String prefix) {
        TrieNode p = searchNode(prefix);
        if (p == null) {
            return false;
        } else {
            return true;
        }
    }

    public TrieNode searchNode(String s) {
        TrieNode p = root;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            int index = c - 'a';
            if (p.children[index] != null) {
                p = p.children[index];
            } else {
                return null;
            }
        }
        if (p == root)
            return null;
        return p;
    }
}

Method to traverse the trie and find the prefix of length of n (There should be an additional function to find the prefix of length >= n):

public static void findMostFrequentPrefix(TrieNode current, int n) {
        for (TrieNode temp : current.children) {
            if(temp != null) {
                // logic to examine which counter value is the greatest  
                // and then (maybe?) follow the idea until the length of n of prefix is reached             
                System.out.println(temp.value);
                findMostFrequentPrefix(temp, n);
            }
        }
    }

Update: Clearer explanation of the second case (>= n):

Let's say we had a set of data:

test
testi
testif
testify

If I wanted to find the most common prefix of length n = 2, I could call a function provided in the first answer: findMostFrequentPrefix(t.getRoot(), 2); - it would return te.

However in the second case I want to find the most frequent prefix whose length can be greater than or equal to n (>= n). Using the data provided above, if I called a new function findMostFrequentPrefixGreaterEqual(t.getRoot(), 1); then the most frequent prefix can be:

 1. n = 1 - "t" - max frequency = 4
 2. n = 2 - "te" - max frequency = 4
 3. n = 3 - "tes" - max frequency = 4
 4. n = 4 - "test - max frequency = 4

This means that the frequency is the same. The function could indicate that there are the same frequencies but return the prefix whose length is the greatest, in this case it would be: test.

Expected output 2 is not clear. 1. How do you pass "2 or more" to the function. 2. Why does it output "cab" and "tes" when "ca" is occurring more times than those? — Emil M George, Apr 26 '20 at 16:04
Yeah, you're right , it should return "ca". However in that case I would like to find a prefix whose length is greater than or equal to `n`, so if there was a longer prefix than `n`, but it would be the most frequent one, then I would like to return exactly that prefix. I updated it. — kabugh, Apr 26 '20 at 16:20
You mean: if the words are `cat`, `cata` `cate` and you call `findMostFrequentPrefix(t.getRoot(), 2);` you should get `cat` instead of `ca`? But you should get `ca` if there was one more word `cax`? — Emil M George, Apr 26 '20 at 16:33
There should be 2 versions of a function as I tried to explain in the beginning: 1) if I call `findMostFrequentPrefix(t.getRoot(), 2);` - the returned value should have the length of passed param (in this case `n = 2`) - `ca`. 2) If I call `findMostFrequentPrefixGreaterOrEqual(t.getRoot(), 2);` - the function should return a prefix whose length can be `n >= 2`. So it should return `cat` if the possible prefix is greater than 2, if not then it should return `ca`. — kabugh, Apr 26 '20 at 16:40

RaffleBuffle · Accepted Answer · 2020-04-26T20:37:17.573

Firstly, I think you have a small but significant bug in your code tracking the usage of each TrieNode. If you examine the value of counter for the TrieNode representing the b in cab you'll see that it's 1, when it should be 2, for:

cab
cabin

You need to record the usage of nodes representing the last character in a word by updating the code at the end of your insert method to be:

p.counter++;
p.isEnd = true;

With that update, you can get the most frequent prefixes (there may be more than one) with something like this:

public static int findMostFrequentPrefix(TrieNode current, int n, int max, char[] curr, int depth,
        List<Prefix> result)
{
    if (n == 0)
    {
        if (current.counter >= max)
        {
            if (current.counter > max)
            {
                result.clear();
                max = current.counter;
            }
            result.add(new Prefix(max, String.valueOf(curr)));
        }
        return max;
    }

    for (TrieNode child : current.children)
    {
        if (child != null)
        {
            curr[depth] = child.value;
            max = Math.max(max, findMostFrequentPrefix(child, n - 1, max, curr, depth + 1, result));
        }
    }
    return max;
}

Using a small PrefIx helper class:

static class Prefix
{
    int freq;
    String value;
    public Prefix(int freq, String value)
    {
        this.freq = freq;
        this.value = value;
    }
}

Test:

int len = 3;
List<Prefix> result = new ArrayList<>();
int max = findMostFrequentPrefix(t.getRoot(), len, 0, new char[len], 0, result);
System.out.format("max frequency: %d%n", max);
for(Prefix p : result)
    System.out.println(p.value);

Output:

max frequency: 2
cab
tes

Here's an attempt at implementing your 2nd method, findMostFrequentPrefixGreaterEqual. You'll have to test and confirm that it returns expected results in all cases:

public static int findMostFrequentPrefixGreaterEqual(TrieNode current, int n, int max, char[] curr, int depth,
        List<Prefix> result)
{
    if (n <= 0)
    {
        if (current.counter >= max)
        {
            result.clear();
            max = current.counter;
            result.add(new Prefix(max, String.valueOf(curr, 0, depth)));
        }
        else
        {
            return max;
        }
    }

    for (TrieNode child : current.children)
    {
        if (child != null)
        {
            curr[depth] = child.value;
            max = Math.max(max, findMostFrequentPrefixGreaterEqual(child, n - 1, max, curr, depth + 1, result));
        }
    }
    return max;
}

Test:

t.insert("test");
t.insert("testi");
t.insert("testif");
t.insert("testify");

int len = 2;
int maxLen = 7;
List<Prefix> result = new ArrayList<>();
int max = findMostFrequentPrefixGreaterEqual(t.getRoot(), len, 0, new char[maxLen], 0, result);
System.out.format("max frequency: %d%n", max);
for(Prefix p : result)
    System.out.println(p.value);

Note that we now have to know the max length of any words in the Trie in order to allocate enough space in the char array.

Output:

max frequency: 4
test

Thank you very much, that's exactly what I needed and it's clearly explained. Now I will try to implement the 2nd case. — kabugh, Apr 26 '20 at 17:22
I've tried to implement the second case but unfortunately it doesn't work as expected. If you had time, could you possibly have a look at it? I described it the best I could at the end of the question section. — kabugh, Apr 26 '20 at 18:13
Most common prefix with length `n` is also the most common prefix with length `n or more`. — asds_asds, Apr 26 '20 at 18:13

Find the most frequent prefix of length n in Trie

1 Answers1