I am working with Trie data structure and am trying to find:
- the most frequent prefix of length n
- the most frequent prefix of length n or more
There is a similar post but no code had been provided. Possible solution which was suggested in the mentioned post, that I am unable to implement correctly:
Adding a counter for each trie node and incrementing the value when the node is used, then scanning the nodes to examine where the most frequent prefix is.
I am able to count the usage of each node, but I don't know how to examine which prefix is the most frequent one and then how to build this prefix out of the nodes. Question: How do I find the most frequent prefix of length n (or >= n) based on the counter and how to print such a prefix?
My trie has thousands of words but for demo purpose we can stick to a set of words:
Trie t = new Trie();
t.insert("test");
t.insert("testify");
t.insert("television");
t.insert("car");
t.insert("cat");
t.insert("cabin");
t.insert("cab");
findMostFrequentPrefix(t.getRoot(), 2);
// expected output for provided data set:
// 1) returns "ca" for exact length of 2 // "ca" appears 4 times and "te" appears 3 times
// 2) (not exactly sure about this) returns "ca" (appears 4 times, the second most frequent "cab" appears only 2 times) for length 2 or more, however if there was an additional most frequent prefix e.g. "cad" (5 times appearance), the function should return this one, instead of "ca"
Code for Trie, Trie node:
public class TrieNode {
TrieNode[] children;
boolean isEnd;
char value;
int counter;
public TrieNode() {
this.children = new TrieNode[26];
}
public TrieNode getNode(char ch) {
if (this.children.length == 0) {
return null;
}
for (TrieNode child : this.children) {
if (child != null)
if (child.value == ch)
return child;
}
return null;
}
public TrieNode[] getChildren() {
return children;
}
public char getValue() {
return value;
}
}
public class Trie {
private TrieNode root;
public Trie() {
root = new TrieNode();
}
public TrieNode getRoot() {
return this.root;
}
// Inserts a word into the trie.
public void insert(String word) {
TrieNode p = root;
for (int i = 0; i < word.length(); i++) {
char c = word.charAt(i);
int index = c - 'a';
// p.counter++ incrementing the value which later will be used to examine the frequency
if (p.children[index] == null) {
TrieNode temp = new TrieNode();
p.children[index] = temp;
p.children[index].value = c;
p.counter++;
System.out.println("Counter: " + p.counter + " - for " + p.value);
p = temp;
} else {
p.counter++;
System.out.println("Counter: " + p.counter + " - for " + p.value);
p = p.children[index];
}
}
p.isEnd = true;
}
// Returns if the word is in the trie.
public boolean search(String word) {
TrieNode p = searchNode(word);
if (p == null) {
return false;
} else {
if (p.isEnd)
return true;
}
return false;
}
// Returns if there is any word in the trie
// that starts with the given prefix.
public boolean startsWith(String prefix) {
TrieNode p = searchNode(prefix);
if (p == null) {
return false;
} else {
return true;
}
}
public TrieNode searchNode(String s) {
TrieNode p = root;
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
int index = c - 'a';
if (p.children[index] != null) {
p = p.children[index];
} else {
return null;
}
}
if (p == root)
return null;
return p;
}
}
Method to traverse the trie and find the prefix of length of n (There should be an additional function to find the prefix of length >= n):
public static void findMostFrequentPrefix(TrieNode current, int n) {
for (TrieNode temp : current.children) {
if(temp != null) {
// logic to examine which counter value is the greatest
// and then (maybe?) follow the idea until the length of n of prefix is reached
System.out.println(temp.value);
findMostFrequentPrefix(temp, n);
}
}
}
Update: Clearer explanation of the second case (>= n):
Let's say we had a set of data:
test
testi
testif
testify
If I wanted to find the most common prefix of length n = 2
, I could call a function provided in the first answer: findMostFrequentPrefix(t.getRoot(), 2);
- it would return te
.
However in the second case I want to find the most frequent prefix whose length can be greater than or equal to n (>= n)
. Using the data provided above, if I called a new function findMostFrequentPrefixGreaterEqual(t.getRoot(), 1);
then the most frequent prefix can be:
1. n = 1 - "t" - max frequency = 4
2. n = 2 - "te" - max frequency = 4
3. n = 3 - "tes" - max frequency = 4
4. n = 4 - "test - max frequency = 4
This means that the frequency is the same. The function could indicate that there are the same frequencies but return the prefix whose length is the greatest, in this case it would be: test
.