algorithm to partition strings by prefix

Question

Given a list of strings L (sorted), and a positive integer N (N <= len(L)), how to efficiently partition L into no more than N groups by common prefix of length N?

Example: define data structure and function as below:

type PrefixGroup struct {
    Prefix string 
    Count  int
}
func partition(L []string, N int, prefix string) []PrefixGroup

The list L may contains thousands of strings, when called with

partition(L, 8, "")

the output may be:

[
    {"Prefix":"13", "Count":1000},
    {"Prefix":"180": "Count": 10},
    {"Prefix":"X": "Count": 2},
    ... ...
]

which means in L, there are 1000 strings start with "13", 10 start with "180" and 2 start with "X". Note that the length of prefix is not fixed. The key requirement of this algorithm is to partition strings with common prefix so that the number of groups is as close as but not exceeding N.

With the result above, I can then call partition(L, 8, "13") to further drill down subset of L that start with "13":

[
    {"Prefix":"131", "Count": 50},
    {"Prefix":"135": "Count": 100},
    {"Prefix":"136": "Count": 500},
    ... ...
]

This is not a homework question. I need to write such an algorithm for a project at hand. I can write it "brute-force"-ly, just wonder if there are any classic/well-known data structure and/or algorithm to achieve proven time/space efficiency.

I have considered trie, but wonder if it may consume too much memory...

It would help if you can specify your space/time requirement and data scale. At a scale of len(L)~1e3, there is no need to write any extra code. — leaf bebop, Jan 10 '18 at 05:11
do you mean "common prefix of length at most N" in the first sentence? — Petar Petrovic, Jan 10 '18 at 07:04
@Petar Petrovic, the "at most N" means at most N **groups**, not the prefix length. — xrfang, Jan 10 '18 at 07:27
@leaf bebop I will try brute force method first, and do some research on the answers provided here. — xrfang, Jan 10 '18 at 07:28
@xrfang so "groups by common prefix of length N" means each prefix must have length N? Then why do you have this requirement "the number of groups is as close as but not exceeding N" ? if the length of prefix is fixed, shouldn't number of groups is also fixed? — Petar Petrovic, Jan 10 '18 at 07:45
@Petar Petrovic **no**. N has nothing to do with prefix length, it is the maximum group to partition the list of strings. Also note that the prefix is a fixed thing, initially "" (empty string). It is used to "drill down" the list, when it is set, all strings not prefixed by it is ignored. Let's forget about the prefix parameter, just say it is the empty string. — xrfang, Jan 10 '18 at 08:53
@xrfang thanks but I am not sure I get it. So the goal is dividing the set of strings into at most N groups, and every string in group g must have prefix g_p. And more groups is better. is that correct? — Petar Petrovic, Jan 10 '18 at 09:43
@Petar Petrovic I am sorry that the "prefix" really confused you. Please forget about it. There is NO external given prefix to check, just say, you want to browse a dictionary, in lexical order. you don't want to list all words on a page, because that will be too much. You list the letter A-Z, i.e. 26 groups, which the number of words in each group: A(123), B(456)... Z(98). Then if you click on A, you get a list of words starting with A, AA(12), AB(34), ..... that's it. — xrfang, Jan 11 '18 at 02:08

score 1 · Answer 1 · answered Jan 10 '18 at 02:06

Well there are a few algorithms, but prefix tree is to way to go.

A prefix tree, or trie (often pronounced "try"), is a tree whose nodes don't hold keys, but rather, hold partial keys. For example, if you have a prefix tree that stores strings, then each node would be a character of a string. If you have a prefix tree that stores arrays, each node would be an element of that array. The elements are ordered from the root. So if you had a prefix tree with the word "hello" in it, then the root node would have a child "h," and the "h" node would have a child, "e," and the "e" node would have a child node "l," etc. The deepest node of a key would have some sort of boolean flag on it indicating that it is the terminal node of some key. (This is important because the last node of a key isn't always a leaf node... consider a prefix tree with "dog" and "doggy" in it). Prefix trees are good for looking up keys with a particular prefix.

score 1 · Accepted Answer · answered Jan 10 '18 at 05:11

1

You need to use a Radix trie. You can read about the difference between a trie and a Radix trie.

answered Jan 10 '18 at 05:11

Pavel Nikolov

9,401
5
43
55

algorithm to partition strings by prefix

2 Answers2