2

Is there a standard or best algorithm to make a given set of strings prefix-free? That is, given a set of strings, throw out all strings that have a (shorter) prefix also in that set.

In case it matters, I'm ultimately gonna implement this in Python 2.7.

Quinn Culver
  • 177
  • 7

2 Answers2

5
strings = ['a', 'apple', 'b', 'beta', 'c', 'd']

def prefices_only(strlist):
    ordered = sorted(strlist)
    last = ordered[0]
    results = [last]

    for c in ordered:
        if not c.startswith(last):
            last = c
            results.append(c)

    return results

print(prefices_only(strings))
j_random_hacker
  • 50,331
  • 10
  • 105
  • 169
aghast
  • 14,785
  • 3
  • 24
  • 56
  • 1
    @j_random_hacker: [No it doesn't.](http://ideone.com/sbGWFk) You're forgetting that `last` isn't updated when a string is discarded. The key insight is that if `ordered[i]` is a prefix of `ordered[j]`, then all elements of `ordered[i+1:j-1]` have `ordered[i]` as a prefix too. Also, the `startswith` comparisons examine each letter of the input at most twice, so calling it quadratic time is misleading. If that's O(n^2), then O(n*log(n)) time is impossible; it only seems possible because you're assigning too low a cost to the comparisons involved in a sort. – user2357112 Jan 25 '16 at 03:28
  • (Sorry - assigning too low a cost to trie insertion.) – user2357112 Jan 25 '16 at 03:42
  • @user2357112: You're right (and I'm wrong) on the two important things -- the algorithm here is correct, and `startsWith` only does a linear amount of work. In a second I'll edit and +1. OTOH I don't claim O(n log n) time for my algorithm, which is obviously impossible -- I claim time proportional to the total size of the input (which, perhaps confusingly, I didn't name with any "letter" variable), unless O(n log n) is already worse than that. (n is just the number of strings in my answer.) – j_random_hacker Jan 25 '16 at 03:49
  • Isn't `last` always the same as `results[-1]`, so that you don't need it? If you initialize with `results = ordered[:1]`, then you also don't crash for empty input like you do now. – Stefan Pochmann Jan 25 '16 at 04:35
3

[EDIT: Discard strings that have (not that are) prefixes]

  1. Sort the strings in increasing length order.
  2. Insert each string into a trie. If the insertion of a character would create a new child node for a currently childless (i.e., leaf) node, discard the current string -- it has a prefix.

[EDIT: Fixed time complexity]

The first step takes O(n log n) time to sort the n strings. If the average string length exceeds log(n), then this time complexity is dominated by the second step, which takes time (and space) linear in the total size of all the input strings. It's pretty easy to implement too.

j_random_hacker
  • 50,331
  • 10
  • 105
  • 169
  • Never heard of a 'trie' before, gonna have to read about it. And it looks like your algorithm is discarding the prefix instead of the extension; I want to keep the prefix (the shorter one). – Quinn Culver Jan 25 '16 at 03:02
  • @QuinnCulver: Whoops, I've fixed the algorithm now. Also added a link for "trie". – j_random_hacker Jan 25 '16 at 03:14
  • 2
    It seems like the sort step isn't really necessary. If a trie insertion ends on an internal node, we can just discard the node's descendants and make it a leaf. – user2357112 Jan 25 '16 at 03:16
  • @StefanPochmann: The number of strings. ("to sort the n strings") – j_random_hacker Jan 25 '16 at 03:21
  • @user2357112: Good idea! That would make for a simple and asymptotically faster (O(total size)) two-pass algorithm. I won't update my answer further as it's already mutated quite a bit, but would happily upvote your suggestion if you make it an answer. – j_random_hacker Jan 25 '16 at 03:24
  • @StefanPochmann: I don't actually claim that it's O(n log n) overall -- it's actually O(max(n log n, total_input_size)). (Maybe it's confusing because I didn't give the total input size a short variable name like 'm'...) Does that help? – j_random_hacker Jan 25 '16 at 03:27
  • @StefanPochmann: That would depend on how the strings are represented internally; I'd assumed that Python, as a GC-based language, would record string lengths explicitly, allowing it to be determined in O(1). If they're actually C-style strings, you're quite right -- you would then need to read each character just to determine the length of each string. (In this case you could also make the strings quadratically, ..., exponentially, factorially etc. long in terms of n.) – j_random_hacker Jan 25 '16 at 03:36
  • Sorry, I'm an idiot, I didn't see you're just sorting by length. I'll delete my comments and go to bed... (is 5am an excuse?) – Stefan Pochmann Jan 25 '16 at 04:13
  • I would argue that a trie is a bad choice for this, simply because the target behavior is exactly pessimal for tries - no common prefixes. This means that all the overhead of maintaining and linking separate nodes is being expended on exactly one follower - no amortization available. – aghast Jan 25 '16 at 04:19
  • @AustinHastings: Not sure what you mean... Strings can share common prefixes -- e.g. adding `abce` after `abcd` will only create 1 new node because `abc` is already in the trie. You probably need either hash tables or fixed-length arrays of alphabetSize (256?) integers for maintaining children; the latter will be very fast, since a block of nTotalChars of these arrays could be preallocated, and adding a node is then just an integer write plus an integer increment of nextFreeNodePtr, but admittedly this requires a lot of space. – j_random_hacker Jan 25 '16 at 04:37