Complexity analysis of a solution to minimizing concat cost

Question

This is about analyzing the complexity of a solution to a popular interview problem.

Problem

There is a function concat(str1, str2) that concatenates two strings. The cost of the function is measured by the lengths of the two input strings len(str1) + len(str2). Implement concat_all(strs) that concatenates a list of strings using only the concat(str1, str2) function. The goal is to minimize the total concat cost.

Warnings

Usually in practice, you would be very cautious about concatenating pairs of strings in a loop. Some good explanations can be found here and here. In reality, I have witnessed a severity-1 accident caused by such code. Warnings aside, let's say this is an interview problem. What's really interesting to me is the complexity analysis around the various solutions.

You can pause here if you would like to think about the problem. I am gonna reveal some solutions below.

Solutions

Naive solution. Loop through the list and concatenate def concat_all(strs): result = '' for str in strs: result = concat(result, str) return result
Min-heap solution. The idea is to concatenate shorter strings first. Maintain a min-heap of the strings based on the length of the strings. Each concatenation concatenates 2 strings off the min-heap and the result is pushed back the min-heap. Until only one string is left on the heap. def concat_all(strs): heap = MinHeap(strs, key_func=len) while len(heap) > 1: str1 = heap.pop() str2 = heap.pop() heap.push(concat(str1, str2)) return heap.pop()
Binary concat. May not be intuitively clear. But another good solution is to recursively split the list by half and concatenate. def concat_all(strs): if len(strs) == 1: return strs[0] if len(strs) == 2: return concat(strs[0], strs[1]) mid = len(strs) // 2 str1 = concat_all(strs[:mid]) str2 = concat_all(strs[mid:]) return concat(str1, str2)

Complexity

What I am really struggling and asking here is the complexity of the 2nd approach above that uses a min-heap.

Let's say the number of strings in the list is n and the total number of characters in all the strings is m. The upper bound for the naive solution is O(mn). The binary-concat has an exact bound of theta(mlog(n)). It is the min-heap approach that is elusive to me.

I am kind of guessing it has an upper bound of O(mlog(n) + nlog(n)). The second term, nlog(n) is associated with maintaining the heap; there are n concats and each concat updates the heap in log(n). If we only focus on the cost of concatenations and ignore the cost of maintaining the min-heap, the overall complexity of the min-heap approach can be reduced to O(mlog(n)). Then min-heap is a more optimal approach than binary-concat cause for the former mlog(n) is the upper bound while for the latter it is the exact bound.

But I can't seem to prove it, or even find a good intuition to support that guessed upper bound. Can the upper bound be even lower than O(mlog(n))?

The minheap solution may reorder the input strings, which is (in the typical version of the problem) not allowed. None of the solutions here minimize the cost of concatenations (if reordering isn't allowed). — Paul Hankin, Apr 19 '18 at 18:37
The binary-concat does not re-order the strings and has a tight bound of `mlog(n)`. — neurite, Apr 19 '18 at 20:56
Yes, but the question asks to minimize total cost, not asymptotic complexity. For example, the optimal solution for strings of length 1, 2, 3, 4, 5 is 35 (`((s[0]+(s[1]+s[2]))+(s[3]+s[4]))`) but `concat_binary` generates a cost of 39 (`((s[0]+s[1])+(s[2]+(s[3]+s[4])))`). — Paul Hankin, Apr 20 '18 at 08:20
I did a search, and I saw similar answers using minheap and binary concat in various forums, but it looks like most of these solutions are copy-pasted from someone else's flawed solution. I mean -- the minheap solution is clearly wrong because it reorders the strings and so is functionally incorrect, and it seems unlikely to me multiple people independently make the same mistake. — Paul Hankin, Apr 20 '18 at 08:26
I made a mistake: optimal for strings of length 1, 2, 3, 4, 5 is cost 33 (`(((s[0]+s[1])+s[2])+(s[3]+s[4]))`) — Paul Hankin, Apr 20 '18 at 08:32
Here's code that finds the optimal solution (and compares it to concat_binary). The code generates parse trees so that one can see the solution, which makes things a little more complex than simply finding the optimal cost, but I hope it makes sense. https://gist.github.com/paulhankin/c822847d54d73384b7dab4b26713cddb#file-concat-py — Paul Hankin, Apr 20 '18 at 08:43

score 1 · Accepted Answer · edited Jun 20 '20 at 09:12

1

Let us call the length of strings 1 to n and m be the sum of all these values.

For the naive solution, clearly the worst appears if m1 is almost equal to m, and you obtain a O(nm) complexity, as you pointed.
For the min-heap, the worst-case is a bit different, it consists in having the same length for any string. In that case, it's going to work exactly as your case 3. of binary concat, but you'll also have to maintain the min-heap structure. So yes, it will be a bit more costly than case 3 in real-life. Nevertheless, from a complexity point of view, both will be in O(m log n) since we have m > n and O(m log n + n log n)can be reduced to O(m log n).

To prove the min-heap complexity more rigorously, we can prove that when we take the two smallest strings in a set of k strings, and denote by S the sum of the two smallest strings, then we have: (m-S)/(k-1) >= S/2 (it simply means that the mean of the two smallest strings is less than the mean of the k-2 other strings). Reformulating leads to S <= 2m/(k+1). Let us apply it to the min-heap algorithm:

at first step, we can show that the 2 strings we take are of total length at most 2m/(n+1)
at first step, we can show that the 2 strings we take are of total length at most 2m/(n)
...
at last step, we can show that the 2 strings we take are of total length at most 2m/(1)

Hence the computation time of min-heap is 2m*[1/(n+1) + 1/n + ... + 1/2 + 1] which is in O(m log n)

edited Jun 20 '20 at 09:12

Community

1
1

answered Apr 19 '18 at 18:29

Damien Prot

573
3
7

I have thought about the exact, same worst-case for the min-heap solution too. Why is that the worst-case for min-heap? I am wondering how to prove or persuade another person that it is indeed the worst-case for the min-heap solution. – neurite Apr 19 '18 at 18:34
If we consider two instances of the problem with the same value of m and n, one instance I having different values for m(1) to m(n) and the other instance I' having all values equal, we can prove that each concat cannot be more expensive in I than in I', hence leading to I' being the worst-case – Damien Prot Apr 19 '18 at 19:34
I understand we need prove that. How to (rigorously) prove it is what I am looking for:) – neurite Apr 19 '18 at 21:02
@neurite I edited my answer, I hope it is enough rigorous :) – Damien Prot Apr 20 '18 at 06:28
Bravo! Accepted. For those of you who don't immediately connect the harmonic series to log of n, like me, here is [a good explanation](https://stackoverflow.com/questions/25905118/finding-big-o-of-the-harmonic-series). – neurite Apr 20 '18 at 07:26

Complexity analysis of a solution to minimizing concat cost

1 Answers1