This is about analyzing the complexity of a solution to a popular interview problem.
Problem
There is a function concat(str1, str2)
that concatenates two strings. The cost of the function is measured by the lengths of the two input strings len(str1) + len(str2)
. Implement concat_all(strs)
that concatenates a list of strings using only the concat(str1, str2)
function. The goal is to minimize the total concat cost.
Warnings
Usually in practice, you would be very cautious about concatenating pairs of strings in a loop. Some good explanations can be found here and here. In reality, I have witnessed a severity-1 accident caused by such code. Warnings aside, let's say this is an interview problem. What's really interesting to me is the complexity analysis around the various solutions.
You can pause here if you would like to think about the problem. I am gonna reveal some solutions below.
Solutions
- Naive solution. Loop through the list and concatenate
def concat_all(strs): result = '' for str in strs: result = concat(result, str) return result
- Min-heap solution. The idea is to concatenate shorter strings first. Maintain a min-heap of the strings based on the length of the strings. Each concatenation concatenates 2 strings off the min-heap and the result is pushed back the min-heap. Until only one string is left on the heap.
def concat_all(strs): heap = MinHeap(strs, key_func=len) while len(heap) > 1: str1 = heap.pop() str2 = heap.pop() heap.push(concat(str1, str2)) return heap.pop()
- Binary concat. May not be intuitively clear. But another good solution is to recursively split the list by half and concatenate.
def concat_all(strs): if len(strs) == 1: return strs[0] if len(strs) == 2: return concat(strs[0], strs[1]) mid = len(strs) // 2 str1 = concat_all(strs[:mid]) str2 = concat_all(strs[mid:]) return concat(str1, str2)
Complexity
What I am really struggling and asking here is the complexity of the 2nd approach above that uses a min-heap.
Let's say the number of strings in the list is n
and the total number of characters in all the strings is m
. The upper bound for the naive solution is O(mn)
. The binary-concat has an exact bound of theta(mlog(n))
. It is the min-heap approach that is elusive to me.
I am kind of guessing it has an upper bound of O(mlog(n) + nlog(n))
. The second term, nlog(n)
is associated with maintaining the heap; there are n
concats and each concat updates the heap in log(n)
. If we only focus on the cost of concatenations and ignore the cost of maintaining the min-heap, the overall complexity of the min-heap approach can be reduced to O(mlog(n))
. Then min-heap is a more optimal approach than binary-concat cause for the former mlog(n)
is the upper bound while for the latter it is the exact bound.
But I can't seem to prove it, or even find a good intuition to support that guessed upper bound. Can the upper bound be even lower than O(mlog(n))
?