Count substrings with a minority of tails from a string of coin tosses

Question

Given a sequence of heads and tails I want to count the number significant substrings in which the number of heads is not less than the number of tails. I want to achieve this in O(NlogN) time.

Example input: [ 'H', 'T', 'H', 'T', 'T', 'H' ]

Example output:

Explanation:

{H} {H} {H}
{H, T} {T, H} {H, T} {T, H}
{H, T, H} 
{H, T, H, T} 
{H, T, T, H} 
{H, T, H, T, T, H}

I believe my current algorithm is O(N^2). I solve the problem recursively, iterating with the list of coins sliced on either end.

Here is my current algorithm. How can I achieve O(NLogN) time?

def count_sequences( data ):
    print go(range(0,len(data)),data)
seen = set()
def go(rang,data):
    if tuple(rang) in seen: return 0
    seen.add(tuple(rang))
    h = 0
    summ = 0
    if len(rang)==0: return 0
    for i in rang:
        if data[i] == 'H': h += 1
        summ += go(rang[1:],data)
        summ += go(rang[:-1],data)
    if len(rang) == 1: 
        if h ==1: return 1
        else: return 0
    if h > (len(rang)-1)/2 : 
        return 1 + summ
    else: return summ

Questions like these, which are not solving a specific problem but are more of code quality/improvement, are best suited for codereview.stackexchange.com — Burhan Khalid, Jan 11 '15 at 07:53
"find all subsets" and "how many subsequence" are different questions. Your output shows that you are interested in subsequences (not subsets) because you consider `{H, T, H, T}` and `{H, T, T, H}` to be different. There can be exponential number of subsequences. You can't **enumerate** them in `O(n log n)` time. I don't know whether you can **count** them in `O(n log n)` time. — jfs, Jan 11 '15 at 09:22
@J.F.Sebastian: I think OP actually means substring, and not either subsequence nor subset. His example does not include the subsequence `{H, H, T}`, for example. (Also, the title includes the word "adjacent".) — rici, Jan 11 '15 at 16:01
@rici: I think you are right. The point stands: *enumerate* substrings (worse case `O(N*N)`) is a different problem than to *count* them (time complexity might be better) — jfs, Jan 11 '15 at 16:07
[Cross-posted to Code Review](http://codereview.stackexchange.com/q/77224/10916); has answers there. — Janne Karila, Jan 11 '15 at 17:25

score 2 · Accepted Answer · answered Aug 19 '15 at 01:39

Here's an O(n) solution.

Imagine instead of H and T, you had 1 and -1 in your array. This reduces the problem to computing the number of non-negative sum subarrays. This is a known problem that can be solved computing the accumulated array and finding the number of inversions.

This can be naively computed in O(n^2) searching for pairs i < j where A[i]>A[j]. It can be optimized to O(n log n) using a merge sort variant.

But in this case specially, the values in the array can be at most n, and consecutive values have absolute difference of exactly 1, so we can create an algorithm that computes those inversions on the fly in O(n):

def solve(A):
    count = 0
    accum = 0
    total = 0
    seen = {0: 1}

    for i, element in enumerate(A):
        if element == 'H': 
            count += 1
            accum -= seen.get(count, 0)
        else:
            accum += seen.get(count, 0)
            count -= 1

        seen[count] = seen.get(count, 0) + 1
        total += (i + 1 - accum)

    return total

print solve([ 'H', 'T', 'H', 'T', 'T', 'H' ])

print solve([ 'T', 'H', 'H', 'H', 'T' ])

This algorithm is largely based on the one I've read here.

You could use `seen = defaultdict(int)`, to write `seen[count]` instead of `seen.get(count, 0)`. Here are [O(n**2) and O(n*log n) "number of inversions" implementations](http://stackoverflow.com/a/2989341/4279). I don't understand `total += (i + 1 - accum)` step. Is it related to [O(n) "counting sort" algorithm](https://en.wikipedia.org/wiki/Counting_sort) in anyway? — jfs, Sep 04 '15 at 19:38
@J.F.Sebastian This O(n) algorithm isn't a general one. It only works with this problem because the elements of A (when turned into numbers) can only be 0 or 1. So, each position in the accumulated array differs exactly by 1 with the previous. — Juan Lopes, Sep 04 '15 at 19:46
@J.F.Sebastian Also, `accum` is, for each element, the number of values in the accumulated array that are greater than the current one. I want the contrary, that's why I compute `i - accum`. The `+1` is because `i` is 0-based. — Juan Lopes, Sep 04 '15 at 19:48

Count substrings with a minority of tails from a string of coin tosses

1 Answers1