Reduce binary string to an empty string by removing subsequences with alternative characters

Question

This was a question asked in the coding round for NASDAQ internship.

Program description:

The program takes a binary string as input. We have to successively remove sub-sequences having all characters alternating, till the string is empty. The task was to find the minimum number of steps required to do so.

Example1:
let the string be : 0111001
Removed-0101, Remaining-110
Removed-10 , Remaining-1
Removed-1
No of steps = 3

Example2:
let the string be : 111000111
Removed-101, Remaining-110011
Removed-101, Remaining-101
Removed-101
No of steps = 3

Example3:
let the string be : 11011
Removed-101, Remaining-11
Removed-1 , Remaining-1
Removed-1
No of steps = 3

Example4:
let the string be : 10101
Removed-10101
No of steps = 1

The solution I tried, considered the first character of the binary string as first character for my sub-sequence. Then created a new string, where the next character would be appended if it wasn't part of the alternating sequence. The new string becomes our binary string. In this way, a loop continues till the new string is empty. (somewhat an O(n^2) algorithm). As expected, it gave me a timeout error. Adding a somewhat similar code in C++ to the one I had tried, which was in Java.

    #include<bits/stdc++.h>
    using namespace std;
    
    int main() {
        string str, newStr;
        int len;
        char c;
        int count = 0;
        getline(cin, str);
        len = str.length();
    
        //continue till string is empty
        while(len > 0) {
            len = 0;
            c = str[0];
            for(int i=1; str[i] != '\0';i++) {
                //if alternative characters are found, set as c and avoid that character
                if(c != str[i]) 
                    c = str[i];
                //if next character is not alternate, add the character to newStr
                else {
                    newStr.push_back(str[i]);
                    len++;
                }
            }
            str = newStr;
            newStr = "";
            count++;
        }
        cout<<count<<endl;
        return 0;
    }

I also tried methods like finding the length of the largest sub sequence of same consecutive characters which obviously didn't satisfy every case, like that of example3.

Hope somebody could help me with the most optimized solution for this question. Preferably a code in C, C++ or python. Even the algorithm would do.

If you believe your algorithm was correct but merely too slow, it would be very helpful to provide the actual code you wrote in order for answerers to be able to gauge whether it's sufficient to merely improve the performance of your approach (and show you why your particular implementation was slow), or go through the effort of proposing a more efficient approach entirely. — Patrick Roberts, Nov 23 '20 at 04:59
Yes, Ill add that for reference. I had done it in java(only option provided) on their online platform, so i don't have the exact code. Ill write almost the same in c++. — Bichu Ben, Nov 23 '20 at 05:28

Serial Lazer · Answer 1 · 2020-11-23T07:41:57.400

I found a more optimal O(NlogN) solution by maintaining a Min-Heap and Look-up hashMap.

We start with the initial array as alternating counts of 0, 1.

That is, for string= 0111001; lets assume our input-array S=[1,3,2,1]

Basic idea:

Heapify the count-array
Extract minimum count node => add to num_steps
Now extract both its neighbours (maintained in the Node-class) from the Heap using the lookup-map
Merge both these neighbours and insert into the Heap
Repeat steps 2-4 until no entries remain in the Heap

Code implementation in Python

class Node:
    def __init__(self, node_type: int, count: int):
        self.prev = None
        self.next = None
        self.node_type = node_type
        self.node_count = count

    @staticmethod
    def compare(node1, node2) -> bool:
        return node1.node_count < node2.node_count


def get_num_steps(S: list): ## Example: S = [2, 1, 2, 3]
    heap = []
    node_heap_position_map = {} ## Map[Node] -> Heap-index
    prev = None
    type = 0
    for s in S:
        node: Node = Node(type, s)
        node.prev = prev
        if prev is not None:
            prev.next = node
        prev = node
        type = 1 - type

        # Add element to the map and also maintain the updated positions of the elements for easy lookup
        addElementToHeap(heap, node_heap_position_map, node)

    num_steps = 0
    last_val = 0
    while len(heap) > 0:
        # Extract top-element and also update the positions in the lookup-map
        top_heap_val: Node = extractMinFromHeap(heap, node_heap_position_map)
        num_steps += top_heap_val.node_count - last_val
        last_val = top_heap_val.node_count

        # If its the corner element, no merging is required
        if top_heap_val.prev is None or top_heap_val.next is None:
            continue

        # Merge the nodes adjacent to the extracted-min-node:
        prev_node = top_heap_val.prev
        next_node = top_heap_val.next

        removeNodeFromHeap(prev_node, node_heap_position_map)
        removeNodeFromHeap(next_node, node_heap_position_map)
        del node_heap_position_map[prev_node]
        del node_heap_position_map[next_node]
        
        # Created the merged-node for neighbours and add to the Heap; and update the lookup-map
        merged_node = Node(prev_node.node_type, prev_node.node_count + next_node.node_count)
        merged_node.prev = prev_node.prev
        merged_node.next = next_node.next
        addElementToHeap(heap, node_heap_position_map, merged_node)

    return num_steps

PS: I havent implemented the Min-heap operations above, but the function-method-names are quite eponymous.

I see you have found a more efficient approach in line with the solutions proposed by @moreON. Thanks. But I strongly feel an O(n) solution is possible which may not require usage of heaps as the test provides you with just 1 hour for the solution. Not sure of it though. Do write if you can think of an even better solution. Cheers!! — Bichu Ben, Nov 23 '20 at 08:21
Nope. Don't have a clue, Also sorry, I'm not that good with heaps. Just a beginner. Could you please elaborate on how that hash map is used here? Also some insights about the merging of neighbours done here. — Bichu Ben, Nov 23 '20 at 08:33
Short-answer: `Hash-map is used as a lookup map to find the position of a node in the heap; so we can remove it` That way we remove both the neighbours of the min-node and finally insert a `merged-node` to the heap — Serial Lazer, Nov 23 '20 at 08:38
I assume that by neighbours, you meant the neighbours of the minimum node. By that assumption, consider cases like, multiple min nodes, neighbours also being min nodes, nodes that dont have neighbours. Are all such cases considered? — Bichu Ben, Nov 23 '20 at 08:54
@BichuBen yes all such cases are considered. In case of multiple minimum nodes, notice that the removal order doesnt really matter — Serial Lazer, Nov 23 '20 at 08:57
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/224959/discussion-between-serial-lazer-and-bichu-ben). — Serial Lazer, Nov 23 '20 at 09:01
Please see my [answer](https://stackoverflow.com/a/64983145/2034787). — גלעד ברקן, Nov 24 '20 at 08:58

moreON · Answer 2 · 2020-11-23T09:06:00.387

2

I won't write the full code. But I have an idea of an approach that will probably be fast enough (certainly faster than building all of the intermediate strings).

Read the input and change it to a representation that consists of the lengths of sequences of the same character. So 11011 is represented with a structure that specifies it something like [{length: 2, value: 1}, {length: 1, value: 0}, {length: 2, value: 1}]. With some cleverness you can drop the values entirely and represent it as [2, 1, 2] - I'll leave that as an exercise for the reader.

With that representation you know that you can remove one value from each of the identified sequences of the same character in each "step". You can do this a number of times equal to the smallest length of any of those sequences.

So you identify the minimum sequence length, add that to a total number of operations that you're tracking, then subtract that from every sequence's length.

After doing that, you need to deal with sequences of 0 length. - Remove them, then if there are any adjacent sequences of the same value, merge those (add together the lengths, remove one). This merging step is the one that requires some care if you're going for the representation that forgets the values.

Keep repeating this until there's nothing left. It should run somewhat faster than dealing with string manipulations.

There's probably an even better approach that doesn't iterate through the steps at all after making this representation, just examining the lengths of sequences starting at the start in one pass through to the end. I haven't worked out what that approach is exactly, but I'm reasonably confident that it would exist. After trying what I've outlined above, working that out is a good idea. I have a feeling it's something like - start total at 0, keep track of minimum and maximum total reaches. Scan each value from the start of string, adding 1 to the total for each 1 encountered, subtracting 1 for each 0 encountered. The answer is the greater of the absolute values of the minimum and maximum reached by total. - I haven't verified that, it's just a hunch. Comments have lead to further speculation that doing this but adding together the maximum and absolute of minimum may be more realistic.

edited Nov 23 '20 at 09:06

answered Nov 23 '20 at 06:19

moreON

1,977
17
19

Thanks. Your approach using count of repeated characters seems much faster than using strings in most cases, ignoring some exception cases like [5,4,3,2,1] . So, convert to list representation, loop till list is empty, find min, subtract min from all values and add to count, generate new list by using something like 2 pointer method I guess, and finally print. Hope this is what you meant. – Bichu Ben Nov 23 '20 at 07:27
As for your 2nd approach, that is something I too am searching for. A method to find the answer in a single pass. Your hunch felt right, but it doesn't work for cases like 100100. Answer should be 3, but max reach is 1 and min reach is -2, so that would give 2 as answer. Hope you can come up with a solution. – Bichu Ben Nov 23 '20 at 07:54
@BichuBen `solution2()` is the approach described in this answer: https://godbolt.org/z/v3zc9q – Patrick Roberts Nov 23 '20 at 07:57
@PatrickRoberts could you just explain the logic? It would have been much better if you could use `using namespace std`, to avoid the the usage of `::std` everywhere in your code. – Bichu Ben Nov 23 '20 at 08:03
@PatrickRoberts your `solution2()` is of `O(N^2)` complexity, definitely not linear – Serial Lazer Nov 23 '20 at 08:12
@SerialLazer I never claimed it was linear? – Patrick Roberts Nov 23 '20 at 08:22
@PatrickRoberts The 2nd approach being discussed in the above solution is linear. Actually OP already has a O(N^2) solution as mentioned in the question, he's looking for more optimal solution – Serial Lazer Nov 23 '20 at 08:26
1

Is it what I said, but _adding_ the max and the absolute value of the minimum instead of selecting one? Still spitballing, not actually solving. @BichuBen – moreON Nov 23 '20 at 08:28
@SerialLazer `solution2()` in the link I provided was an implementation of the first approach given here. The "second approach" you're referring to in this answer isn't really an algorithm, it's just a hypothetical musing at the moment. – Patrick Roberts Nov 23 '20 at 08:37
@PatrickRoberts I understand, but doesnt matter because we are only looking for solutions better than O(N^2). – Serial Lazer Nov 23 '20 at 08:39
@moreON that seems like a better hunch. Let me see if I can find any discrepancies. SerialLazer, PatrickRoberts , you guys can join in as well. – Bichu Ben Nov 23 '20 at 08:41
@SerialLazer you seem to be under the assumption that asymptotic complexity is the only factor I'm considering. It's still much more efficient than the initial answer given in the question. Even the `solution1()` is more efficient since it's not allocating new strings on each iteration, just operating in-place. – Patrick Roberts Nov 23 '20 at 08:45
@PatrickRoberts sure maybe you can post it as a separate answer then – Serial Lazer Nov 23 '20 at 08:51
I'm not entirely convinced that my first approach _is_ O(n ^ 2) - In order to get to a point that you need to do more iterations, you need longer and different sequences. Longer sequences means fewer sequences. I think it's probably more likely O(n ^ 1.5) (aka n sqrt(n)). It's still no n log(n), but it's something. Being certain of that answer requires a much more in depth analysis though. – moreON Nov 23 '20 at 09:01
@moreON The worst case is O(n^2)... but it still performs much better than mine in average cases. – Bichu Ben Nov 23 '20 at 09:08
You say that, but there's a maximum of O(sqrt(n)) distinct sequence lengths that can be present for an input of length n which limits iterating through the whole sequence and decrementing, then finding new minima, etc. to that many times. - maybe there's something in the merging that makes this simplistic analysis fail and you're right though. – moreON Nov 23 '20 at 09:11
@moreON Could you add your hypothesis as an answer? The one taking **sum of max and absolute(min)** as the answer. I feel that could be the answer. If more people read that, maybe we could get some corner cases or maybe even a solid proof. – Bichu Ben Nov 23 '20 at 09:33
I think I came to a similar [conclusion](https://stackoverflow.com/a/64983145/2034787) as you did. – גלעד ברקן Nov 24 '20 at 09:13

גלעד ברקן · Accepted Answer · 2020-11-24T14:33:23.163

We can solve this in O(n) time and O(1) space.

This isn't about order at all. The actual task, when you think about it, is how to divide the string into the least number of subsequences that consist of alternating characters (where a single is allowed). Just maintain two queues or stacks; one for 1s, the other for 0s, where characters pop their immediate alternate predecessors. Keep a record of how long the queue is at any one time during the iteration (not including the replacement moves).

Examples:

(1)

0111001

   queues
1  1   -
0  -   0
0  -   00
1  1   0
1  11  -
1  111 -  <- max 3
0  11  0

For O(1) space, The queues can just be two numbers representimg the current counts.

(2)

111000111
   queues (count of 1s and count of 0s)
1  1  0
1  2  0
1  3  0  <- max 3
0  2  1
0  1  2
0  0  3  <- max 3
1  1  2
1  2  1
1  3  0  <- max 3

(3)

11011
   queues
1  1  0
1  2  0
0  1  1
1  2  0
1  3  0  <- max 3

(4)

10101

   queues
1  1  0  <- max 1
0  0  1  <- max 1
1  1  0  <- max 1
0  0  1  <- max 1
1  1  0  <- max 1

Thanks mate. I had received an almost same answer from @moreON in on one of the other answer's comments. But we were unable to prove why the logic even worked. Your explanation is the thing I had been searching for. Cheers!! — Bichu Ben, Nov 24 '20 at 13:37
@BichuBen yes, after I posted this answer, I noticed that moreON at the end of their answer had suggested this kind of approach might work, and I noted that in a [comment](https://stackoverflow.com/questions/64962657/reduce-binary-string-to-an-empty-string-by-removing-subsequences-with-alternativ#comment114886497_64963408). — גלעד ברקן, Nov 24 '20 at 14:47

score 1 · Answer 4 · answered Nov 30 '20 at 10:17

Time complexity - O(n)

void solve(string s) {
    int n = s.size();
    int zero = 0, One = 0, res = 0;
    
    for (int i = 0; i < n; i++) 
    {
        if (s[i] == '1') 
        {
            if (zero > 0) 
                zero--;
            else 
                res++;
            
            One++;
        }
        
        else
        {
            if (One > 0) 
                One--;
            else 
                res++;
            
            zero++;
        }
    }
    cout << res << endl;
}

Reduce binary string to an empty string by removing subsequences with alternative characters

4 Answers4

Basic idea:

Code implementation in Python