Longest Subsequence with all occurrences of a character at 1 place

Question

In a sequence S of n characters; each character may occur many times in the sequence. You want to find the longest subsequence of S where all occurrences of the same character are together in one place;

For ex. if S = aaaccaaaccbccbbbab, then the longest such subsequence(answer) is aaaaaaccccbbbb i.e= aaa__aaacc_ccbbb_b.

In other words, any alphabet character that appears in S may only appear in one contiguous block in the subsequence. If possible, give a polynomial time algorithm to determine the solution.

I must say, I don't understand. Which is the desired output? The 'aaaaaaccccbbbb' or 'aaa__aaacc_ccbbb_b'? Why 'c' is before 'b' in both of the 'solutions'? Is it significant? It looks like a simple O(N) problem... — ishi, Nov 19 '12 at 20:09
Answer is aaaaaaccccbbbb, aaa__aaacc_ccbbb_b is only the explanation for the original answer. — pxm, Nov 19 '12 at 20:24
No boundtions for answer, The solution can include all distinct characters or only one if it's longest. & yes it's subSEQUENCE so you hve to maintain the order. — pxm, Nov 19 '12 at 20:27
Would it be correct to phrase the problem as follows? "Find the fewest positions in S such that if those characters are removed, the remaining sequence has the property that if it is decomposed into runs of identical characters, no two runs are for the same character." — Ted Hopp, Nov 19 '12 at 20:32
How many distinct characters are present in the original sequence? In your example it is only 3 (a,b,c) so there are quick solutions possible. Is this always the case, or might there be all 26 letters of the alphabet in the input sequence (or more)? (I am considering solving it via DP for each permutation of characters, but this will take far too long if you have a large number of distinct characters) — Peter de Rivaz, Nov 19 '12 at 20:37
heh... and here I thought I understood how it's supposed to work. If "solution can (..) only one [character] if it's longest", why 'aaaaaaa' (7x a) isn't the answer? It's def. the longest... — ishi, Nov 19 '12 at 20:38
Yes, problem can contain as many characters. Basically we have to remove the characters and obtain the longest sub sequence such that all occurrences of a character(if included) in answer occur at 1 place. — pxm, Nov 19 '12 at 21:03

j_random_hacker · Accepted Answer · 2012-11-20T13:58:45.197

Design

Below I give a C++ implementation of a dynamic programming algorithm that solves this problem. An upper bound on the running time (which is probably not tight) is given by O(g*(n^2 + log(g))), where n is the length of the string and g is the number of distinct subsequences in the input. I don't know a good way to characterise this number, but it can be as bad as O(2^n) for a string consisting of n distinct characters, making this algorithm exponential-time in the worst case. It also uses O(ng) space to hold the DP memoisation table. (A subsequence, unlike a substring, may consist of noncontiguous character from the original string.) In practice, the algorithm will be fast whenever the number of distinct characters is small.

The two key ideas used in coming up with this algorithm were:

Every subsequence of a length-n string is either (a) the empty string or (b) a subsequence whose first element is at some position 1 <= i <= n and which is followed by another subsequence on the suffix beginning at position i+1.
If we append characters (or more specifically character positions) one at a time to a subsequence, then in order to build all and only the subsequences that satisfy the validity criteria, whenever we add a character c, if the previous character added, p, was different from c, then it is no longer possible to add any p characters later on.

There are at least 2 ways to manage the second point above. One way is to maintain a set of disallowed characters (e.g. using a 256-bit array), which we add to as we add characters to the current subsequence. Every time we want to add a character to the current subsequence, we first check whether it is allowed.

Another way is to realise that whenever we have to disallow a character from appearing later in the subsequence, we can achieve this by simply deleting all copies of the character from the remaining suffix, and using this (probably shorter) string as the subproblem to solve recursively. This strategy has the advantage of making it more likely that the solver function will be called multiple times with the same string argument, which means more computation can be avoided when the recursion is converted to DP. This is how the code below works.

The recursive function ought to take 2 parameters: the string to work on, and the character most recently appended to the subsequence that the function's output will be appended to. The second parameter must be allowed to take on a special value to indicate that no characters have been appended yet (which happens in the top-level recursive case). One way to accomplish this would be to choose a character that does not appear in the input string, but this introduces a requirement not to use that character. The obvious workaround is to pass a 3rd parameter, a boolean indicating whether or not any characters have already been added. But it's slightly more convenient to use just 2 parameters: a boolean indicating whether any characters have been added yet, and a string. If the boolean is false, then the string is simply the string to be worked on. If it is true, then the first character of the string is taken to be the last character added, and the rest is the string to be worked on. Adopting this approach means the function takes only 2 parameters, which simplifies memoisation.

As I said at the top, this algorithm is exponential-time in the worst case. I can't think of a way to completely avoid this, but some optimisations can help certain cases. One that I've implemented is to always add maximal contiguous blocks of the same character in a single step, since if you add at least one character from such a block, it can never be optimal to add fewer than the entire block. Other branch-and-bound-style optimisations are possible, such as keeping track of a globally best string so far and cutting short the recursion whenever we can be certain that the current subproblem cannot produce a longer one -- e.g. when the number of characters added to the subsequence so far, plus the total number of characters remaining, is less than the length of the best subsequence so far.

Code

#include <iostream>
#include <vector>
#include <string>
#include <algorithm>
#include <functional>
#include <map>

using namespace std;

class RunFinder {
    string s;
    map<string, string> memo[2];    // DP matrix

    // If skip == false, compute the longest valid subsequence of t.
    // Otherwise, compute the longest valid subsequence of the string
    // consisting of t without its first character, taking that first character
    // to be the last character of a preceding subsequence that we will be
    // adding to.
    string calc(string const& t, bool skip) {
        map<string, string>::iterator m(memo[skip].find(t));

        // Only calculate if we haven't already solved this case.
        if (m == memo[skip].end()) {
            // Try the empty subsequence.  This is always valid.
            string best;

            // Try starting a subsequence whose leftmost position is one of
            // the remaining characters.  Instead of trying each character
            // position separately, consider only contiguous blocks of identical
            // characters, since if we choose one character from this block there
            // is never any harm in choosing all of them.
            for (string::const_iterator i = t.begin() + skip; i != t.end();) {
            if (t.end() - i < best.size()) {
                // We can't possibly find a longer string now.
                break;
            }

                string::const_iterator next = find_if(i + 1, t.end(), bind1st(not_equal_to<char>(), *i));
                // Just use next - 1 to cheaply give us an extra char at the start; this is safe
                string u(next - 1, t.end());
                u[0] = *i;      // Record the previous char for the recursive call
                if (skip && *i != t[0]) {
                    // We have added a new segment that is different from the
                    // previous segment.  This means we can no longer use the
                    // character from the previous segment.
                    u.erase(remove(u.begin() + 1, u.end(), t[0]), u.end());
                }
                string v(i, next);
                v += calc(u, true);

                if (v.size() > best.size()) {
                    best = v;
                }

                i = next;
            }

            m = memo[skip].insert(make_pair(t, best)).first;
        }

        return (*m).second;
    }

public:
    RunFinder(string s) : s(s) {}

    string calc() {
        return calc(s, false);
    }
};

int main(int argc, char **argv) {
    RunFinder rf(argv[1]);
    cout << rf.calc() << '\n';
    return 0;
}

Example results

C:\runfinder>stopwatch runfinder aaaccaaaccbccbbbab
aaaaaaccccbbbb
stopwatch: Terminated. Elapsed time: 0ms
stopwatch: Process completed with exit code 0.

C:\runfinder>stopwatch runfinder abbaaasdbasdnfa,mnbmansdbfsbdnamsdnbfabbaaasdbasdnfa,mnbmansdbfsbdnamsdnbfabbaaasdbasdnfa,mnbmansdbfsbdnamsdnbfabbaaasdbasdnfa,mnbmansdbfsbdnamsdnbf
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa,mnnsdbbbf
stopwatch: Terminated. Elapsed time: 609ms
stopwatch: Process completed with exit code 0.

C:\runfinder>stopwatch -v runfinder abcdefghijklmnopqrstuvwxyz123456abcdefghijklmnop
stopwatch: Command to be run: <runfinder abcdefghijklmnopqrstuvwxyz123456abcdefghijklmnop>.
stopwatch: Global memory situation before commencing: Used 2055507968 (49%) of 4128813056 virtual bytes, 1722564608 (80%) of 2145353728 physical bytes.
stopwatch: Process start time: 21/11/2012 02:53:14
abcdefghijklmnopqrstuvwxyz123456
stopwatch: Terminated. Elapsed time: 8062ms, CPU time: 7437ms, User time: 7328ms, Kernel time: 109ms, CPU usage: 92.25%, Page faults: 35473 (+35473), Peak working set size: 145440768, Peak VM usage: 145010688, Quota peak paged pool usage: 11596, Quota peak non paged pool usage: 1256
stopwatch: Process completed with exit code 0.
stopwatch: Process completion time: 21/11/2012 02:53:22

The last run, which took 8s and used 145Mb, shows how it can have problems with strings containing many distinct characters.

EDIT: Added in another optimisation: we now exit the loop that looks for the place to start the subsequence if we can prove that it cannot possibly be better than the best one discovered so far. This drops the time needed for the last example from 32s down to 8s!

Thanks. I also worked out an onother solution i.e Find the LCS(longest Common Subsequence) between 'original' string and the string with all char. together like aaaaaaabbbbbcccccc. Find LCS between these 2 strings for all permutations aaaaaaa,bbbbb,cccccc in the 2nd string. Though Complexity is poor =O(p!*n^2) — pxm, Nov 21 '12 at 09:38
@RahulGandhi: You're welcome. Just thought of a new optimisation: `calc()` doesn't actually care about the particular letters, only their pattern. E.g. the result of `calc("YYYXX", true)` is the same as the result of `calc("AAABB", true)` after renaming all `A`s to `Y`s and all `B`s to `X`s. Converting strings down to a canonical form (e.g. first letter => `A`, next different letter => `B` etc.) and then converting the results back would drastically reduce the number of distinct strings seen by `calc()` in many cases, meaning more work is avoided and more memory saved. — j_random_hacker, Nov 21 '12 at 10:57

Ted Hopp · Answer 2 · 2012-11-19T21:49:13.207

EDIT: This solution is wrong for OP's problem. I'm not deleting it because it might be right for someone else. :)

Consider a related problem: find the longest subsequence of S of consecutive occurrences of a given character. This can be solved in linear time:

char c = . . .; // the given character
int start = -1;
int bestStart = -1;
int bestLength = 0;
int currentLength = 0;
for (int i = 0; i < S.length; ++i) {
    if (S.charAt(i) == c) {
        if (start == -1) {
            start = i;
        }
        ++currentLength;
    } else {
        if (currentLength > bestLength) {
            bestStart = start;
            bestLength = currentLength;
        }
        start = -1;
        currentLength = 0;
    }
}
if (bestStart >= 0) {
    // longest sequence of c starts at bestStart
} else {
    // character c does not occur in S
}

If the number of distinct characters (call it m) is reasonably small, just apply this algorithm in parallel to each character. This can be easily done by converting start, bestStart, currentLength, bestLength to arrays m long. At the end, scan the bestLength array for the index of the largest entry and use the corresponding entry in the bestStart array as your answer. The total complexity is O(mn).

I think this is a great solution if all instances of the character have to be together in the original string. However, I think this question has the slightly different requirement that all instances of the character have to be together in the resulting subsequence, but not necessarily in the original string. What would your algorithm give for the given example? I think it would give aaaccbbb. — Peter de Rivaz, Nov 19 '12 at 20:22
@PeterdeRivaz - Okay, it took me a while re-reading OP's requirement. I think I understand it now (sort of). I see that this "solution" is probably irrelevant to the actual problem. — Ted Hopp, Nov 19 '12 at 20:29
I got an hint for solving this with State Machine but i am unable to do it. — pxm, Nov 19 '12 at 21:08

score -1 · Answer 3 · edited Nov 20 '12 at 02:59

import java.util.*;

public class LongestSubsequence {

    /**
     * @param args
     */
    public static void main(String[] args) {
        Scanner sc = new Scanner(System.in);

        String str = sc.next();

        execute(str);

    }


    static void execute(String str) {

        int[] hash = new int[256];
        String ans = "";

        for (int i = 0; i < str.length(); i++) {

            char temp = str.charAt(i);

            hash[temp]++;
        }

        for (int i = 0; i < hash.length; i++) {
            if (hash[i] != 0) {
                for (int j = 0; j < hash[i]; j++)
                    ans += (char) i;
            }
        }

        System.out.println(ans);
    }
}

Space: 256 -> O(256), I don't if it's correct to say this way..., cause O(256) I think is O(1) Time: O(n)

All this does is list out all letters in the string in alphabetical order. That has the side effect of grouping identical characters into blocks, but it ignores the fact that if two instances of a particular character are included then every character between them in the original string that is different must be excluded. E.g. it returns `AAB` for the input `ABA` instead of `AA`. — j_random_hacker, Nov 21 '12 at 10:52

Longest Subsequence with all occurrences of a character at 1 place

3 Answers3

Design

Code

Example results