42

Here is another spoj problem that asks how to find the number of distinct subsequences of a string ?

For example,

Input
AAA
ABCDEFG
CODECRAFT

Output
4
128
496

How can I solve this problem ?

  • 3
    Heads-up: They have an *interesting* definition of a subsequence. According to their definition "AC" is a subsequence of "ABC", which it wouldn't be according to my understanding. Maybe everyone else agrees with theirs, but I thought I'd highlight it. – Joachim Sauer Mar 01 '11 at 07:11
  • 12
    @Joachim why not ? "AC" is a subsequence and "AB" and "BC" are substrings maybe you confuse with the substring and subsequence –  Mar 01 '11 at 07:16
  • 4
    maybe I do and maybe their definition is the correct one anyway, I'm not arguing that. I just said that intuitively I'd have interpreted it differently and wanted to provide a heads-up if anyone had the same (possibly wrong) idea as I had. – Joachim Sauer Mar 01 '11 at 07:18
  • I don't understand the problem. Could you explain your examples? – Colonel Panic Jul 28 '15 at 16:01

4 Answers4

64

It's a classic dynamic programming problem.

Let:

dp[i] = number of distinct subsequences ending with a[i]
sum[i] = dp[1] + dp[2] + ... + dp[i]. So sum[n] will be your answer.
last[i] = last position of character i in the given string.

A null string has one subsequence, so dp[0] = 1.

read a
n = strlen(a)
for i = 1 to n
  dp[i] = sum[i - 1] - sum[last[a[i]] - 1]
  sum[i] = sum[i - 1] + dp[i]
  last[a[i]] = i

return sum[n]

Explanation

dp[i] = sum[i - 1] - sum[last[a[i]] - 1]

Initially, we assume we can append a[i] to all subsequences ending on previous characters, but this might violate the condition that the counted subsequences need to be distinct. Remember that last[a[i]] gives us the last position a[i] appeared on until now. The only subsequences we overcount are those that the previous a[i] was appended to, so we subtract those.

sum[i] = sum[i - 1] + dp[i]
last[a[i]] = i

Update these values as per their definition.

If your indexing starts from 0, use a[i - 1] wherever I used a[i]. Also remember to wrap your computations in a mod function if you're going to submit code. This should be implemented like this:

mod(x) = (x % m + m) % m

In order to correctly handle negative values in some languages (such as C/C++).

IVlad
  • 43,099
  • 13
  • 111
  • 179
34

There exists an easier solution to this problem.

The idea is: If all character of the string are distinct, total number of subsequences is 2^n. Now, if we find any character that have already occurred before, we should consider its last occurrence only (otherwise sequence won't be distinct). So we have to subtract the number of subsequences due to its previous occurrence.

My implementation is like this:

read s
dp[0] = 1
len = strlen(s)
last[s.length()] = {-1} //declaring `last` array with same as length of string `s` and all elements initialized with -1.

for (i = 1; i <= len; i++) 
{
    dp[i] = (dp[i - 1] * 2)
    if (last[s[i]] > 0) dp[i] = (dp[i] - dp[last[s[i]] - 1])
    last[s[i]] = i
}
asn
  • 2,408
  • 5
  • 23
  • 37
Mostafiz Rahman
  • 8,169
  • 7
  • 57
  • 74
  • Thanks mostafiz. http://www.geeksforgeeks.org/count-distinct-subsequences/ it has complete solution/ – Minkesh Jain Oct 06 '17 at 10:08
  • 1
    @mostafiz why do we subtract -1 from ```last[s[i]] - 1]```, i.e. why do we do this ```dp[last[s[i]] - 1]``` and not just ```dp[last[s[i]]]``` – user3797829 Sep 26 '18 at 18:11
  • @MostafizRahman Can you explain what do you mean by `last occurrence` in your answer? – asn Jan 09 '19 at 06:39
  • @AjaySinghNegi It means no. of subsequences when dp[i] was calculated with the present character appearing previously at that time. Hence, that many no of subsequences at that time should be deducted from the present no. of subsequences so that they are not calculated redundantly again at present calculation of `dp[i] = 2*dp[i-1]`. – asn Jan 09 '19 at 07:18
  • 1
    as I've just accepted your edit, I seems like you got your answer, right? – Mostafiz Rahman Jan 09 '19 at 07:36
0

Here is my CODE:

#include<iostream>
typedef long long ll;

ll fun(std::string s,ll visited[256],ll n,ll L[]){  

    ll ans=0;
    if(n<0){
        return 1;
    }
    //std::cout<<s.substr(0,n+1)<<" "<<n<<endl;

    ans=fun(s,visited,n-1,L);
    L[n]=ans;
    ans=ans*2;
    if(visited[int(s[n])]>=0){
        ans -= L[visited[int(s[n])]];
    }
    visited[int(s[n])]=n;

    return ans;

}
int main(){

    std::string s;
    std::cin>>s;
    ll n=s.length();

    ll visited[256];
    ll L[n];
    memset(visited,-1,sizeof(visited));
    memset(L,-1,sizeof(L));

    std::cout<<fun(s,visited,n-1,L);

    return 0;
}

Explanation :

I scan from the back of a string ie- from the last element to the first and therefore send the first n-1 characters for further scanning in the recursion.

Once n==-1 or n<0(both are same), I reach on the empty string and return 1 because no. of subsequences of an empty string is 1.

So, on returning back from recursion, we know that adding the current non-duplicate character to the previous string doubles the no. of subsequences. Doubling happens because now I can add this character at the end of all the previous subsequences. So, with and without this character means double of all previous subsequences.

Assuming that the current character is not a duplicate, I multiply the previous no. of subsequences with 2.

After the total no. of subsequences of the first n-1 characters has been computed, we double them for the first n characters.

But, suppose the character currently encountered(nth character) has already been present in the first n-1 characters before(ie - found within the string s[0....n-1] (Note: s[n] is the current character)), then we have to subtract those no. of subsequences possible from up to (excluding) that part of s when the last time this current character was encountered and which has already been computed and stored in L['this particular character'].

ie - BACA - for the given string, the 4th A has already been encountered before(while returning from the recursion, we first encounter B, then A, then C and at last A) and so we deduct the no. of subsequences calculated upto (excluding) the 2nd A(which is 2 (because no. of subseq. before A is 2)).

So, every time we have calculated the no. of subsequences for the first n-1 characters, we store them in the array L.

Notice : L[k] store the no. of subsequences before the kth index.

I've used the visited array in order to check whether the given character that I'm currently present at has already been scanned through or not.

On encountering the current character, I update the visited array with the position of current position as n. This need to be done because we have to exclude the duplicate sequences.

Note: visited[] is initialized with all -1 because the position of any character in the string s is non-negative (0 based indexing).

Summary:

How do you arrive at the number of duplicates? Let's say the last occurrence of current character at i, was at j'th position. Then, we will have duplicate subsequences: consider starting with i'th character and then all subsequences possible from [0,j-1] vs. starting at j'th character and then all subsequences possible from [0,j-1]. So, to eliminate this, you subtract the number of subsequences possible from upto (excluding) j with L[0]=1 mean that upto(excluding 0), no. of subseq are 1(empty string has 1 subsequence).

asn
  • 2,408
  • 5
  • 23
  • 37
-4
///i get wa 
int finding_dist_subs(int len,char data[])
{
  dp[0]=1;
  for(int i=1;i<len;i++)
  {
      dp[i]=(dp[i-1]*2+1)%1000000007;
      for(int j=i-1;j>=0;j--)
      {
          if(data[i]==data[j])
          {
              if(j!=0)
           dp[i]=(dp[i]-(dp[j-1])-1)%1000000007;
           else dp[i]=(dp[i]-1)%1000000007;

            break;
          }
      }
  }
  return dp[len-1];
}
pradip
  • 7