3

I am trying to generate all the unique strings, that can be created by removing one or more characters from the original string.

For example, for the original string "anna", the resulting strings would be nna|ana|ann|na|aa|nn|an|a|n. I am not simply looking for unique substrings, as for example "aa" is not a substring of "anna", yet it is included in the set of the strings I am looking for.

The code below generates the correct result, but is too slow for when the original string is extraordinarily long, for example 50 characters or so. I am looking for suggestions on how to make it more efficient.

public class Permutation3 {

    private static Set<String> hs = new HashSet<>();

    public static void main(String[] args) {
        perm("ANNA", 0);
        System.out.println(hs.size() - 2);
    }

    private static void perm(String string, int startIndex) {

        hs.add(string);
        for (int i = startIndex; i <= string.length() - 1; i++) {
            StringBuilder sb = new StringBuilder(string);
            sb.deleteCharAt(i);
            perm(sb.toString(), i);

        }
    }
Sofia K
  • 79
  • 5
  • 50 characters doesn't seem to big, now if it were 50,000... – Jorn Vernee Jan 10 '17 at 12:39
  • 2
    50 characters has up to 2^^50 results so you would expect it to take a long time. – Peter Lawrey Jan 10 '17 at 12:45
  • 1
    @Null it's not a duplicate, as the questions are entirely different. The one you linked asks for substrings, while this one asks for all possible strings that can be generated by removing chars. E.g. for the word "ANNA", "AA" wouldn't be part of the solution to the linked problem, but it would be for this one. –  Jan 10 '17 at 12:45
  • @Paul yes, thank you. For this particular case, the number of solutions is much higher than the number of unique substrings. – Sofia K Jan 10 '17 at 12:50
  • Have you considered suffix array using? – MBo Jan 10 '17 at 12:57
  • @PeterLawrey I stand corrected. – Jorn Vernee Jan 10 '17 at 13:04
  • 1
    If you just want to know the total number, then see this previous question: http://stackoverflow.com/questions/5151483/how-to-find-the-number-of-distinct-subsequences-of-a-string – Peter de Rivaz Jan 10 '17 at 13:05

1 Answers1

3

First off: how many possible strings can be produced by arbitrarily removing characters from a string with length n?

We can represent each String that can be produced this way by a bit-field of length n, where each bit denotes whether a certain character has been removed. Or in other words: there are 2^n possible strings that can be generated this way.

So for an input-string of length 50 there's a total of 1,125899907E15 distinct strings that could be generated that way. That's a bit more than one quadrillion (short scale). Or in other words: if each word could be written into one byte the produced words would still fill up 1PB.

So the actual problem isn't that much the performance, but simply the gigantic number of values that will be generated.

Of course the code could be tweaked a bit:
For example recursion is quite expensive and your code produces some solutions multiple times.

Yet the main-problem is rather that the output is growing exponential to the input-size and thus any algorithm will reach the limits of what your machine is capable of pretty fast.

  • Indeed printing all strings is a strange task. It's more interesting and realistic task to print only Nth string of all the strings generated by removing characters (for example in the lexicographical order) – maxteneff Jan 10 '17 at 13:08
  • @maxteneff for smaller input-sizes, this might actually be an interesting problem for tweaking. But for "large" `n` the output is simply to large to be handled. –  Jan 10 '17 at 13:10
  • No, I mean to print only one string which is on the Nth place of lexicographically sorted all generated strings. So the task could be to generate this one string. – maxteneff Jan 10 '17 at 13:15
  • @maxteneff I did understand that. I was just talking about this ("generate unique strings...") problem. Producing the Nth String is definitely interesting as well. –  Jan 10 '17 at 13:23