Possibilities of removing two characters from string to get unique ones

Question

I need to obtain a number of unique strings after removing exactly two characters. An algorithm should also give a possibility to print these new words. The input string is made from 'a' to 'z' ASCII characters only.

For example for an input string "pizza" we have 7 strings: "zza", "iza", "izz", "pza", "pzz", "pia", "piz".

My priority is high efficiency of an algorithm.

I abandoned idea with set containers to avoid duplicates due to a huge memory complexity (for 2400-character input string I filled whole RAM and swap). This solution also seems not to be so fast.

So I decided to design smarter algorithm which is similar to RLE compression, so I showed "pizza" as blocks: 1xP, 1xI, 2xZ, 1xA. Now we can see that for the first removing character it only matters from which block I removed character (it does not matter which 'z' we removed). So in the first stage we are removing one character from for example first block. In the second stage we are removing second character from still existing blocks. (We repeat procedure in a loop for an every case).

The problem is this solution will work good for removing one character. But for two characters it may generate dupicates. For example input "aba" we have blocks 1xA, 1xB, 1xA: we are removing one character from first block and we get 1xB, 1xA. Next we are removing next character from first existing block (1xB) and we get 1xA. Next algorithm goes into next block of original input string (1xB) and removes one character belonging to the block and we get 2xA. Next we are removing second character from existing blocks (we have got only one block), and we get 1xA. Here we have got duplicate.

In general the algorithm will fail with such words "abababababa..." (we will get same case when we remove first "ab" or any other neighboring.

I need suggestions how to neatly deal with this duplication. Maybe you got completely different ideas for an algorithm.

You can get an `O(n²)` time and `O(n)` space algorithm with two embedded `for` loops to select the two characters to remove. — Balthazar, Oct 31 '16 at 20:25
I know, but how to deal with these duplicates which I mentioned? — Madras, Oct 31 '16 at 20:27
You could inspire yourself from [this](http://stackoverflow.com/questions/2560262/generate-all-unique-substrings-for-given-string) question. — Balthazar, Oct 31 '16 at 20:30
Can you post your code? I would have expected your original solution (with set containers) to work fine and only require a few 10s of megabytes of RAM. — Peter de Rivaz, Oct 31 '16 at 21:04
I made a mistake, should be length of input string equal to 2400 not 400. But it does not matter, because I cannot store in memory all possibilities. — Madras, Oct 31 '16 at 21:18
@Bim What exactly should inspire me? :) There is a lot of information under the link you posted. — Madras, Oct 31 '16 at 21:23

גלעד ברקן · Accepted Answer · 2016-11-10T01:45:01.913

4

I think there is an O(n) time, O(1) space solution for calculating the number (O(n^3) worst case of course for the actual strings). For any one character removed, consider that a string would equal another only when the character removed is from the same contiguous block of identical characters. For example, "piz_a" and "pi_za".

First, we'll count non-adjacent removals. Calculate a tally in O(n) time representing how many blocks of contiguous identical characters are to one side, say left, of the current character, excluding the block the character is in. For example, in "pizza", we would have [0,1,2,2,3]. As we traverse, add the number in array[i-1] to the total, but only if the current char is different from the previous. For "pizza", we would have `(skip) + 0 + 1 + (skip) + 2 = 3 so far. (I used an array for demonstration, although we only need to keep the previous element.)

Now traverse a single window of two adjacent characters, adding 1 only when the two characters do not equal the previous two and the second character in the pair does not equal the character before the pair (like "ba" in "aba"). For "pizza", we get (skip) + 1 + 1 + 1 + 1 = 4, which together with the count for non-adjacent removals sums to 7.

For "ababab", the pair traversal would only count the first "ab" removal, while the non-adjacent pair traversal would count: "_b_bab", "_ba_ab", "a_a_ab", "_bab_b", "a_ab_b", "ab_b_b", "_baba_", "a_aba_", "ab_ba_", and "aba_a_".

To print the actual strings, follow the same idea of enumerating adjacent-pair removals separately from non-adjacent, using the principles above.

edited Nov 10 '16 at 01:45

answered Nov 01 '16 at 11:08

גלעד ברקן

23,602
3
25
61

An algorithm should also give a possibility to print these new words. – Alexander Anikin Nov 01 '16 at 11:49
@AlexanderAnikin thank you for your comment. I think the idea I presented is easy enough to adapt to print the strings. In the case of the adjacent-pair traversal, we are counting each actual possible removal; and in the case of non-adjacent pairs, it's a matter of enumerating the combinations, one removal each of course for any contiguous identical block. – גלעד ברקן Nov 01 '16 at 11:55
Thank you for your answer. I implemented your idea (at the moment without printing) and it works in most cases. Unfortunately there are some words where my implementation generates invalid results. For example "aba" result 3 (correct is 2), "aabaa" result 3 (correct is 4). So there are at least two bugs. Here is the code, I marked my doubts in comments: http://pastebin.com/jUjqqCkT – Madras Nov 03 '16 at 22:49
@Madras Thanks you for your comment, alerting me to possible issues. For `"aba"`, the logic would be: 1 non-adjacent pair removal, `"_b_"` + 1 adjacent pair removal, `"__a"` (since we add 1 "only when the two characters do not equal the previous two and the second character in the pair does not equal the character before the pair (like "ba" in "aba")."). The logic for `"aabaa"` would be: 1 non-adjacent: [0,0,1,2,2] => skip + 0 + 0 + 1 + skip (which is `"_ab_a"`) + 3 adjacent `"__baa", "a__aa", "aab__"` No? – גלעד ברקן Nov 04 '16 at 01:20
@Madras One of the comments in your pastebin link said you didn't understand a condition (which one?) - please let me know if there's something I can better explain or clarify. – גלעד ברקן Nov 04 '16 at 01:26
You have written "adding 1 only when the two characters do not equal the previous two and the second character in the pair does not equal the character before the pair". For example "abcd" word is given and the loop iterator is on 'd', then if I compare two pairs it should be "ab" compared to "cd". In case of one pair comparison I think it should be 'b' compared to 'd'. I don't know when to compare two pairs of characters and when only one pair. In code I implemented the second case. – Madras Nov 04 '16 at 02:06
@Madras I reasoned that there would be two conditions to check for the adjecent-pair traversal. The two conditions for "abcd" when the pair being checked is "cd" would be something like: `if "cd" != "ab" and "d" != "b" then add 1`. Does that make sense? – גלעד ברקן Nov 04 '16 at 02:15
I realised that I made a logic mistake, because I got into a rut. I thought that "cd" != "ab" means ('c' != 'a' AND 'd' != 'b'), but should be OR instead AND. Now it does make sense, previously one comparison overlayed condition of comparison of one pair and it made me confused. – Madras Nov 04 '16 at 03:41
Why for "aabaa" you got [0,0,1,2,2] => skip + 0 + 0 + 1 + skip? I have got [0,0,1,2,2] => skip + skip + 0 + 1 + skip. I checked the code and it seems to be correct according to "add the number in array[i-1] to the total, but only if the current char is different from the previous" as you have written. Algorithm already seems to generate correct results for all cases. Here is current solution: http://pastebin.com/0XKiNTUu – Madras Nov 04 '16 at 03:41
@Madras you're right, `[0,0,1,2,2] => skip + skip + 0 + 1 + skip` would be correct, my mistake. – גלעד ברקן Nov 04 '16 at 10:08
@גלעדברקן I optimized "cd" != "ab" and "d" != "b" condition to "d" != "b", added printing. And finally I get this code http://pastebin.com/vSW7HrBW The stage with adjacent removals has O(n^2) complexity (n pairs and every string contains n-2 characters to print). The stage with non adjacent removals has O(n^3) complexity due to selecting two characters and printing n-2 characters for every word. Is this O(n^3) already optimal complexity? Or can it be done better using for example dynamic programming? – Madras Nov 09 '16 at 23:57
@Madras that seems about right - printing `O(n^2)` removals, where each removal is applied to a string of length `n` would be `O(n^3)` time and space overall. I'm not sure about how one might optimize the printing process. With my limited experience, and depending on your needs, I might choose to use each pair of removal indexes in language-specific string manipulation methods, which hopefully are optimized already. – גלעד ברקן Nov 10 '16 at 01:56

Possibilities of removing two characters from string to get unique ones

1 Answers1