0

I have a sequence 'abccabac' and a subsequence 'abc'. I need to get indices of all occurences of subsequence 'abc' in the sequence. What is the memory efficient way of doing it?

Example:

Input:

**sequence**: 'abccabac' **subsequence**: 'abc'  

Output:

 012   
 013  
 017  
 057  
 457
verdict_11
  • 35
  • 1
  • 1
  • 7
  • asking for a memory efficient way might suggests you have a way that is not memory efficient. Maybe you would be willing to share it with us to give us something to work with? – Tadhg McDonald-Jensen May 25 '16 at 16:45
  • This is the 'needle-in-a-haystack' problem. The [Knuth-Morris-Pratt algorithm](https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm) is generally considered the most efficient algorithm in terms of time and space complexity. Implement it and see. – Akshat Mahajan May 25 '16 at 16:50
  • @AkshatMahajan that looks like an algorithm to find a consecutive substring without repeated rechecking, This is not asking for consecutive characters. – Tadhg McDonald-Jensen May 25 '16 at 16:53
  • @TadhgMcDonald-Jensen ? OP says he's looking for "indices of all occurrences of subsequence 'abc' in the sequence". KMP is search algorithm for finding occurrence of substring in text. It can be fairly easily generalised to not halt at the first occurrence. Not sure I understand what the issue is. – Akshat Mahajan May 25 '16 at 16:57
  • yes but finding a word in a text means that all the letters of the word are obviously consecutive (come right after each other) but sub sequence of `abc` only occurs once in the sequence `abccabac` right at the beginning. the indices `0 5 7` are indices of an `a` a `b` and a `c` but not right next to each other. – Tadhg McDonald-Jensen May 25 '16 at 17:08
  • tried changing the kmp string match algorithm to fit sub-sequence match, didn't quite work out. Trying to write a recursive approach – verdict_11 May 25 '16 at 17:16
  • Ah! Okay, sorry. I was under the impression subsequence and substring meant the same. Thanks for the clarification. – Akshat Mahajan May 25 '16 at 17:28
  • This question [appears to have been asked before](http://stackoverflow.com/q/24017363/2271269). – Akshat Mahajan May 25 '16 at 17:30
  • @AkshatMahajan that only checks if one subsequence appears in a sequence, it does not generate the indices for all combinations that fit that condition. – Tadhg McDonald-Jensen May 25 '16 at 17:31

1 Answers1

2

the absolute most strait forward way I can think of is using itertools.combinations

import itertools

sequence = "abccabac"

subsequence = "abc"

for combo in itertools.combinations(enumerate(sequence), len(subsequence)):
    if "".join(pair[1] for pair in combo) == subsequence:
        print([pair[0] for pair in combo])

If your actual sequence contains lots of characters that are not even in the subsequence then filtering out the irrelevant characters before starting the combinations would definitely provide a substantial efficiency boost:

char_generator = (pair for pair in enumerate(sequence) if pair[1] in subsequence)
for combo in itertools.combinations(char_generator, len(subsequence)):
    ...

as well instead of joining each combination you can just use all which will only check characters until one is found to be not equal:

char_generator = (pair for pair in enumerate(sequence) if pair[1] in subsequence)
for combo in itertools.combinations(char_generator, len(subsequence)):
    if all(pair[1]==char for pair,char in zip(combo,subsequence)):
        print([pair[0] for pair in combo])

Here is a different alternative I just put together, I imagine the algorithm could be optimized even more but this is about the best I could come up with. The sub_sequence_generator creates a list of indices for each letter of the sub sequence, then the sub_helper for each level of the recursion goes through all indices for the next letter of the subsequence starting after the indice of the last letter.

import itertools
import bisect

def sub_helper(cur_i, combo, all_options):
    #cur_i is the index of the letter in the subsequence
    #cur_idx is the index of the sequence which contains that letter
    if cur_i>0:
        prev_i = combo[cur_i-1]
    else:
        prev_i = -1
    cur_options = all_options[cur_i]
    start_i = bisect.bisect(cur_options,prev_i)
    cur_i_gen = itertools.islice(cur_options,start_i,None)
    if cur_i+1 == len(all_options): #last one:
        for cur_idx in cur_i_gen:
            combo[cur_i] = cur_idx
            yield tuple(combo)
    else:
        for cur_idx in cur_i_gen:
            combo[cur_i] = cur_idx
            yield from sub_helper(cur_i+1, combo, all_options)


def sub_sequence_generator(sequence,sub_seq):
    indices_map = {c:[] for c in set(sub_seq)} #create a list for each character
    for i,c in enumerate(sequence):
        try:
            indices_map[c].append(i,) #the helper is simpler if they are all stored as single element tuples
        except KeyError:
            pass

    # now we have indices for each character of the sub_sequence
    # we just need to put them in a sequence that corelates to the subsequence
    chr_indices = tuple(indices_map[c] for c in sub_seq)

    return sub_helper(0,[None]*len(chr_indices), chr_indices)

sequence = "abccabac"
subsequence = "abc"

for idxs in sub_sequence_generator(sequence,subsequence):
    print(idxs)

In terms of memory this will make a list for each character of the subsequence, containing the indices in the main sequence where that character is present. A tuple of these lists, and a list of the indices that is continually updated and a tuple for each time a combination is yielded, and iterators like islice so this is extremely memory efficient, however since I have not extensively tested it I cannot give any guarantee it is bug free.

Tadhg McDonald-Jensen
  • 20,699
  • 5
  • 35
  • 59
  • Not sure if you'd call this efficient. Invoking `combinations` creates 56 elements with this sequence (8 letters) and subsequence (3 letters), which are fairly small inputs, making runtime approximately O(n^2) - I shudder to think what the length would look like for longer text. :P – Akshat Mahajan May 25 '16 at 16:59
  • maybe not efficient in runtime but it is efficient in memory which is what the question was specifically asking for, I am looking into another way of doing this with sets but the logic is not as simple as this one. – Tadhg McDonald-Jensen May 25 '16 at 17:15
  • 1
    I'm not sure how it relates to big O notation but combinations always generates `n! / r! / (n-r)!` where n is the length of the sequence and r is the length of subsequence, I completely agree that this is not at all the best way to do it. – Tadhg McDonald-Jensen May 25 '16 at 17:17
  • @TadhgMcDonald-Jensen Thanks ! This is exactly what I am trying to do, my subsequence may not go more than 6-7 chars long, but my sequence may be pretty long, using itertools may make the search space explode. – verdict_11 May 25 '16 at 17:23
  • @verdict_11 I added a recursive solution, other then heavily relying on concatenating tuples it is fairly memory and execution efficient. – Tadhg McDonald-Jensen May 25 '16 at 19:06
  • nevermind about tuple concatenation, I just changed it to just update a single list, I have a feeling it could also be changed to not even be recursive but I'm leaving any further optimization to you. – Tadhg McDonald-Jensen May 25 '16 at 19:30
  • @TadhgMcDonald-Jensen - This does the work, thanks for the help ! – verdict_11 May 25 '16 at 19:46