Find and use multiple occurrences of a string in a string

Question

I recently started using Python and have written some simple scripts. Now I have this question:

I have this string:

mystring = 'AAAABBAAABBAAAACCAAAACCAAAA'

and I have the following strings:

String_A = 'BB'
String_B = 'CC'

I would like to get all possible substrings of mystring that start with String_A and end with String_B (this is a bit vague, so the desired output is shown below).

output:
BBAAABBAAAACCAAAACC
BBAAABBAAAACC
BBAAAACCAAAACC
BBAAAACC

I am able to count the number of occurrences of String_A and String_B in mystring using

mystring.count()

I am also able to print one specific output (the one from the first occurrence of String_A to the first occurrence of String_B) by doing the following:

if String_A in mystring:
    String_B_End = mystring.index(String_B) + len(String_B)
    output = mystring[mystring.index(String_A):String_B_End]
    print(output)

This works perfectly, but it only gives me the following output:

BBAAABBAAAACC

How can I get all the specified output strings from mystring? Thanks in advance!

But your output is not *all possible combinations* – Mazdak, Mar 26 '15 at 10:42

score 1 · Accepted Answer · edited May 23 '17 at 10:24

If I understand the intention of your question correctly, you can use the following code:

>>> import re
>>> mystring = 'AAAABBAAABBAAAACCAAAACCAAAA'
>>> String_A = 'BB'
>>> String_B = 'CC'
>>> def find_occurrences(s, a, b):
...     a_is = [m.start() for m in re.finditer(re.escape(a), s)]  # All indexes of a in s
...     b_is = [m.start() for m in re.finditer(re.escape(b), s)]  # All indexes of b in s
...     result = [s[i:j + len(b)] for i in a_is for j in b_is if j > i]
...     return result
...
>>> find_occurrences(mystring, String_A, String_B)
['BBAAABBAAAACC', 'BBAAABBAAAACCAAAACC', 'BBAAAACC', 'BBAAAACCAAAACC']

This uses the code for finding all occurrences of a substring from this answer.

In its current form, the code does not handle overlapping substrings: if mystring = 'BBB' and you look for the substring 'BB', it only returns the index 0. To account for overlapping substrings, change the lines that collect the indexes so that they use a lookahead, for example a_is = [m.start() for m in re.finditer("(?={})".format(re.escape(a)), s)], and likewise for b_is.
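As a rough sketch of the overlapping variant described above, the lookahead pattern can be factored into one helper. The names overlapping_indexes and find_occurrences_overlapping are made up here for illustration and are not part of the original answer:

```python
import re


def overlapping_indexes(sub, s):
    # Hypothetical helper: a zero-width lookahead matches at every position
    # where sub begins, so overlapping occurrences are all reported.
    pattern = "(?={})".format(re.escape(sub))
    return [m.start() for m in re.finditer(pattern, s)]


def find_occurrences_overlapping(s, a, b):
    # Same pairing logic as find_occurrences, only the index lookup differs.
    a_is = overlapping_indexes(a, s)
    b_is = overlapping_indexes(b, s)
    return [s[i:j + len(b)] for i in a_is for j in b_is if j > i]


print(overlapping_indexes('BB', 'BBB'))  # [0, 1]
```

For the mystring in the question there are no overlapping matches, so this variant should return the same four substrings as find_occurrences.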

score 0 · Answer 2 · answered Mar 26 '15 at 10:44

Well, first you need to get the indexes of String_A and String_B in the text. See this:

s = mystring
[i for i in range(len(s)-len(String_A)+1) if s[i:i+len(String_A)]==String_A]

It returns [4, 9], i.e. the indexes of 'BB' in mystring. Do the same for String_B, for which the result is [15, 21].

Then you do this:

[(i, j) for i in [4, 9] for j in [15, 21] if i < j]

This line combines each starting location with each ending location and ensures that the starting location occurs before the ending location. The i < j condition is not essential for this particular example, because every 'BB' comes before every 'CC', but in general you should include it. The result is [(4, 15), (4, 21), (9, 15), (9, 21)].
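To illustrate why the i < j filter matters in general, here is a small sketch with a made-up input string (not from the question) in which one end marker appears before the start marker:

```python
s = 'CCxBBxCC'     # hypothetical input: a 'CC' occurs before the only 'BB'
starts = [3]       # index of 'BB' in s
ends = [0, 6]      # indexes of 'CC' in s

unfiltered = [(i, j) for i in starts for j in ends]
filtered = [(i, j) for i in starts for j in ends if i < j]

print(unfiltered)  # [(3, 0), (3, 6)]
print(filtered)    # [(3, 6)]

# Without the filter, the pair (3, 0) produces an empty slice.
print(repr(s[3:0 + len('CC')]))  # ''
```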

Then you just convert the start and end indices to substrings:

[s[a:b+len(String_B)] for a, b in [(4, 15), (4, 21), (9, 15), (9, 21)]]
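The three steps above can be put together roughly as follows. This is only a sketch: the function names find_all and substrings_between are invented here for illustration, and the hard-coded index lists are replaced by computed ones:

```python
def find_all(s, sub):
    # Every index at which sub starts in s (overlapping matches included).
    return [i for i in range(len(s) - len(sub) + 1) if s[i:i + len(sub)] == sub]


def substrings_between(s, start, end):
    starts = find_all(s, start)
    ends = find_all(s, end)
    # Pair every start with every later end, then slice out the substring.
    pairs = [(i, j) for i in starts for j in ends if i < j]
    return [s[a:b + len(end)] for a, b in pairs]


mystring = 'AAAABBAAABBAAAACCAAAACCAAAA'
print(substrings_between(mystring, 'BB', 'CC'))
# ['BBAAABBAAAACC', 'BBAAABBAAAACCAAAACC', 'BBAAAACC', 'BBAAAACCAAAACC']
```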
