Python3: Find length between two substrings of a string

Question

I have two short sequences that I search for in a "long string". If both sequences are found, the key of the "long string" is appended to a list (the string I search in is a dictionary value).

Now I am looking for a way to calculate the distance between the two substrings (if both were found).

So, for example:

String: ABCDEFGHIJKL
sequence1: ABC
sequence2: JKL

I want to get the length of DEFGHI, which would be 6.

Here is my code for finding the substrings, with a pseudo-code idea of what I want (the variables start and end). This code does not work, of course.

def search (myDict, list1, list2):
    # initialize empty list to store found keys
    a=[]
    # iterating through dictionary
    for key, value in myDict.items():
        # if -35nt motif is found between -40 and -20
        for item in thirtyFive:
            if item in value[60:80]:
                start=myDict[:item]
            # it is checked for the -10nt motif from -40 to end
                for item in ten:
                    if item in value[80:]:
                        end=myDict[:item]
                # if both conditions are true, the IDs are
                # appended to the list
                        a.append(key)
    distance=start-end
    return a, distance

Second idea: so far, I have found some material on how to get the string between two substrings. So the next thing I could imagine is to extract that sequence and call len(sequence) on it.

So I would like to know whether my first idea, computing the distance while I am searching for the short sequences, is possible, and whether I am thinking in the right direction with my second idea.

Thanks in advance :)

SOLUTION following @Carlos using str.find method

def search(myDict, thirtyFive, ten):
    # initialize empty list to store found keys
    a = []
    # iterating through dictionary
    for key, value in myDict.items():
        # check whether a -35nt motif is found between -40 and -20
        for motif35 in thirtyFive:
            # str.find limited to value[60:80]; returns -1 if the motif is absent
            start = value.find(motif35, 60, 80)
            if start == -1:
                continue
            # check for a -10nt motif from -20 to the end
            for motif10 in ten:
                end = value.find(motif10, 80)
                if end == -1:
                    continue
                # if both conditions are true, the ID is
                # appended to the list
                a.append(key)
                # gap between the end of the -35 motif and the start of the
                # -10 motif (overwritten on every match, so it holds the last one)
                search.distance = end - start - len(motif35)

    return a

# calling search function
x=search(d,thirtyFive,ten)
#some other things I need to print
y=len(x)
print(str(x))
print(y)
# desired output
print(search.distance)

Any chance that either of those start/stop strings might occur more than once in your data, and if so, how should that be handled? — Tim Pietzcker, Nov 02 '17 at 07:46
since I limited the regions, where they are to be found, they should only appear once. I am not searching "any sequence", it has a biological background, so due to the limitation, in my case, the sequences appear only once. If you aim at a general solution, I think you are correct, I am not handling this — Shushiro, Nov 02 '17 at 07:51

score 3 · Answer 1 · answered Nov 02 '17 at 07:51

Check this

In [1]: a='ABCDEFGHIJKL'

In [2]: b='ABC'

In [3]: c='JKL'

In [4]: a.find(b)
Out[4]: 0

In [6]: a.find(c)
Out[6]: 9

In [7]: l=a.find(b) + len(b)

In [8]: l
Out[8]: 3

In [10]: a[l:a.find(c)]
Out[10]: 'DEFGHI'

Sanket

score 3 · Answer 2 · answered Nov 02 '17 at 07:53

You can also do it using a regex:

import re
s = "ABCDEFGHIJKL"
seq1 = "ABC"
seq2 = "JKL"

s1 = re.match(seq1 + "(.*)" + seq2, s).group(1)
print(s1)
print(len(s1))

Output

DEFGHI
6

OR

Using str.replace:

s2 = s.replace(seq1, '').replace(seq2, '')
print(s2)
print(len(s2))

Output

DEFGHI
6
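
One caveat that may be worth checking: str.replace removes the two sequences wherever they occur, so it only returns the text between them when nothing precedes the first sequence or follows the second. The sketch below uses a made-up input string with flanking characters (not from the question) to compare it with slicing between the two match positions.

```python
s = "xxABCDEFGHIJKLyy"  # hypothetical input with flanking characters
seq1 = "ABC"
seq2 = "JKL"

# str.replace keeps the flanking characters, so the result is too long.
replaced = s.replace(seq1, "").replace(seq2, "")
print(replaced, len(replaced))  # xxDEFGHIyy 10

# Slicing between the two match positions gives the intended stretch.
start = s.find(seq1) + len(seq1)
sliced = s[start:s.find(seq2, start)]
print(sliced, len(sliced))  # DEFGHI 6
```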

score 1 · Accepted Answer · answered Nov 02 '17 at 07:45

Use str.find() to get the two indices, and adjust for the length of the first substring.

Also don't forget corner cases, e.g. where the substrings overlap.
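
A minimal sketch of that idea, assuming that an overlapping or missing match should simply yield None (the function name distance_between and that policy are illustrative choices, not part of the answer):

```python
def distance_between(text, first, second):
    """Return the number of characters between `first` and `second` in `text`.

    Returns None if either substring is missing, or if `second` is only
    found before the end of `first` (the overlap corner case).
    """
    start = text.find(first)
    if start == -1:
        return None
    gap_start = start + len(first)
    # Search for the second substring only from where the first one ends,
    # so overlapping matches such as ABC / CDE are rejected.
    end = text.find(second, gap_start)
    if end == -1:
        return None
    return end - gap_start


print(distance_between("ABCDEFGHIJKL", "ABC", "JKL"))  # 6
print(distance_between("ABCDEFGHIJKL", "ABC", "CDE"))  # None
```

Whether an overlap should return None, zero, or raise an error depends on what the data means, so that decision is left to the caller here.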

Carlos

Thanks, I will dig deeper into that. Don't I prevent overlapping with my if condition? The first sequence is searched for from 60 to 80 ([60:80]) and, if found, the second one is searched for starting at 80 ([80:]). Or wait, does this overlap by one letter? – Shushiro Nov 02 '17 at 07:48
@AbhishtaGatya https://stackoverflow.com/questions/3437059/does-python-have-a-string-contains-substring-method – Carlos Nov 02 '17 at 07:49
@Shushiro by overlap, I mean if you have the string ABCDEF... and your two queries are ABC and CDE. You then need to decide what the correct output should be. – Carlos Nov 02 '17 at 07:50

kingmakerking · Answer 4 · 2017-11-02T07:59:45.257

1

Solution using regular expressions:

import re

string = "ABCDEFGHIJKL"
sequence1 = "ABC"
sequence2 = "JKL"

result = re.search(sequence1+'(.*)'+sequence2,string)
print(len(result.group(1)))
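
A slightly more defensive variant, as a sketch: it assumes the two sequences should be matched literally and that a missing match should give None instead of an AttributeError. The helper name between is made up for illustration.

```python
import re


def between(text, first, second):
    """Return the text between `first` and the next `second`, or None."""
    # re.escape keeps characters such as '.' or '+' in the inputs literal;
    # the non-greedy (.*?) stops at the first occurrence of `second`.
    pattern = re.escape(first) + "(.*?)" + re.escape(second)
    match = re.search(pattern, text)
    return match.group(1) if match else None


print(len(between("ABCDEFGHIJKL", "ABC", "JKL")))  # 6
print(between("ABCDEFGHIJKL", "ABC", "XYZ"))  # None
```

Note that '.' does not match newlines by default, so multi-line input would need the re.DOTALL flag.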

edited Nov 02 '17 at 07:59

answered Nov 02 '17 at 07:56

kingmakerking

this surely has SyntaxErrors, when you'll fix them, it'll be same as : https://stackoverflow.com/a/47070184/6518605 – Ashish Ranjan Nov 02 '17 at 07:57
Sure. Didn't see your submission – kingmakerking Nov 02 '17 at 07:59
