
I have two short sequences that I search for in a long string. If both sequences are found, the key of the long string is appended to a list (the string I search in is a dictionary value).

Now I am looking for a way to calculate the distance between the two substrings (if both were found).

So, for example:

String: ABCDEFGHIJKL
sequence1: ABC
sequence2: JKL

I want to get the length of DEFGHI, which would be 6.

Here is my code for finding the substrings, with a pseudo-code idea of what I want (the variables start and end). This code does not work, of course:

def search (myDict, list1, list2):
    # initialize empty list to store found keys
    a=[]
    # iterating through dictionary
    for key, value in myDict.items():
        # if -35nt motif is found between -40 and -20
        for item in thirtyFive:
            if item in value[60:80]:
                start=myDict[:item]
            # it is checked for the -10nt motif from -40 to end
                for item in ten:
                    if item in value[80:]:
                        end=myDict[:item]
                # if both conditions are true, the IDs are
                # appended to the list
                        a.append(key)
    distance=start-end
    return a, distance

Second idea: so far, I have found some material on getting the string between two substrings. So the next thing I could imagine is to extract that sequence and call len() on it.

So I would like to know whether my first idea, computing the distance while I search for the short sequences, is possible, and whether I am thinking in the right direction with my second idea.

Thanks in advance :)

SOLUTION, following @Carlos, using the str.find method:

def search(myDict, thirtyFive, ten):
    # keys of all sequences in which both motifs were found
    a = []
    # gap length between the two motifs, stored per key
    distances = {}
    # iterating through dictionary
    for key, value in myDict.items():
        # look for a -35nt motif in the window value[60:80] (-40 to -20)
        for motif35 in thirtyFive:
            start = value.find(motif35, 60, 80)
            if start == -1:
                continue
            # look for a -10nt motif from index 80 (-20) to the end
            for motif10 in ten:
                end = value.find(motif10, 80)
                if end == -1:
                    continue
                # both motifs found: store the ID and the number of
                # characters between the end of the first motif and
                # the start of the second
                a.append(key)
                distances[key] = end - start - len(motif35)
    return a, distances

# calling search function
x, distances = search(d, thirtyFive, ten)
# some other things I need to print
y = len(x)
print(x)
print(y)
# desired output: distance for each key
print(distances)
Shushiro
  • Any chance that either of those start/stop strings might occur more than once in your data, and if so, how should that be handled? – Tim Pietzcker Nov 02 '17 at 07:46
  • Since I limited the regions where they can be found, they should only appear once. I am not searching for arbitrary sequences; there is a biological background, so given this limitation the sequences appear only once in my case. If you are aiming at a general solution, I think you are correct: I am not handling this. – Shushiro Nov 02 '17 at 07:51

4 Answers


You can use str.find to locate both substrings and then slice out what lies between them:

In [1]: a='ABCDEFGHIJKL'

In [2]: b='ABC'

In [3]: c='JKL'

In [4]: a.find(b)
Out[4]: 0

In [6]: a.find(c)
Out[6]: 9

In [7]: l=a.find(b) + len(b)

In [8]: l
Out[8]: 3

In [10]: a[l:a.find(c)]
Out[10]: 'DEFGHI'
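
The session above could be wrapped into a small helper. This is only a sketch: the name text_between is made up for illustration, and it assumes you want the first occurrence of the first sequence and the first occurrence of the second sequence after it. Passing a start index to str.find keeps the second search from matching anything before the end of the first sequence.

```python
def text_between(text, first, second):
    """Return the text between `first` and `second`, or None if either is missing."""
    i = text.find(first)
    if i == -1:
        return None
    # index of the first character after `first`
    gap_start = i + len(first)
    # only search for `second` from that position onwards
    j = text.find(second, gap_start)
    if j == -1:
        return None
    return text[gap_start:j]


between = text_between("ABCDEFGHIJKL", "ABC", "JKL")
print(between, len(between))  # DEFGHI 6
```

The length of the returned string is the distance asked for in the question.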

Sanket

You can also do it using a regex:

import re
s = "ABCDEFGHIJKL"
seq1 = "ABC"
seq2 = "JKL"

s1 = re.match(seq1 + "(.*)" + seq2, s).group(1)
print(s1)
print(len(s1))

Output

DEFGHI
6

OR

Using str.replace:

s2 = s.replace(seq1, '').replace(seq2, '')
print(s2)
print(len(s2))

Output

DEFGHI
6
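
Both variants work for the example because the string starts with seq1 and ends with seq2. A hedged sketch of what happens with flanking text (the helper name between_regex and the test string are made up for illustration): re.search finds the pattern anywhere in the string, whereas re.match only matches at the beginning, and re.escape protects against sequences that contain regex metacharacters. The str.replace variant keeps the flanking characters, so its length is no longer the distance.

```python
import re


def between_regex(text, first, second):
    """Return the text captured between `first` and `second`, or None if no match."""
    # re.escape makes sure the sequences are treated as literal text
    pattern = re.escape(first) + "(.*)" + re.escape(second)
    match = re.search(pattern, text)
    return match.group(1) if match else None


s = "xxABCDEFGHIJKLyy"
print(between_regex(s, "ABC", "JKL"))           # DEFGHI
print(s.replace("ABC", "").replace("JKL", ""))  # xxDEFGHIyy
```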


Ashish Ranjan

Use str.find() to get the start index of each substring, then subtract the length of the first substring from the difference.

Also don't forget corner cases, e.g. where the substrings overlap or one of them is not found at all.
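
A minimal sketch of that calculation, assuming a hypothetical helper called motif_distance: str.find returns -1 when a substring is missing, so that has to be checked before doing arithmetic, and a negative result means the two substrings overlap. What the "correct" answer is in the overlap case is up to you; here the negative number is simply returned.

```python
def motif_distance(text, first, second):
    """Number of characters between `first` and `second`; None if either is missing."""
    start = text.find(first)
    end = text.find(second)
    # str.find signals "not found" with -1
    if start == -1 or end == -1:
        return None
    # characters between the end of `first` and the start of `second`
    return end - start - len(first)


print(motif_distance("ABCDEFGHIJKL", "ABC", "JKL"))  # 6
print(motif_distance("ABCDEF", "ABC", "CDE"))        # -1, overlap of one character
print(motif_distance("ABCDEF", "ABC", "XYZ"))        # None, second substring missing
```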

Carlos
  • Thanks, I will dig deeper into that. Don't I prevent overlapping with my if condition? The first sequence is searched for from 60 to 80 ([60:80]) and, if found, the second one is searched for starting at 80 ([80:]). Or wait, does this overlap by one letter? – Shushiro Nov 02 '17 at 07:48
  • @AbhishtaGatya https://stackoverflow.com/questions/3437059/does-python-have-a-string-contains-substring-method – Carlos Nov 02 '17 at 07:49
  • @Shushiro By overlap, I mean the case where you have the string ABCDEF... and your two queries are ABC and CDE. You then need to decide what the correct output should be. – Carlos Nov 02 '17 at 07:50

Solution using regular expressions:

import re

string = "ABCDEFGHIJKL"
sequence1 = "ABC"
sequence2 = "JKL"

result = re.search(sequence1+'(.*)'+sequence2,string)
print(len(result.group(1)))
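
Two things to watch with this approach, shown in a hedged sketch below (the function gap_length and its lazy parameter are made up for illustration). First, re.search returns None when there is no match, so calling .group() directly would raise an AttributeError. Second, the greedy .* runs to the last occurrence of the second sequence, while the lazy .*? stops at the first one, which matters if the second sequence can occur more than once.

```python
import re


def gap_length(text, first, second, lazy=True):
    """Length of the text between `first` and `second`, or None if there is no match."""
    # ".*?" stops at the first occurrence of `second`, ".*" at the last one
    quantifier = ".*?" if lazy else ".*"
    pattern = re.escape(first) + "(" + quantifier + ")" + re.escape(second)
    result = re.search(pattern, text)
    if result is None:
        return None
    return len(result.group(1))


print(gap_length("ABCxxJKLyyJKL", "ABC", "JKL"))              # 2
print(gap_length("ABCxxJKLyyJKL", "ABC", "JKL", lazy=False))  # 7
print(gap_length("ABCxx", "ABC", "JKL"))                      # None
```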
kingmakerking