Trying to find the combination of certain characters in a long string

Question

Given this long string s:

ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCCCCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGCCTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAGTTTAATTACAGACCTGAA

I'm trying to find every occurrence of the characters "ATG" and print the index of the character that comes after each occurrence.

I have already tried looping through the string, but so far I have only managed to find the first occurrence of "ATG" and print the index of the character after it, which is 8. However, my program stops after this, even though there are more occurrences of "ATG" in the string.

for y in range(len(s)):
    y = s[i : i + 3]
    i = i + 3
    if y == 'ATG':
        print(s.index(y)+3)

In this part of the code, 's' is the string. The result is 8, as it finds the first occurrence of "ATG" and prints the index of the character after it. My expected result is 8, 110, 278, 336 and 340. It seems the loop stops after finding "ATG" for the first time instead of going all the way through the string until it ends.

Hey :) You managed to pack quite a lot of logical errors into those 5 lines :) I'll address them all in an answer, but until then, I'm curious about a few things: Why do you use `y` as your `range` counter variable and then overwrite it in line 2? What is `y`? Why do you do `i = i + 3`: are you programming a second counter variable here, although you already have `y`? Why do you use `y` as the string variable one line later? Why do you do yet another lookup with `.index` later? Why do you add 3 to the lookup? – Finomnis, Jul 02 '19 at 18:18
I got there by trial and error; I did not really have any reason to do it like that. – Liam Kruize, Jul 02 '19 at 18:58
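
One point behind the `.index` question above can be sketched on a short string. The variable `demo` is a hypothetical input, not the string from the question. Called with only a substring, `str.index` returns the position of the first occurrence no matter how far a surrounding loop has advanced; passing a start position makes it search from there instead.

```python
demo = "ATGCCATG"  # hypothetical short input; "ATG" starts at indices 0 and 5

# Without a start argument the search always begins at index 0.
print(demo.index("ATG"))     # 0

# With a start argument the search begins at that position.
print(demo.index("ATG", 1))  # 5
```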

score 3 · Accepted Answer · edited Jul 02 '19 at 21:30

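i = 0
while True:
    i = s.find("ATG", i)  # next occurrence at or after i, or -1 if none
    if i == -1:
        break
    i += 3  # step past the match
    print(i)  # index of the character after "ATG"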

edited Jul 02 '19 at 21:30

Finomnis


answered Jul 02 '19 at 18:40

Brendan


I would have used `while True: ...; i += 3; print(i)`. – wwii Jul 02 '19 at 18:43
@Finomnis, this works, why do you think it is deficient? – wwii Jul 02 '19 at 18:49
Holy crap, it does! This is hilarious, and amazing. I completely misjudged you. I am so sorry. This is actually the best solution, because it is most likely the fastest! (Besides the regex one) Also, I now understand why you suggested the `i += 3`. – Finomnis Jul 02 '19 at 18:53
Also, I downvoted too early and in hindsight without a clear understanding of what you were trying to accomplish. Now I cannot change the vote unless you edit. So, if you incorporate @wwii 's suggestion, you get another upvote ;) – Finomnis Jul 02 '19 at 18:55
Actually if OP intends to find overlapping occurrences of the sub string then my comment about advance by three instead of one is wrong. – wwii Jul 02 '19 at 19:00
@wwii I agree, but as "ATG" is hard coded, I don't think it matters. Also damn, still can't take back my downvote :( sorry! You would deserve it. Take this comment instead. – Finomnis Jul 02 '19 at 19:03
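
The trade-off discussed in these comments (advance by three or by one after a match) can be sketched as a small function. The name `find_ends` and the `overlapping` flag are hypothetical and not part of the answer; this is only a minimal sketch, assuming the search resumes either right after a match or one character past its start.

```python
def find_ends(haystack, needle, overlapping=False):
    """Return the index just past each occurrence of needle in haystack."""
    if not needle:
        raise ValueError("needle must not be empty")
    ends = []
    i = 0
    while True:
        i = haystack.find(needle, i)
        if i == -1:
            break
        ends.append(i + len(needle))
        # Resume one character past the match start to allow overlaps,
        # otherwise skip the whole match.
        i += 1 if overlapping else len(needle)
    return ends


print(find_ends("CCATGATGC", "ATG"))   # [5, 8]
print(find_ends("CCCC", "CC"))         # [2, 4]
print(find_ends("CCCC", "CC", True))   # [2, 3, 4]
```

For "ATG" both modes give the same result, since the pattern cannot overlap with itself.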

Finomnis · Answer 2 · 2019-07-02T18:39:24.380

2

This should be what you were trying to code:

s = "ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCCCCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGCCTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAGTTTAATTACAGACCTGAA"

search_str = 'ATG'
for y in range(len(s)-len(search_str)+1):
    sub_str = s[y : y + len(search_str)]
    if sub_str == search_str:
        print(y+len(search_str))

In case you need a one-liner for the fixed string 'ATG', here you go:

res = [n+3 for n in range(len(s)-2) if s[n:n+3] == 'ATG']
print(res)

[8, 110, 278, 336, 340]

edited Jul 02 '19 at 18:39

answered Jul 02 '19 at 18:21

Finomnis


Thanks a lot! I was struggling with this a lot but your example makes a lot of sense. How did I not think of it earlier.. – Liam Kruize Jul 02 '19 at 18:58
Also, look at the replies of Brendan and Deepstop, they are both faster and better than mine. Mine is just the closest one to what you were trying to do, I think. – Finomnis Jul 02 '19 at 19:02

Deepstop · Answer 3 · 2019-07-02T19:00:54.733

2

Here's a way to do it with regex:

import re
helix = "ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCCCCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGCCTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAGTTTAATTACAGACCTGAA"

matches = re.finditer(r"ATG", helix)  # avoid shadowing the built-in iter()
indices = [m.end(0) for m in matches]  # m.end(0) is the index just past the match
print(indices)

The result is [8, 110, 278, 336, 340]. I found that this technique is already on Stack Overflow.

Just for fun, I recoded this as a function that lets you specify whether you want overlapping matches (the following assumes helix is already defined).

import re

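def locate(haystack, needle, overlap=False):
    # With overlap=True the lookahead match is zero-width, so m.end(0)
    # is the start of each occurrence rather than the index after it.
    pattern = '(?=' + needle + ')' if overlap else needle
    return [m.end(0) for m in re.finditer(pattern, haystack)]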

print(locate(helix, 'ATG'))
print(locate(helix, 'CCC', True))

Result:

[8, 110, 278, 336, 340]
[15, 16, 17, 63, 68, 69, 82, 83, 177, 194, 195, 245, 246, 247, 248, 249, 278, 330]
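
Because a lookahead match is zero-width, the overlapping result above lists where each 'CCC' starts, while the first list holds the positions after each 'ATG'. A variant that reports the position after the occurrence in both modes might look like the sketch below. The name `locate_ends` is hypothetical, and `re.escape` is added on the assumption that the needle is meant as literal text rather than as a regex.

```python
import re


def locate_ends(haystack, needle, overlap=False):
    """Return the index just past each occurrence of needle."""
    # re.escape treats needle as literal text, not as a regex.
    pattern = re.escape(needle)
    if overlap:
        # A lookahead is zero-width, so m.end() would equal m.start();
        # add the needle length to get the index after the occurrence.
        return [m.start() + len(needle)
                for m in re.finditer(f"(?={pattern})", haystack)]
    return [m.end() for m in re.finditer(pattern, haystack)]


print(locate_ends("CCATGATGC", "ATG"))     # [5, 8]
print(locate_ends("ACCCCG", "CCC", True))  # [4, 5]
```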

edited Jul 02 '19 at 19:00

answered Jul 02 '19 at 18:22

Deepstop


This would be the way I would do it! – Chris Jul 02 '19 at 18:24
Didn't read the instructions properly, the output should be the position *after* the strings, meaning `+3`. – Finomnis Jul 02 '19 at 18:28
Odd, but sure. I will update to `m.start(0) + 3` (and yes, if I read questions better I would probably have won more scholarships back in the day). – Deepstop Jul 02 '19 at 18:29
@Deepstop Don't use `m.start(0) + 3`. Just use `m.end()`. – blhsing Jul 02 '19 at 18:30
@blhsing thank you for the improvement. I've incorporated it. – Deepstop Jul 02 '19 at 18:32
Note that this solution does not return the indices of overlapping occurrences, which the intended logic of the OP's code does. – blhsing Jul 02 '19 at 18:36
Yes I thought about overlap, but considered that ATG wouldn't overlap in any case so I didn't worry about it. However, say we wanted 'CCC' as the search which overlaps in several places, we could use `iter = re.finditer(r'(?=CCC)', helix)` instead. – Deepstop Jul 02 '19 at 18:45
@blhsing Not sure what the intended logic of OP's code is, code quite unclear :) – Finomnis Jul 02 '19 at 18:47
One more edit just for fun, adding a function that does overlapped or non-overlap. – Deepstop Jul 02 '19 at 18:56
:D lol you are going nuts with this. Why don't you post an external rust solution? :P Appreciate it though, I really like the regex solution. But it isn't something that a beginner would likely adopt. – Finomnis Jul 02 '19 at 19:11

static const · Answer 4 · 2019-07-02T18:25:40.053

0

You are changing the value of both y and i, and i is never defined before the loop. What I think you are trying to do is:

idx = 0

while idx < len(s) - 2:
    tempStr = s[idx : idx + 3]
    if tempStr == 'ATG':
        print(idx + 3)  # index after the match; s.index(idx) would raise TypeError for an int
        idx += 3
    else:
        idx += 1

edited Jul 02 '19 at 18:25

answered Jul 02 '19 at 18:20

static const


`len(s) - 2`, otherwise you'll miss the last element – Finomnis Jul 02 '19 at 18:25
Also, I think `index` is out of place here, why don't you just do `print(idx+3)`? – Finomnis Jul 02 '19 at 19:08

DanielM · Answer 5 · 2019-07-02T19:09:25.263

0

There are several errors in your code. You are using y as the index in the for loop and then as the string value.

You are incrementing i by 3, so you are checking for occurrences of ATG only at indices 0, 3, 6, ... You want to advance the index by one at a time (which the for loop does for you) and change the range to len(s)-2.

for i in range(len(s)-2):
    y = s[i : i + 3] 
    if y == 'ATG':
        print(i+3)
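
The effect of stepping by 3 can be seen on a short string. The name `s_demo` is a hypothetical input chosen so that neither occurrence of "ATG" starts at a multiple of 3; this is only an illustrative sketch.

```python
s_demo = "AATGATG"  # hypothetical short input; "ATG" starts at indices 1 and 4

# Stepping by 3 only inspects the windows starting at 0, 3, 6, ...
stride_hits = [i + 3 for i in range(0, len(s_demo) - 2, 3)
               if s_demo[i:i + 3] == "ATG"]

# Stepping by 1 inspects every window.
sliding_hits = [i + 3 for i in range(len(s_demo) - 2)
                if s_demo[i:i + 3] == "ATG"]

print(stride_hits)   # []
print(sliding_hits)  # [4, 7]
```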

edited Jul 02 '19 at 19:09

answered Jul 02 '19 at 18:21

DanielM


Will access s[i+2] with i==len(s)-1, meaning s[s.len()+1], which is out of bounds. – Finomnis Jul 02 '19 at 18:22
I changed the range to `range(len(s)-3)` - I'll point it out in the answer itself – DanielM Jul 02 '19 at 18:23
`range(len(s)-2)`, otherwise you won't find the last element – Finomnis Jul 02 '19 at 18:24
Also, print(i+3) to match OP's requirements. – Finomnis Jul 02 '19 at 19:08

score 0 · Answer 6 · answered Jul 02 '19 at 18:24

For a one liner (modeled after this answer):

>>> res = [n+3 for n in range(len(s)) if s.find('ATG', n) == n]
>>> res
[8, 110, 278, 336, 340]

answered Jul 02 '19 at 18:24

Stephen B


Using the `find` method in a loop over the length of the input unnecessarily makes it cost *O(n ^ 2)* in time complexity. – blhsing Jul 02 '19 at 18:26
A solution with one line may be desired regardless of the time complexity, but it's good to know that it's not as efficient as other looping methods. – Stephen B Jul 02 '19 at 18:28
No. There are better solutions for one line. – Finomnis Jul 02 '19 at 18:28
I am not stopping you from posting that better one line solution :) – Stephen B Jul 02 '19 at 18:29
Done ;) :P and it isn't O(n^2) – Finomnis Jul 02 '19 at 18:32
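
One way to keep the comprehension while avoiding the repeated forward scans is to test only the window at position n, for example with `str.startswith` and its start argument. The function name `ends_via_startswith` is hypothetical; this is a sketch of the idea, not the one-liner referred to in the comments.

```python
def ends_via_startswith(haystack, needle):
    """Return the index just past each (possibly overlapping) occurrence."""
    # startswith(needle, n) compares only the characters beginning at n,
    # so each check costs at most len(needle) comparisons.
    return [n + len(needle)
            for n in range(len(haystack) - len(needle) + 1)
            if haystack.startswith(needle, n)]


print(ends_via_startswith("CCATGATGC", "ATG"))  # [5, 8]
```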
