
I’m trying to find a specific sequence in a list containing DNA code (the letters are already converted to numbers, e.g. A=1, T=4).

E.g.:

dna = [1,4,3,2,3,2,1,2,3,2,4,2,1,2,2,2,2,4,1,3,4]

Look at the first 3 items (1,4,3) and check whether they are 2,2,4. If so, report the position (here, positions 0,1,2 would give False). Otherwise look at the next 3 items (2,3,2) and repeat. Do this for all positions in dna.

My approach was a for i in range loop, which should give me the positions 15, 16, 17 in dna, but it won’t...


import random

A,G,C,U = 1,2,3,4

dna = []

for _ in range(200): #just generated random 200 numbers as example dna
    code = random.randrange(1,5,1)
    dna.append(code)

l = int(len(dna)/3) #splits search into 3
m = 0 #number of matches found so far

for i in range(l):
    k = i*3
    if dna[k] == 2:
        if dna[k+1] == 2:
            if dna[k+2] == 4:
                m += 1
                print('GGU at:', dna[i], dna[i+1], dna[k+2], 'found:', m)

I have tried many different ideas from similar questions on SO, but most don’t care about the order of the numbers. Sometimes the pseudo position would be the items 2,2,4 themselves, and sometimes it wouldn’t find any matches. All help would be appreciated!

Mad Physicist
Kiwipo17
  • 224 can appear in a location that isn't a multiple of 3 – Mad Physicist Aug 11 '19 at 22:39
  • Possible duplicate of [Find starting and ending indices of sublist in list](https://stackoverflow.com/questions/17870544/find-starting-and-ending-indices-of-sublist-in-list) – Mad Physicist Aug 11 '19 at 22:58
  • [Related](https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform). There is a huge literature on the problem; if you are not doing this for fun, you will probably be better off with one of the existing tools. – hilberts_drinking_problem Aug 11 '19 at 23:05
  • Another quick solution would be to convert `dna` and the subpattern to strings and run `re.findall`. This will be efficient if you are testing for only a handful of patterns. – hilberts_drinking_problem Aug 11 '19 at 23:09
  • I can answer the [other question](https://stackoverflow.com/q/62721006/176769) if you still need help. – karlphillip Jul 03 '20 at 18:41
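
The string-and-regex idea from the comments can be sketched roughly as follows. This is only an illustration: the helper name `find_with_regex` is made up here, it uses `re.finditer` rather than `re.findall` so that positions are available, and it assumes every item is a single digit so that one character corresponds to one list position.

```python
import re

def find_with_regex(seq, pattern):
    """Return the start index of every (possibly overlapping) match.

    Hypothetical helper; assumes each item is a single digit (0-9),
    so a character offset in the string equals an index in the list.
    """
    text = "".join(str(x) for x in seq)
    needle = "".join(str(x) for x in pattern)
    # A lookahead consumes no characters, so overlapping matches are reported too.
    return [m.start() for m in re.finditer("(?=" + re.escape(needle) + ")", text)]

dna = [1,4,3,2,3,2,1,2,3,2,4,2,1,2,2,2,2,4,1,3,4]
print(find_with_regex(dna, [2, 2, 4]))  # [15]
```

With codes above 9 the one-character-per-item assumption breaks, and a list-based search is the safer choice.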

1 Answer


Use some kind of generator to split the code into chunks of length 3 and compare them to [2,2,4] in a for loop:

import random
A,G,C,U = 1,2,3,4

dna = []

for _ in range(200): #just generated random 200 numbers as example DNA
    code = random.randrange(1,5,1)
    dna.append(code)
#function to split a list into consecutive chunks of a fixed size
def get_chunks(li, size=3):
    #stop early so a leftover tail shorter than size is never yielded
    for start in range(0, len(li) - size + 1, size):
        yield li[start:start + size]
#create a generator that returns the chunks of length 3
chunk_generator = get_chunks(dna, 3)
#write the chunks to a list named chunks
chunks = []
for x in chunk_generator:
    chunks.append(x)

#iterate the chunks to find a match
j = 0
for i in chunks:
    if i == [2, 2, 4]: #was the sequence found?
        print("2, 2, 4 located at " + str(j) + ", " + str(j+1) + ", " + str(j+2))
    j += 3

This program reports [2,2,4] only when it fills a whole chunk, so an occurrence that straddles two chunks is missed. Of course, it prints nothing if no such chunk exists.
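
As the first comment on the question points out, 2,2,4 can also start at an index that is not a multiple of 3. If those occurrences matter, a sliding window that tests every start index is one way to catch them. The sketch below is not part of the original answer, and `find_sublist` is a hypothetical name chosen for illustration.

```python
def find_sublist(seq, pattern):
    """Return every index i where seq[i:i + len(pattern)] equals pattern."""
    n = len(pattern)
    # Slide a window of length n over seq one position at a time.
    return [i for i in range(len(seq) - n + 1) if seq[i:i + n] == pattern]

dna = [1,4,3,2,3,2,1,2,3,2,4,2,1,2,2,2,2,4,1,3,4]
for start in find_sublist(dna, [2, 2, 4]):
    print("2, 2, 4 located at", start, start + 1, start + 2)
```

On the example list from the question this reports positions 15, 16, 17. Each window comparison costs up to len(pattern) steps, which is fine for short patterns such as a single codon.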

marc_s
Jake
  • Thank you so much! this was already very helpful! I'm not sure what the location is trying to tell me as of yet (what + str(j) + str(j+1) + str(j+0) ACTUALLY outputs). Could you @Jake maybe elaborate on this a little bit further? Thank you so much! It prints: `2, 2, 4 located at 818281 2, 2, 4 located at 135136135 2, 2, 4 located at 153154153` – Kiwipo17 Aug 12 '19 at 10:14
  • Fixed it. I did some sloppy coding and just concatenated the 3 next values without paying attention to formatting – Jake Aug 12 '19 at 11:42
  • Thank you so much Jake!! – Kiwipo17 Aug 13 '19 at 15:23
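
The run-together output quoted in the comments (`818281`) is what bare string concatenation produces. A small sketch, assuming the earlier print looked like the expression quoted in the comment and using 81 as a hypothetical start index:

```python
j = 81  # hypothetical start index, chosen to match the first reported output

# Bare concatenation, as quoted in the comment: the digits run together.
run_together = str(j) + str(j + 1) + str(j + 0)

# Joining with a separator keeps the three indices readable.
separated = ", ".join(str(k) for k in (j, j + 1, j + 2))

print(run_together)
print(separated)
```

The first line printed is `818281`, the second is `81, 82, 83`.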