2

I have some problem with sequence generator. I have a file where each line contain one fragment (8 letters). I load it from file in to list, where each element is one fragment. It is DNA so it should go that way:

1. Takes first 8-letter element
2. Check for element in which first 7 letters is the same as last 7 letters in first.
3. Add 8th letter from second element in to sequence.

It should look like this:

ATTGCCAT
 TTGCCATA
  TGCAATAC

So sequence: ATTGCCATAC

Unfortunately it only add one element. :( First element is given (we knew it). I do it that way its first in file (first line).

Here is the code:

from os import sys
import random

def frag_get(seqfile):
    frags = []
    f_in = open(seqfile, "r")
    for i in f_in.readlines():
        frags.append(i.strip())
    f_in.close()
    return frags

def frag_list_shuffle(frags):
    random.shuffle(frags)
    return frags

def seq_build(first, frags):
    seq = first
    for f in frags:
        if seq[-7:] == f[:7]:
            seq += f[-1:]
    return seq

def errors():
    pass



if  __name__ == "__main__":
    frags = frag_get(sys.argv[1])
    first = frags[0]
    frags.remove(first)
    frags = frag_list_shuffle(frags)
    seq = seq_build(first, frags)
    check(sys.argv[2], seq)
    spectrum(sys.argv[2], sys.argv[3])

I have deleted check and spectrum functions because it's simple calculations e.g. length comparison, so it is not what cause a problem as I think.

I will be very thankfully for help!

Regards, Mateusz

Mateusz Korycinski
  • 1,037
  • 3
  • 10
  • 24

3 Answers3

3

Because your fragments are shuffled, your algorithm needs to take that into account; currently, you're just looping through the fragments once, which is unlikely to include more than a few fragments if they're not in the right order. For example, say you have 5 fragments, which I'm going to refer to by their order in your sequence. Now the fragments are slightly out of order:

1 - 3 - 2 - 4 - 5

Your algorithm will start with 1, skip 3, then match on 2, adding a base at the end. Then it'll check against 4 and 5, and then finish, never reaching fragment 3.

You could easily fix this by starting your loop again each time you add a base, however, this will scale very badly for a large number of bases. Instead, I'd recommend loading your fragments into a trie, and then searching the trie for the next fragment each time you add a base, until you've added one base for each fragment or you can no longer find a matching fragment.

dataduck
  • 588
  • 4
  • 14
  • Thanks, I will try but I am not very keen on algorithms so I was searching for the easiest way to do it, no for the fastest. Of course it need to be the fastest way. I was also thinking about removing element, which was used, and then do the search again with shorter list. But when I was trying to remove sth from the list during loop, it fails. – Mateusz Korycinski Jan 26 '11 at 17:38
  • See http://stackoverflow.com/questions/4707296/are-there-any-radix-patricia-critbit-trees-for-python/4707555#4707555, specially BioPython. – TryPyPy Jan 26 '11 at 21:51
1

works for me:

>>> seq = "ATTGCCAT"
>>> frags = ["TTGCCATA", "TGCCATAC"]
>>> for f in frags:
...         if seq[-7:] == f[:7]:
...             seq += f[-1:]
... 
>>> seq
'ATTGCCATAC'

You have a spelling error in your example, TGCAATAC should be TGCCATAC. But fixing that it works.

Lennart Regebro
  • 167,292
  • 41
  • 224
  • 251
  • I should said it in the beginning. It was working fine when elements in list was in wright order. But in real case you don't know if they are so algorithm must work on shuffled list. – Mateusz Korycinski Jan 26 '11 at 17:35
  • @Mateusz K: It will still work, but you may have to go through the list until it's empty or something. But that depends on how your data looks. – Lennart Regebro Jan 26 '11 at 23:35
0

For fun and interest, I've rewritten the problem using OO. See what you think:

import collections
import sys
import random

usage = """
Usage:
    sequence fname expected

Where
    fname:     name of file containing fragments
    expected:  result-string which should be obtained by chaining from first fragment.
"""

class Frag(str):
    MATCHLEN = 7

    def __new__(cls, s=''):
        return str.__new__(cls, s.strip())

    def head(self):
        return Frag(self[:Frag.MATCHLEN])

    def tail(self):
        return Frag(self[Frag.MATCHLEN:])

    def nexthead(self):
        return Frag(self[-Frag.MATCHLEN:])

    def check(self, s):
        return self.__eq__(s)

    def __add__(self, s):
        return Frag(str(self).__add__(s))

class Fraglist(list):
    @classmethod
    def fromFile(cls, fname):
        with open(fname, "r") as inf:
            lst = [Frag(ln) for ln in inf]
        return cls(lst)

    def shuffle(self):
        random.shuffle(self)

class Sequencer(object):
    def __init__(self, seq=None):
        super(Sequencer, self).__init__()
        self.sequences = collections.defaultdict(list)
        if seq is not None:
            for frag in seq:
                self.sequences[frag.head()].append(frag.tail())

    def build(self, frag):
        res = [frag]
        match = frag.nexthead()

        while match in self.sequences:
            next = random.choice(self.sequences[match])
            res.append(next)
            match = (match + next).nexthead()

        return Frag(''.join(res))

def main():
    if len(sys.argv) != 3:
        print usage
        sys.exit(-1)
    else:
        fname = sys.argv[1]
        expected = sys.argv[2]

        frags = Fraglist.fromFile(fname)
        frag1 = frags.pop(0)
        frags.shuffle()

        seq = Sequencer(frags)
        result = seq.build(frag1)

        if result.check(expected):
            print "Match!"
        else:
            print "No match"

if __name__=="__main__":
    main()
Hugh Bothwell
  • 55,315
  • 8
  • 84
  • 99
  • to be honest I'm not so keen in python to understand everything. But it seems to be good. I will talk with my friend, with whom I'm doing that project, to look at it. Thanks so much! ;) – Mateusz Korycinski Jan 28 '11 at 17:25