How to read a DNA sequence more efficiently?

Question

I wrote some Python code to read a DNA sequence (so that I can run a motif alignment on it later), but I'm looking for a more efficient way to do this.

See below if you can help:

handle = open("a.fas.txt", "r")
a = handle.readlines()[1:]
a = ''.join([x.strip() for x in a])
with open("Output.txt", "w") as text_file:
    text_file.write(a)

f = 0
z = 100
b = ''
while f < len(a):
    b += a[f:z]+'\n'
    f += 1
    z += 1
with open("2.txt", "w") as runner_mtfs:
   runner_mtfs.write(b)

In summary, I want to run a number of analyses on each line of b, but I don't know of a more efficient way to do this than writing out every 100-base-pair window separately. The output file is more than 500 megabytes. Any suggestions?

The first part of the code just reads the DNA sequence and joins all the lines together; the second part splits it into windows of 100 base pairs.

Can you provide a sample input and output of the DNA code you are sequencing? — Shrey, Mar 11 '16 at 04:18
Here is the input: https://www.genome.wisc.edu/pub/sequence/U00096.2.fas and here is the first line of the output: agcttttcattctgactgcaacgggcaatatgtctctgtgtggattaaaaaaagagtgtctgatagcagcttctgaactggttacctgccgtgagtaaat — Antaeus, Mar 11 '16 at 04:19
Strong suggestion: give your variables more descriptive/intelligible names than `a`, `f`, `z`, `b`! — jtbandes, Mar 11 '16 at 04:21
IMO it hurts the readability of the question. Separately: what kind of "analysis" are you doing? Which part of this do you see as "inefficient" and why? — jtbandes, Mar 11 '16 at 04:22
I'm going to apply a convolution filter, and make a matrix from each line, based on the filter, in order to find some patterns. — Antaeus, Mar 11 '16 at 04:23
Check out R & RStudio. I had to do a project like this in school. Our client suggested doing it in R, but I wanted to do it in Python. In the end it was much more straightforward and faster in R. We took the DNA txt files and turned them into CSV files, checked for blanks, and then wrote the tests to perform on them. Then we linked them to another file with patient info, because the tests have to be blind. — lciamp, Mar 11 '16 at 04:42

cge · Answer 1 · 2016-03-11T04:49:36.943

The major problem I see here is that you're writing everything out into a file. There's no point in doing this. The large output file you create is very redundant, and loading it back in when you do your analysis isn't helpful.

After you've loaded the file initially, every window you're interested in looking at is a[x:x+100] for some x. You shouldn't need to actually generate those windows explicitly at all: there shouldn't be any benefit to doing so. Go through, and generate those matrices from each window of a directly.
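A minimal sketch of that idea, assuming the sequence is already loaded into a string: the generator below yields each window on demand instead of storing them all. The names iter_windows and gc_count are hypothetical, and gc_count merely stands in for whatever per-window analysis is actually needed.

```python
def iter_windows(sequence, window_size=100):
    """Yield every full-length window of the sequence, one at a time."""
    for start in range(len(sequence) - window_size + 1):
        yield sequence[start:start + window_size]


def gc_count(window):
    """Placeholder analysis (an assumption): count the G and C bases."""
    return window.count('g') + window.count('c')


# Tiny made-up sequence so the example runs without any input file.
sequence = "agcttttcattctgactgca"
results = [gc_count(window) for window in iter_windows(sequence, window_size=5)]
print(len(results))
```

Unlike the loop in the question, this yields only full-length windows, so the shorter trailing fragments are skipped.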

If you really do need the whole thing, generate it as a numpy array. Additionally, if I'm not using any degenerate base codes, I convert the sequence into uint8s, using 0, 1, 2, 3 for A, C, G, T. This can speed things up, especially if you need to take complements at any point, since with this encoding a complement is a simple bit operation.
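To illustrate the bit trick, here is a small sketch in plain Python (the same XOR applies elementwise to a numpy uint8 array). With A, C, G, T encoded as 0, 1, 2, 3, each base and its complement sum to 3, so XOR with 3 swaps A with T and C with G. The helper names encode, complement, and decode are hypothetical, and the code assumes the sequence holds only a, c, g, t.

```python
ENCODE = {'a': 0, 'c': 1, 'g': 2, 't': 3}
DECODE = 'acgt'  # index i holds the letter for code i


def encode(sequence):
    """Map each base to its small-integer code, stored one byte per base."""
    return bytes(ENCODE[base] for base in sequence.lower())


def complement(codes):
    """XOR with 3 flips 0<->3 (A<->T) and 1<->2 (C<->G)."""
    return bytes(code ^ 3 for code in codes)


def decode(codes):
    """Turn integer codes back into a lowercase base string."""
    return ''.join(DECODE[code] for code in codes)


codes = encode("aacg")
print(decode(complement(codes)))        # complement
print(decode(complement(codes)[::-1]))  # reverse complement
```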

Numpy can generate the array quite efficiently using stride_tricks, as noted in this blog post:

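import numpy

def rolling_window(a, window):
    # Build a (len(a) - window + 1, window) view onto a without copying data.
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return numpy.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

# Skip the FASTA header line and join the remaining lines into one string.
with open("U00096.2.fas.txt", "r") as handle:
    a = ''.join(x.strip() for x in handle.readlines()[1:])
# Store one single-byte character per base.
b = numpy.array(list(a), dtype='S1')
windows = rolling_window(b, 100)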

Or, converting to ints:

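import numpy

def rolling_window(a, window):
    # Build a (len(a) - window + 1, window) view onto a without copying data.
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return numpy.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

# Skip the FASTA header line and join the remaining lines into one string.
with open("U00096.2.fas.txt", "r") as handle:
    a = ''.join(x.strip() for x in handle.readlines()[1:])
# Map each base to a small integer; any character other than a, c, g, t raises KeyError.
conv = {'a': 0, 'c': 1, 'g': 2, 't': 3}
b = numpy.array([conv[x] for x in a], dtype=numpy.uint8)
windows = rolling_window(b, 100)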

This code is around ten times faster than yours on my machine.
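As an aside, on NumPy 1.20 or newer the same kind of view is available from numpy.lib.stride_tricks.sliding_window_view, which avoids computing shapes and strides by hand. This is a sketch rather than the approach timed above, and the short byte string is made-up data standing in for the genome.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Made-up stand-in for the joined genome string, as raw ASCII bytes.
sequence = b"agcttttcattctgactgca"

# View the bytes as uint8 without copying, then take every window of 5 bases.
bases = np.frombuffer(sequence, dtype=np.uint8)
windows = sliding_window_view(bases, 5)

print(windows.shape)         # one row per window, one column per position
print(windows[0].tobytes())  # first window, converted back to bytes
```

The returned view is read-only by default, which guards against accidentally writing through overlapping windows.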

It's a good suggestion (also, just by doing the conversion and saving the file in binary, I greatly reduced its size). — Antaeus, Mar 11 '16 at 04:45

score 1 · Answer 2 · edited May 23 '17 at 12:23

If it is a FASTA-like file, there is a good chance that it contains more than one sequence.
There are a lot of examples of reading large files in Python on Stack Overflow; some useful ways are given here. I usually use the recipe given in the top answer to that question (iterating over the file inside a with open(...) block). It's fast and it uses less memory.

It seems that you want to process data with fixed-sized sliding window. I would do it like this:

def load_fasta(fasta_file_name, sliding_window_size=100):
    buffer = ''
    with open(fasta_file_name) as f:
        for line in f:
            if line.startswith('>'):
                # Header line: skip it (or extract info from it) and start a new sequence.
                buffer = ''
            else:
                # Append the current line's bases to the pending buffer.
                buffer += line.strip('\r\n')
                offset = 0  # zero-based offset into the current buffer
                while offset + sliding_window_size <= len(buffer):
                    yield buffer[offset:offset + sliding_window_size]
                    offset += 1
                # Keep only the tail that has not yet started a full window.
                buffer = buffer[offset:]

for window in load_fasta("a.fas.txt", 100):
    # Do some processing with the sliding-window data.
    print(window)

If you actually want to process portions of data with length less than 100 (or in my example, less than sliding window size), you will have to slightly modify that function (at the appearance of new comment line and at the end of processing).
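One way that modification could look, as a hedged sketch: the variant below takes any iterable of lines (so it also works on an open file), tags each window with its header, and emits a record that never reaches the window size as a single short window, both when a new header appears and at the end of the input. The name fasta_windows and the choice to yield (header, window) pairs are assumptions for illustration.

```python
def fasta_windows(lines, window_size=100):
    """Yield (header, window) pairs; short records come out as one short window."""
    header = None
    buffer = ''
    emitted = False  # has the current record produced a full window yet?
    for line in lines:
        line = line.strip()
        if line.startswith('>'):
            # New record: flush the previous one if it was too short for any window.
            if buffer and not emitted:
                yield header, buffer
            header, buffer, emitted = line[1:], '', False
            continue
        buffer += line
        while len(buffer) >= window_size:
            yield header, buffer[:window_size]
            buffer = buffer[1:]  # slide forward by one base
            emitted = True
    # End of input: flush the last record if it was too short for any window.
    if buffer and not emitted:
        yield header, buffer


# Made-up records so the example runs without an input file.
example = [">s1", "acgta", "cg", ">s2", "ac"]
for header, window in fasta_windows(example, window_size=4):
    print(header, window)
```

With a real file, the call would be fasta_windows(open_file_object, 100), since iterating over a file yields its lines.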

You can also use Biopython.

score 1 · Answer 3 · answered Mar 11 '16 at 05:21

Here's a class that does a few things you might want.

"""
Read in genome of E. Coli (or whatever) from given input file,
process it in segments of 100 basepairs at a time.

Usage: 100pairs [-n <pairs>] [-p] <file>

Arguments:
  <file>                 Input file.

Options:
  -n,--numpairs <pairs>  Use <pairs> per iteration. [default: 100]
  -p,--partial           Allow partial sequences at end of genome.
"""
import docopt

class GeneBuffer:
    def __init__(self, path, bases=100, partial=True):
        self._buf = None
        self.bases = int(bases)
        self.partial = partial
        self.path = path

    def __enter__(self):
        self._file = open(self.path, 'r')
        self._header = next(self._file)
        return self

    def __exit__(self, *args):
        if self._file:
            self._file.close()

    def __iter__(self):
        return self

    def __next__(self):
        if self._buf is None:
            self._buf = ''

        while self._file and len(self._buf) < self.bases:
            try:
                self._buf += next(self._file).strip()
            except StopIteration:
                self._file.close()
                self._file = None
                break

        if len(self._buf) < self.bases:
            if len(self._buf) == 0 or not self.partial:
                raise StopIteration

        result = self._buf[:self.bases]
        self._buf = self._buf[1:]

        return result

def analyze(basepairs):
    """
    Dammit, Jim! I'm a computer programmer, not a geneticist!
    """
    print(basepairs)

def main(args):
    numpairs = args['--numpairs']
    partial = args['--partial']
    with GeneBuffer(args['<file>'], bases=numpairs, partial=partial) as genome:
        print("Header: ", genome._header)
        count = 0
        for basepairs in genome:
            count += 1
            print(count, end=' ')
            analyze(basepairs)

if __name__ == '__main__':
    args = docopt.docopt(__doc__)
    main(args)
