
I have a text file from which I need to extract chunks of text that alternate with header lines, but each chunk can span any number of lines (it's a FASTA file, for any bioinformatics people). It's basically set up like this:

> header, info, info
TEXT-------------------------------------------------------
----------------------------------------------------
>header, info...
TEXT-----------------------------------------------------

... and so forth.

I am trying to extract the "TEXT" part. Here's the code I have set up:

for line in ffile:
    if line.startswith('>'):
        # do stuff to header line
        try:
            sequence = ""
            seqcheck = ffile.next() # line after the header will always be the beginning of TEXT
            while not seqcheck.startswith('>'):
                sequence += seqcheck
                seqcheck = ffile.next()
        except:       # iteration error check
            break

This doesn't work, because every time I call next(), it continues the for loop, which results in me skipping a lot of lines and losing a lot of data. How can I just "peek" into the next line, without moving the iterator forward?

Manuel Allenspach
biohax2015
  • Why do you have that inner loop at all? `if line.startswith(">"): [do header stuff] else: [do text stuff]` – tobias_k Jun 04 '14 at 18:40

5 Answers


I think it would be a lot easier to simply check whether each line starts with '>' and keep the lines that don't.

>>> content = '''> header, info, info
... TEXT-------------------------------------------------------
... ----------------------------------------------------
... >header, info...
... TEXT-----------------------------------------------------'''
>>> 
>>> from StringIO import StringIO  # Python 2; on Python 3 use: from io import StringIO
>>> f = StringIO(content)
>>> 
>>> my_data = []
>>> for line in f:
...   if not line.startswith('>'):
...     my_data.append(line)
... 
>>> ''.join(my_data)
'TEXT-------------------------------------------------------\n----------------------------------------------------\nTEXT-----------------------------------------------------'
>>> 

Update:

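@tobias_k this version keeps the lines of each block separate (the first item yielded is an empty list, because the input starts with a header):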

>>> def get_content(f):
...   my_data = []
...   for line in f:
...     if line.startswith('>'):
...       yield my_data
...       my_data = []
...     else:
...       my_data.append(line)
...   yield my_data  # the last one
... 
>>> 
>>> f.seek(0)
>>> for i in get_content(f):
...   print i
... 
[]
['TEXT-------------------------------------------------------\n', '----------------------------------------------------\n']
['TEXT-----------------------------------------------------']
>>> 
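If each header should stay attached to its text, the generator above can be adapted to yield (header, sequence) pairs, which also avoids the empty first list. This is only a minimal sketch for Python 3; the name `read_records` and the decision to strip newlines and skip blank lines are assumptions of mine, not part of the answer:

```python
from io import StringIO


def read_records(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-style lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line
            chunks = []
        elif header is not None and line:
            # Text before the first header and blank lines are ignored.
            chunks.append(line)
    # Emit the final record, which has no following header to trigger it.
    if header is not None:
        yield header, "".join(chunks)


sample = StringIO(
    "> header, info, info\n"
    "TEXT-----\n"
    "-----\n"
    ">header, info...\n"
    "TEXT---\n"
)
records = list(read_records(sample))
for header, sequence in records:
    print(header, sequence)
```

Because it is a generator, only one record is held in memory at a time, so it should also work when iterating directly over an open file.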
Vor

Have you considered a regex?

txt='''\
> header, info, info
TEXT----------------------------------------------------------------
TEXT2-------------------------------------------
>header, info...
TEXT-----------------------------------------------------'''


import re

for header, data in ((m.group(1), m.group(2)) for m in re.finditer(r'^(?:(>.*?$)(.*?)(?=^>|\Z))', txt, re.S | re.M)):
    # process header
    # process data
    print header, data


That gives you each header, together with the data that follows it, as a tuple that you can process however you need.


If your file is huge, you can use mmap to avoid having to read the entire file into memory.
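As a rough illustration of the mmap suggestion, here is a sketch for Python 3. It assumes an ASCII-encoded file and uses a bytes version of the pattern above, since a memory-mapped file exposes bytes rather than str. The temporary file and the names `RECORD` and `found` exist only to make the example self-contained:

```python
import mmap
import os
import re
import tempfile

# Bytes pattern, because an mmap object exposes bytes rather than str.
RECORD = re.compile(rb"^(>.*?$)(.*?)(?=^>|\Z)", re.S | re.M)

# Write a small throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("wb", suffix=".fasta", delete=False) as tmp:
    tmp.write(b"> header, info, info\nTEXT-----\n-----\n>header, info...\nTEXT---\n")
    path = tmp.name

found = []
with open(path, "rb") as handle:
    # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
    mm = mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ)
    for match in RECORD.finditer(mm):
        header = match.group(1).decode("ascii")
        # Drop the line breaks inside the text block.
        data = match.group(2).replace(b"\n", b"").decode("ascii")
        found.append((header, data))
    mm.close()
os.remove(path)

print(found)
```

The operating system pages the file in on demand, so the regex can scan it without the program reading the whole file into a Python string first.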

dawg

Here's another approach. Contrary to my above comment, this does use a nested loop to collect all the lines belonging to one text block (so the logic for this is not so spread-out), but does so slightly differently:

for line in ffile:
    if not line.startswith('>'):
        sequence = line
        for line in ffile:
            if line.startswith('>'): break
            sequence += line
        print "<text>", sequence
    if line.startswith('>'):
        print "<header>", line

First, it uses a second for loop (over the very same ffile iterator as the outer loop), so there's no need for try/except. Second, no lines are lost: the current line is fed into the sequence, and the non-header case is handled first. By the time the second if check is reached, the line variable holds the header line at which the nested loop stopped (don't use else here, or this won't work).

tobias_k
  • Update: If you want to process the text lines together with the preceding header, you could store the header in a variable, like `lastHeader`, and use that in the text-case. (Similar to FoffT's answer, but the other way around, and still having the advantage that you have your text-processing logic concentrated in just one place.) – tobias_k Jun 04 '14 at 19:08
  • So the second loop will pick up where the first one left off? – biohax2015 Jun 05 '14 at 01:27
  • @biohax2015 Since they are using the same iterator, yes. But the next header line will already be consumed by the inner loop, that's why I am using the same name for the variable -- `line` -- and putting the header-case second, so it will check that line produced by the inner loop. – tobias_k Jun 05 '14 at 07:44
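The `lastHeader` idea from the comments might look roughly like this in Python 3. It is a sketch of the same nested-loop technique, not the answerer's own code; the sample data, the `records` list and the snake_case name `last_header` are assumptions made for illustration:

```python
from io import StringIO

ffile = StringIO(
    "> header, info, info\n"
    "TEXT-----\n"
    "-----\n"
    ">header, info...\n"
    "TEXT---\n"
)

last_header = None  # hypothetical name for the most recent header line
records = []
for line in ffile:
    if not line.startswith(">"):
        sequence = line
        # The inner loop advances the same iterator as the outer loop.
        for line in ffile:
            if line.startswith(">"):
                break
            sequence += line
        records.append((last_header, sequence))
    # Deliberately not an else: after the inner loop, `line` may be a header.
    if line.startswith(">"):
        last_header = line.rstrip("\n")

print(records)
```

When the inner loop runs out of lines, the outer loop finds the iterator exhausted as well and simply ends, which is why no try/except is needed.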

My recommendation for peeking is to use a list and enumerate:

lines = ffile.readlines()
for i, line in enumerate(lines):
    if line.startswith('>'):
        sequence = ""
        for l in lines[i+1:]:
            if l.startswith('>'):
                break
            sequence += l
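With the lines in a list, "peeking" is just a matter of looking at `lines[i + 1]`. The variant below is a sketch that walks a single index instead of slicing `lines[i+1:]`, so the tail of the list is not copied for every header. The sample data and the `sequences` list are made up for the example:

```python
lines = [
    "> header, info, info\n",
    "TEXT-----\n",
    "-----\n",
    ">header, info...\n",
    "TEXT---\n",
]

sequences = []
i = 0
while i < len(lines):
    if lines[i].startswith(">"):
        sequence = ""
        # Peek at lines[i + 1] without consuming it; advance only on text.
        while i + 1 < len(lines) and not lines[i + 1].startswith(">"):
            i += 1
            sequence += lines[i]
        sequences.append(sequence)
    i += 1

print(sequences)
```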
otus

Here's a method with very little change to your original code. It depends on your situation, but sometimes it's easier to just do what you want to do and not have to worry about re-organizing / refactoring everything else! If you want to push something BACK so it gets iterated out again, then just make it so you can!

Here we instantiate a deque() object that holds lines which were read but still need processing. We then wrap the ffile iterator in a generator that first drains any entries from the deque and only then fetches new lines from ffile.

So whenever we read something that needs reprocessing somewhere else, append it to the deque object and break out.

import cStringIO,collections
original_ffile=cStringIO.StringIO('''
> header, info, info
TEXT----------------------------------------------------------------
TEXT2-------------------------------------------
>header, info...
TEXT-----------------------------------------------------''')

def peaker(_iter,_buffer):
    popleft=_buffer.popleft
    while True:
        while _buffer: yield popleft() # this implements FIFO-style
        try:
            yield next(_iter)
        except StopIteration: # required since PEP 479 (Python 3.7+), harmless in Python 2
            return
buf=collections.deque()
push_back=buf.append
ffile=peaker(original_ffile,buf)
for line in ffile:
    if line.startswith('>'):
        print "found a header! %s"%line[:-1]
        # do stuff to header line
        sequence = ""
        for seqcheck in ffile:
            if seqcheck.startswith('>'):
                print "oops, we've gone too far, pushing back: %s"%seqcheck[:-1]
                push_back(seqcheck)
                break
            sequence += seqcheck

Output:

found a header! > header, info, info
oops, we've gone too far, pushing back: >header, info...
found a header! >header, info...
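If a true "peek" is preferred over pushing lines back, a small wrapper with a one-item lookahead is another option. The class below is a hypothetical helper written for Python 3, not something from the standard library, and the sample data is invented for the example:

```python
from io import StringIO

_MISSING = object()  # sentinel meaning "nothing buffered"


class Peekable:
    """Iterator wrapper with a one-item lookahead (hypothetical helper)."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._buffered = _MISSING

    def __iter__(self):
        return self

    def __next__(self):
        # Hand out the buffered item first, if peek() fetched one.
        if self._buffered is not _MISSING:
            item, self._buffered = self._buffered, _MISSING
            return item
        return next(self._it)

    def peek(self, default=None):
        """Return the next item without consuming it, or default at the end."""
        if self._buffered is _MISSING:
            self._buffered = next(self._it, _MISSING)
        return default if self._buffered is _MISSING else self._buffered


ffile = Peekable(StringIO(
    "> header, info, info\nTEXT-----\n-----\n>header, info...\nTEXT---\n"
))
records = []
for line in ffile:
    if line.startswith(">"):
        sequence = ""
        while True:
            upcoming = ffile.peek()
            # Stop at the end of input or when the next line is a header.
            if upcoming is None or upcoming.startswith(">"):
                break
            sequence += next(ffile)
        records.append((line.rstrip("\n"), sequence))

print(records)
```

The third-party more_itertools package provides a `peekable` wrapper along similar lines, which may be preferable to maintaining a helper like this yourself.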
parity3