3

I have been facing a problem with Python for a few days. I am a bioinformatician with very little programming experience, and I am working with huge text files (approx. 25 GB) that I have to process.

I have to read the text file in groups of 4 lines at a time: the first 4 lines have to be read and processed, then the next group of 4 lines, and so on.

Obviously I cannot use readlines(), because it would load the whole file into memory, and I need to use each of the 4 lines for some string recognition.

I thought about using a for loop with range():

openfile = open(path, 'r')

for elem in range(0, len(openfile), 4):
    line1 = openfile.readline()
    line2 = openfile.readline()
    line3 = openfile.readline()
    line4 = openfile.readline()
    # (process lines...)

Unfortunately this is not possible, because a file opened in reading mode cannot be iterated over and treated like a list or a dictionary.

Can anybody please help me loop over this properly?

Thanks in advance

WarioBrega
  • In Python, files opened for reading can easily be iterated over in a line-oriented manner - see the section on `file.next()` here: http://docs.python.org/library/stdtypes.html?highlight=file.next#file.next – martineau Mar 14 '12 at 19:24

5 Answers

5

This has low memory overhead. It relies on the fact that a file object is an iterator that yields one line at a time.

from itertools import islice

def grouped(iterator, size):
    # repeatedly pull `size` lines from the iterator; stops when the file is exhausted
    return iter(lambda: tuple(islice(iterator, size)), ())

Use it like this:

for line1, line2, line3, line4 in grouped(your_open_file, size=4):
    do_stuff_with_lines()

Note: this code assumes that the file does not end with a partial group of lines.

Steven Rumbalski
3

You're reading a fastq file, right? You're most probably reinventing the wheel - you could just use Biopython, which has tools for dealing with common biology file formats. For instance, see this tutorial on doing something with fastq files - it looks basically like this:

from Bio import SeqIO
for record in SeqIO.parse("SRR020192.fastq", "fastq"):
    # do something with the record, using record.seq, record.id etc.
    print record.id

More on Biopython SeqRecord objects here.

Here is another Biopython fastq-processing tutorial, including a variant for doing this faster using a lower-level interface, like this:

from Bio.SeqIO.QualityIO import FastqGeneralIterator
for title, seq, qual in FastqGeneralIterator(open("untrimmed.fastq")):
    # do things with the title, seq, qual values
    print title

There's also the HTSeq package, with more deep-sequencing-specific tools, which I actually use more often.
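
For reference, a minimal sketch of reading a fastq file with HTSeq might look like this (assuming HTSeq's FastqReader interface; the filename is a placeholder and attribute names may vary between versions):

import HTSeq

# FastqReader yields one read per 4-line fastq record
for read in HTSeq.FastqReader("untrimmed.fastq"):
    # each read object carries the record's name and sequence
    print read.name
    print read.seq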

By the way, if you don't know about Biostar already, you could take a look - it's a StackExchange-format site specifically for bioinformatics.

weronika
2

You could use an infinite loop, and break out of it when you reach the end of the file.

while True:
    line1 = f.readline()
    if not line1:
        break

    line2 = f.readline()
    line3 = f.readline()
    line4 = f.readline()
    # process lines
Mark Byers
2

There is a method for lazily reading large files in Python here. You can use that approach and process four lines at a time. Note that it is not necessary to perform four read operations, do your processing, and then perform four more read operations, over and over. Instead, you can read a chunk of a few hundred or a few thousand lines from the file and process it four lines at a time; when you're finished with those lines, you continue reading the file's contents. A rough sketch of this is below.
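
A minimal sketch of that chunked approach (the chunk size of 4000 lines and the process_group helper are illustrative placeholders, not part of the original suggestion):

def process_group(lines):
    # placeholder for the string recognition done on each group of 4 lines
    pass

with open(path, 'r') as f:
    while True:
        # read up to 4000 lines (a multiple of 4) into memory at once;
        # readline() returns '' at end of file, so empty strings are dropped
        chunk = [line for line in (f.readline() for _ in range(4000)) if line]
        if not chunk:
            break
        # hand the chunk to the per-record logic four lines at a time
        for i in range(0, len(chunk), 4):
            process_group(chunk[i:i + 4])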

Simeon Visser
  • Most of what you say is true, but making a multi-line, chunk-oriented version of the algorithm is easier said than done... especially for someone without basic programming skills. – martineau Mar 14 '12 at 19:09
0

Here is a way of doing it that I can't take credit for but is quite reasonable:

import itertools

# zipping four references to the same file iterator yields consecutive 4-line groups
for name, seq, comment, qual in itertools.izip_longest(*[openfile] * 4):
    print name
    print seq
    print comment
    print qual
andrew