3

I have been facing a problem with Python for a few days. I am a bioinformatician with very little programming experience, and I am working with huge text files (approx. 25 GB) that I have to process.

I have to read the text file in groups of 4 lines at a time: the first 4 lines have to be read and processed, then the next group of 4 lines, and so on.

Obviously I cannot use readlines(), because it would load the whole file into memory, and I need to use each of the 4 lines for some string recognition.

I thought about using a for loop with range():

openfile = open(path, 'r')

for elem in range(0, len(openfile), 4):
    line1 = openfile.readline()
    line2 = openfile.readline()
    line3 = openfile.readline()
    line4 = openfile.readline()
    # (process lines...)

Unfortunately this is not possible, because a file opened in reading mode cannot be iterated over and treated like a list or a dictionary.

Can anybody please help me loop over this properly?

Thanks in advance

WarioBrega
  • In Python, files opened for reading can easily be iterated over in a line-oriented manner - see the section on `file.next()` here: http://docs.python.org/library/stdtypes.html?highlight=file.next#file.next – martineau Mar 14 '12 at 19:24

5 Answers

5

This has low memory overhead. It relies on the fact that a file object is an iterator that yields one line at a time.

from itertools import islice

def grouped(iterator, size):
    # repeatedly pull `size` lines from the iterator; stops when the file is exhausted
    return iter(lambda: tuple(islice(iterator, size)), ())

Use it like this:

for line1, line2, line3, line4 in grouped(your_open_file, size=4):
    do_stuff_with_lines()

Note: this code assumes that the file does not end with a partial group of lines.

Steven Rumbalski
3

You're reading a fastq file, right? You're most probably reinventing the wheel - you could just use Biopython, which has tools for dealing with common biology file formats. For instance, see this tutorial on doing something with fastq files - it looks basically like this:

from Bio import SeqIO
for record in SeqIO.parse("SRR020192.fastq", "fastq"):
    # do something with the record, using record.seq, record.id etc.
    print record.id

More on Biopython SeqRecord objects here.

Here is another Biopython fastq-processing tutorial, including a variant for doing this faster using a lower-level interface, like this:

from Bio.SeqIO.QualityIO import FastqGeneralIterator
for title, seq, qual in FastqGeneralIterator(open("untrimmed.fastq")):
    # do things with the title, seq, qual values
    print title

There's also the HTSeq package, with more deep-sequencing-specific tools, which I actually use more often.
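
For reference, a minimal sketch of reading a fastq file with HTSeq might look like this (assuming HTSeq's FastqReader interface; the filename is a placeholder and attribute names may vary between versions):

import HTSeq

# FastqReader yields one read per 4-line fastq record
for read in HTSeq.FastqReader("untrimmed.fastq"):
    # each read object carries the record's name and sequence
    print read.name
    print read.seq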

By the way, if you don't know about Biostar already, you could take a look - it's a StackExchange-format site specifically for bioinformatics.

weronika
2

You could use an infinite loop, and break out of it when you reach the end of the file.

while True:
    line1 = f.readline()
    if not line1:
        break

    line2 = f.readline()
    line3 = f.readline()
    line4 = f.readline()
    # process lines
Mark Byers
2

There is a method for lazily reading large files in Python here. You can use that approach and process four lines at a time. Note that it is not necessary to perform four read operations, do your processing, and then perform four more read operations, over and over. Instead, you can read a chunk of a few hundred or a few thousand lines from the file and process it four lines at a time; when you're finished with those lines, you continue reading the file's contents. A rough sketch of this is below.
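
A minimal sketch of that chunked approach (the chunk size of 4000 lines and the process_group helper are illustrative placeholders, not part of the original suggestion):

def process_group(lines):
    # placeholder for the string recognition done on each group of 4 lines
    pass

with open(path, 'r') as f:
    while True:
        # read up to 4000 lines (a multiple of 4) into memory at once;
        # readline() returns '' at end of file, so empty strings are dropped
        chunk = [line for line in (f.readline() for _ in range(4000)) if line]
        if not chunk:
            break
        # hand the chunk to the per-record logic four lines at a time
        for i in range(0, len(chunk), 4):
            process_group(chunk[i:i + 4])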

Simeon Visser
  • Most of what you say is true, but making a multi-line, chunk-oriented version of the algorithm is easier said than done... especially for someone without basic programming skills. – martineau Mar 14 '12 at 19:09
0

Here is a way of doing it that I can't take credit for but is quite reasonable:

import itertools

# zipping four references to the same file iterator yields consecutive 4-line groups
for name, seq, comment, qual in itertools.izip_longest(*[openfile] * 4):
    print name
    print seq
    print comment
    print qual
andrew