Go to a specific line and read the next few in Python

Question

I have a huge (61 GB) FASTQ file from which I want to create a random subset, but I cannot load it into memory. The problem with FASTQ files is that every four lines belong together; otherwise I would just create a list of random integers and write only the lines at those positions to my subset file.

So far, I have this:

import random
num = []    
while len(num) < 50000000:
    ran = random.randint(0,27000000)
    if (ran%4 == 0) and (ran not in num):
        num.append(ran)
num = sorted(num)

fastq = open("all.fastq", "r", 4)
subset = open("sub.fastq", "w")
for i,line in enumerate(fastq):
    for ran in num:
        if ran == i:
            subset.append(line)

I have no idea how to reach the next three lines in the file before going to the next random integer. Can someone help me?

You can replace the first half of your code with `random.sample`. — Katriel, Jan 17 '13 at 09:05
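A minimal sketch of what that suggestion could look like, assuming the total number of lines in the file is known in advance. The helper name `pick_record_starts` and its parameters `total_lines`, `n_records` and `seed` are hypothetical, and the small numbers are for illustration only.

```python
import random

def pick_record_starts(total_lines, n_records, seed=None):
    """Return sorted, distinct line numbers at which the sampled records start."""
    rng = random.Random(seed)
    # range(0, total_lines, 4) describes every record start without building a list.
    # random.sample draws without replacement, so no duplicate check is needed.
    starts = rng.sample(range(0, total_lines, 4), n_records)
    return sorted(starts)

# 40 lines means 10 records; pick 3 of them.
starts = pick_record_starts(total_lines=40, n_records=3, seed=0)
```

Note that `random.sample` raises `ValueError` if `n_records` is larger than the number of records available, so the requested subset size has to fit the file.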

score 1 · Answer 1 · edited May 23 '17 at 11:56

The idea is that you can sample from a generator without random access, by iterating through it and choosing (or not) each element in turn.
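One common technique that fits this description is reservoir sampling, combined with reading the file four lines at a time. The following is a sketch under that assumption, not necessarily the code the answer had in mind; the function names `read_records` and `reservoir_sample` are made up for illustration.

```python
import io
import random
from itertools import islice

def read_records(handle, lines_per_record=4):
    """Yield each record as a tuple of consecutive lines from an open file."""
    while True:
        record = tuple(islice(handle, lines_per_record))
        if not record:
            return
        yield record

def reservoir_sample(records, k, rng=random):
    """Keep k records chosen uniformly at random from an iterable of unknown length."""
    reservoir = []
    for n, record in enumerate(records):
        if n < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Replace a random slot with probability k / (n + 1).
            j = rng.randint(0, n)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir

# Tiny in-memory stand-in for a FASTQ file with 10 records.
demo = io.StringIO("".join("@r%d\nACGT\n+\nIIII\n" % i for i in range(10)))
sample = reservoir_sample(read_records(demo), k=3, rng=random.Random(0))
```

This needs only one pass and no line count up front, but the `k` selected records are held in memory and do not come out in file order. If the line count is not a multiple of four, the last tuple will be shorter than four lines.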

answered Jan 17 '13 at 09:03

Katriel

The example you linked for iterating over the file doesn't seem to work with files. – Lilith-Elina Jan 17 '13 at 09:51
@Lilith-Elina the [answer](http://stackoverflow.com/a/434411/398968) works fine for me. What problem do you get? – Katriel Jan 17 '13 at 11:09
Ah, for that answer, I have the problem that izip_longest neither works on my PC nor on our Linux server. – Lilith-Elina Jan 17 '13 at 11:21
Have you imported it from itertools? `from itertools import izip_longest` Alternatively just `import itertools` and then `itertools.izip_longest(...)` – Katriel Jan 17 '13 at 11:22

Thorsten Kranz · Accepted Answer · 2013-01-17T10:09:31.883

0

You could try this:

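import random
from collections import deque

# Record starts are multiples of 4. Use // so the arguments stay integers on
# Python 3. set() drops repeated draws, which would otherwise stop any later
# record from matching, so the subset can hold fewer records than requested.
starts = set(random.randint(0, 27000000 // 4) * 4 for i in range(50000000 // 4))
num = deque(sorted(starts))  # popleft() is O(1); list.pop(0) is O(n)

lines_to_write = 0
with open("all.fastq", "r") as fastq:
    with open("sub.fastq", "w") as subset:
        for i, line in enumerate(fastq):
            # Stop only after the last chosen record has been written in full.
            if not num and lines_to_write == 0:
                break
            if num and i == num[0]:
                num.popleft()
                lines_to_write = 4
            if lines_to_write > 0:
                lines_to_write -= 1
                subset.write(line)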

edited Jan 17 '13 at 10:09

answered Jan 17 '13 at 09:05

Thorsten Kranz

You need to check if `num` is empty. Also, `i = num[0]` should be `i == num[0]` – Lev Levitsky Jan 17 '13 at 09:16
Won't that stop and throw an error once num is empty but the file has still more lines to iterate over? Ah, I didn't see @LevLevitsky already mentioned that. – Lilith-Elina Jan 17 '13 at 09:42
You both are right. I did this code without trying, and am glad you reviewed it. Now it should (hopefully) work. – Thorsten Kranz Jan 17 '13 at 10:10
It does for my small test files. :-) – Lilith-Elina Jan 17 '13 at 10:11
Great! Btw: How long does it run on a 61 GB file? – Thorsten Kranz Jan 17 '13 at 10:12
I guess about an hour :) Depending on the storage drive, of course. – Lev Levitsky Jan 17 '13 at 10:32
Yes, I would estimate that, too. – Lilith-Elina Jan 17 '13 at 10:35
Only, it takes a lot longer than that... Oh well, at least it seems to work. – Lilith-Elina Jan 18 '13 at 07:38
Maybe you could use `mmap` to speed things up. Then use `for i, line in enumerate(iter(mm.readline, "")):`, with `mm` being the mmap. In my experience, this might give a 10-30% speed boost. – Thorsten Kranz Jan 18 '13 at 07:44
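A rough sketch of the `mmap` idea for Python 3, where `mm.readline` returns bytes and the sentinel therefore has to be `b""` rather than `""`. The helper name `iter_mmap_lines` is hypothetical, and whether this is actually faster depends on the system, so treat the speed-up as something to measure.

```python
import mmap
import os
import tempfile

def iter_mmap_lines(path):
    """Yield lines (as bytes) from a memory-mapped, read-only file."""
    with open(path, "rb") as handle:
        # Length 0 maps the whole file; this fails for an empty file.
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # readline returns b"" at end of file, which ends the iteration.
            for line in iter(mm.readline, b""):
                yield line

# Small temporary file standing in for all.fastq.
with tempfile.NamedTemporaryFile("wb", suffix=".fastq", delete=False) as tmp:
    tmp.write(b"@r0\nACGT\n+\nIIII\n")
lines = list(iter_mmap_lines(tmp.name))
os.remove(tmp.name)
```

Because the lines are bytes, the output file would need to be opened in binary mode (`"wb"`) to write them back unchanged.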
