I have a huge (61 GB) FASTQ file from which I want to draw a random subset, but the file is too large to load into memory. The difficulty with FASTQ is that every four lines belong together as one record; otherwise I would just generate a list of random integers and write only the lines at those positions to my subset file.
So far, I have this:
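import random

# Pick random line indices that are multiples of 4, i.e. the first line of a record.
# NOTE: this loop only terminates if the target count does not exceed the
# number of multiples of 4 in the randint range below.
num = set()  # a set makes the membership test O(1) and ignores duplicates
while len(num) < 50000000:
    ran = random.randint(0, 27000000)
    if ran % 4 == 0:
        num.add(ran)

# The third argument of open() is a buffer size in bytes, not a line count, so it is dropped.
with open("all.fastq", "r") as fastq, open("sub.fastq", "w") as subset:
    for i, line in enumerate(fastq):
        if i in num:
            # Only the first line of each chosen record is written here.
            subset.write(line)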
I have no idea how to reach the next three lines in the file before going to the next random integer. Can someone help me?
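A minimal sketch of one way to keep the four lines together: read the file one record (four lines) at a time with `itertools.islice` and decide per record whether to keep it. The function name `subsample_fastq` and the parameters `keep_prob` and `seed` are made up for illustration. The sketch assumes strict four-line records (no wrapped sequence lines), and it keeps an approximate fraction of the records rather than an exact number.

```python
import random
from itertools import islice


def subsample_fastq(in_path, out_path, keep_prob, seed=None):
    """Copy each 4-line FASTQ record to out_path with probability keep_prob.

    Hypothetical helper: assumes every record is exactly four lines.
    Returns the number of records written.
    """
    rng = random.Random(seed)
    kept = 0
    with open(in_path) as fastq, open(out_path, "w") as subset:
        while True:
            # Pull the next four lines from the file iterator: one record.
            record = list(islice(fastq, 4))
            if not record:
                break  # end of file
            if len(record) < 4:
                raise ValueError("truncated FASTQ record at end of file")
            # Keep or skip the whole record, never a single line.
            if rng.random() < keep_prob:
                subset.writelines(record)
                kept += 1
    return kept
```

For example, `subsample_fastq("all.fastq", "sub.fastq", keep_prob=0.1)` would keep roughly 10% of the records. Only one record is held in memory at a time, so the size of the input file does not matter.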