Each of my fastq files contains about 20 million reads (80 million lines, since each read spans 4 lines). Now I need to split the big fastq files into chunks of 1 million reads (4 million lines) each, for ease of further analysis. A fastq file is plain text, just like a .txt file.
My thought is to simply count lines and start a new output file after every 4 million lines (1 million reads). But the input file is gzip-compressed (fastq.gz); do I need to unzip it first?
How can I do this in Python?
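For reference, here is a minimal sketch of the counting approach described above. `gzip.open` decompresses on the fly, so there is no separate unzip step. The file names, numeric chunk labels, and tiny chunk size are placeholders for demonstration; set `LINES_PER_CHUNK = 4_000_000` for chunks of 1 million reads:

```python
import gzip

# Placeholders for illustration: a real run would point SRC at the actual
# fastq.gz and use LINES_PER_CHUNK = 4_000_000 (1 million reads x 4 lines).
SRC = "XXX.fastq.gz"
LINES_PER_CHUNK = 8  # 2 reads per chunk, kept small for the demo

# Create a tiny sample fastq.gz (6 reads = 24 lines) so the sketch runs standalone.
with gzip.open(SRC, "wt") as fh:
    for i in range(6):
        fh.write(f"@read{i}\nACGT\n+\nIIII\n")

# Stream the compressed input and start a new gzipped chunk every
# LINES_PER_CHUNK lines; no intermediate uncompressed file is written.
out = None
n_chunks = 0
with gzip.open(SRC, "rt") as fh:
    for lineno, line in enumerate(fh):
        if lineno % LINES_PER_CHUNK == 0:  # time to start a new chunk
            if out is not None:
                out.close()
            n_chunks += 1
            out = gzip.open(f"XXX_{n_chunks:03d}.fastq.gz", "wt")
        out.write(line)
if out is not None:
    out.close()

print(n_chunks)  # 24 lines / 8 lines per chunk -> 3 chunks
```

Because both input and output go through `gzip`, the chunks stay in valid .fastq.gz form throughout.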
I tried the following command:
zless XXX.fastq.gz |split -l 4000000 prefix
(decompress first, then split the file)
However, it doesn't seem to work with a prefix; I also tried "-prefix", and that doesn't work either. Also, the split command names its output like:
prefix-aa, prefix-ab, ...
If my prefix is XXX.fastq.gz, then the output will be XXX.fastq.gzaa, XXX.fastq.gzab, ..., which destroys the .fastq.gz extension.
So what I need is XXX_aa.fastq.gz, XXX_ab.fastq.gz, etc. (i.e. the chunk label before the extension). How can I do that?
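If the chunks are written from Python rather than by split itself, the same aa/ab-style labels can be generated and placed before the extension. A minimal sketch (the function name and the "XXX" prefix are mine, not part of any library):

```python
from itertools import product
from string import ascii_lowercase

def split_style_names(prefix, ext=".fastq.gz"):
    """Yield prefix_aa.fastq.gz, prefix_ab.fastq.gz, ...,
    mimicking split's two-letter aa/ab suffixes but keeping the extension last."""
    for pair in product(ascii_lowercase, repeat=2):
        yield f"{prefix}_{''.join(pair)}{ext}"

names = split_style_names("XXX")
print(next(names))  # XXX_aa.fastq.gz
print(next(names))  # XXX_ab.fastq.gz
```

As an aside, newer GNU coreutils versions of split also accept an `--additional-suffix` option that appends text after the generated aa/ab label, though whether it is available depends on the coreutils version installed.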