
Each of my FASTQ files contains about 20 million reads (80 million lines, since each FASTQ record spans four lines). Now I need to split the big FASTQ files into chunks of 1 million reads (4 million lines) each, for ease of further analysis. A FASTQ file is otherwise just plain text.

My thought is to just count lines and start writing to a new output file after every million reads. But the input file is gzip-compressed (fastq.gz); do I need to unzip it first?

How can I do this with Python?

I tried the following command:

zless XXX.fastq.gz | split -l 4000000 prefix

(i.e., decompress first, then split the file)

However, it doesn't seem to work with that prefix; I also tried "-prefix", and that doesn't work either. Also, with the split command the output is named like:

prefixaa, prefixab, ...

If my prefix is XXX.fastq.gz, then the output will be XXX.fastq.gzab, which destroys the .fastq.gz naming.

So what I need is XXX_aa.fastq.gz, XXX_ab.fastq.gz (i.e., the chunk label as a suffix before the extension). How can I do that?

LookIntoEast
  • It's not clear what you're asking here — can you be more specific? – David Wolever Aug 01 '11 at 20:50
  • 20M means 20 million? bytes? lines? – Janus Troelsen Aug 01 '11 at 20:51
  • How does Python come into play? Just use the `split` command. – pyroscope Aug 01 '11 at 20:52
  • Hi pyroscope, I tried split, but my input is in .gz form, and split doesn't seem to work on it. – LookIntoEast Aug 01 '11 at 20:56
  • @LeaTano: Your creation of a new tag and the tagging-spree spawned a meta-question: http://meta.stackoverflow.com/questions/288455/is-fastq-really-that-important Try a bit harder to make sure the tag you add is actually appropriate for the question, and try to clean up other things too. Yes, that will slow you down, but that's not necessarily a bad thing. – Deduplicator Mar 20 '15 at 22:06

3 Answers


As posted here

zcat XXX.fastq.gz | split -l 1000000 --additional-suffix=".fastq" --filter='gzip > $FILE.gz' - "XXX_"
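This pipes the decompressed stream straight into split: --filter='gzip > $FILE.gz' recompresses each chunk on the fly, and --additional-suffix=".fastq" is appended to each output name before the filter adds .gz, so the files come out as XXX_aa.fastq.gz, XXX_ab.fastq.gz, and so on. One caveat: -l counts lines, not reads, and a FASTQ record spans four lines, so -l 4000000 would give 1 million reads per chunk. (--filter and --additional-suffix are GNU coreutils extensions to split.)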
jimkont

...I need to unzip it first.

No, you don't, at least not by hand. The gzip module will let you open the compressed file, at which point you can read out a certain number of bytes and write them to a separate compressed file. See the examples at the bottom of the linked documentation to see how to both read and write compressed files.

import gzip

# infile, nslices and slicesize are placeholders for your own values
with gzip.open(infile, 'rb') as inp:
    for n in range(nslices):
        with gzip.open('slice%02d.gz' % n, 'wb') as outp:
            outp.write(inp.read(slicesize))
    else:  # only if you're not sure that you got the whole thing
        with gzip.open('slice_tail.gz', 'wb') as outp:
            outp.write(inp.read())

Note that gzip-compressed files are not random-accessible, so you will need to perform the operation in one pass unless you want to decompress the source file to disk first. Also bear in mind that slicing by byte count will cut lines (and therefore FASTQ records) in the middle; if the chunks must end on record boundaries, count lines instead, as the other answers do.

Ignacio Vazquez-Abrams

You can read a gzipped file just like an uncompressed file:

>>> import gzip
>>> for line in gzip.open('myfile.txt.gz', 'r'):
...   process(line)

The process() function would handle the specific line-counting and conditional processing logic that you mentioned.
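For example, here is a minimal sketch of that line-counting logic, producing gzipped chunks of 1 million reads each (the file names, the chunk size, and the four-lines-per-read bookkeeping are my assumptions, not part of the original answer):

import gzip

reads_per_chunk = 1000000
lines_per_chunk = 4 * reads_per_chunk  # one FASTQ record spans four lines

with gzip.open('XXX.fastq.gz', 'rt') as inp:
    outp, chunk = None, 0
    for count, line in enumerate(inp):
        if count % lines_per_chunk == 0:  # time to start the next chunk
            if outp:
                outp.close()
            outp = gzip.open('XXX_%02d.fastq.gz' % chunk, 'wt')
            chunk += 1
        outp.write(line)
    if outp:
        outp.close()

This numbers the chunks (XXX_00.fastq.gz, XXX_01.fastq.gz, ...) rather than lettering them the way split does, but it keeps the .fastq.gz extension intact.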

wberry