
Each of my FASTQ files contains about 20 million reads (80 million lines, since each FASTQ record spans four lines). Now I need to split the big FASTQ files into chunks of 1 million reads (4 million lines) each, for ease of further analysis. A FASTQ file is otherwise just plain text.

My thought is to just count lines and start writing to a new output file after every million reads. But the input file is gzip-compressed (fastq.gz); do I need to unzip it first?

How can I do this with Python?

I tried the following command:

zless XXX.fastq.gz | split -l 4000000 prefix

(i.e., decompress first, then split the file)

However, it doesn't seem to work with that prefix; I also tried "-prefix", and that doesn't work either. Also, with the split command the output is named like:

prefixaa, prefixab, ...

If my prefix is XXX.fastq.gz, then the output will be XXX.fastq.gzab, which destroys the .fastq.gz naming.

So what I need is XXX_aa.fastq.gz, XXX_ab.fastq.gz (i.e., the chunk label as a suffix before the extension). How can I do that?

LookIntoEast
  • It's not clear what you're asking here — can you be more specific? – David Wolever Aug 01 '11 at 20:50
  • 20M means 20 million? bytes? lines? – Janus Troelsen Aug 01 '11 at 20:51
  • How does Python come into play? Just use the `split` command. – pyroscope Aug 01 '11 at 20:52
  • Hi pyroscope, I tried split, but my input is in .gz form, and split doesn't seem to work on it. – LookIntoEast Aug 01 '11 at 20:56
  • @LeaTano: Your creation of a new tag and the tagging-spree spawned a meta-question: http://meta.stackoverflow.com/questions/288455/is-fastq-really-that-important Try a bit harder to make sure the tag you add is actually appropriate for the question, and try to clean up other things too. Yes, that will slow you down, but that's not necessarily a bad thing. – Deduplicator Mar 20 '15 at 22:06

3 Answers


As posted here

zcat XXX.fastq.gz | split -l 1000000 --additional-suffix=".fastq" --filter='gzip > $FILE.gz' - "XXX_"
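This pipes the decompressed stream straight into split: --filter='gzip > $FILE.gz' recompresses each chunk on the fly, and --additional-suffix=".fastq" is appended to each output name before the filter adds .gz, so the files come out as XXX_aa.fastq.gz, XXX_ab.fastq.gz, and so on. One caveat: -l counts lines, not reads, and a FASTQ record spans four lines, so -l 4000000 would give 1 million reads per chunk. (--filter and --additional-suffix are GNU coreutils extensions to split.)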
jimkont

...I need to unzip it first.

No, you don't, at least not by hand. The gzip module will let you open the compressed file, at which point you can read out a certain number of bytes and write them to a separate compressed file. See the examples at the bottom of the linked documentation to see how to both read and write compressed files.

import gzip

# infile, nslices and slicesize are placeholders for your own values
with gzip.open(infile, 'rb') as inp:
    for n in range(nslices):
        with gzip.open('slice%02d.gz' % n, 'wb') as outp:
            outp.write(inp.read(slicesize))
    else:  # only if you're not sure that you got the whole thing
        with gzip.open('slice_tail.gz', 'wb') as outp:
            outp.write(inp.read())

Note that gzip-compressed files are not random-accessible, so you will need to perform the operation in one pass unless you want to decompress the source file to disk first. Also bear in mind that slicing by byte count will cut lines (and therefore FASTQ records) in the middle; if the chunks must end on record boundaries, count lines instead, as the other answers do.

Ignacio Vazquez-Abrams

You can read a gzipped file just like an uncompressed file:

>>> import gzip
>>> for line in gzip.open('myfile.txt.gz', 'r'):
...   process(line)

The process() function would handle the specific line-counting and conditional processing logic that you mentioned.
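For example, here is a minimal sketch of that line-counting logic, producing gzipped chunks of 1 million reads each (the file names, the chunk size, and the four-lines-per-read bookkeeping are my assumptions, not part of the original answer):

import gzip

reads_per_chunk = 1000000
lines_per_chunk = 4 * reads_per_chunk  # one FASTQ record spans four lines

with gzip.open('XXX.fastq.gz', 'rt') as inp:
    outp, chunk = None, 0
    for count, line in enumerate(inp):
        if count % lines_per_chunk == 0:  # time to start the next chunk
            if outp:
                outp.close()
            outp = gzip.open('XXX_%02d.fastq.gz' % chunk, 'wt')
            chunk += 1
        outp.write(line)
    if outp:
        outp.close()

This numbers the chunks (XXX_00.fastq.gz, XXX_01.fastq.gz, ...) rather than lettering them the way split does, but it keeps the .fastq.gz extension intact.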

wberry