
I have several fastq files with 500,000,000 lines (125,000,000 sequences) on average. Is there a faster way to read these fastq files?

What I want to do is read each sequence, use its first 16 bases as the barcode, and then count the number of occurrences of each barcode in each file.

Here is my script which takes hours:

import os, errno
from Bio import SeqIO
import gzip
files = os.listdir(".")
for file in files[:]:
    if not file.endswith(".fastq.gz"):
        files.remove(file)

maps = {}
for file in files:
    print "Now Parsing file %s"%file
    maps[file] = {}
    with gzip.open(file,"r") as handle:
        recs = SeqIO.parse(handle,"fastq")
        for rec in recs:
            tag = str(rec.seq)[0:16]
            if tag not in maps[file]:
                maps[file][tag] = 1
            else:
                maps[file][tag] += 1

I have 250 GB of RAM and 20 CPUs that can be used for multithreading ...

thanks.

m.i.cosacak
  • Have you already benchmarked parsing the fastq file into Pandas as in [this question](https://stackoverflow.com/questions/19436789/biopython-seqio-to-pandas-dataframe)? If that were feasible, then I can think of several ways to simplify this process. – sjc Feb 13 '18 at 20:58

1 Answer

Untested, but here's a way you could do this in 'embarrassingly parallel' fashion:

import multiprocessing as mp
import os, errno
from Bio import SeqIO
import gzip

def ImportFile(file):
    """Count the 16-base barcodes in a single gzipped FASTQ file."""
    maps = {}
    with gzip.open(file, "r") as handle:
        recs = SeqIO.parse(handle, "fastq")
        for rec in recs:
            tag = str(rec.seq)[0:16]
            if tag not in maps:
                maps[tag] = 1
            else:
                maps[tag] += 1

    return {file: maps}


files = os.listdir(".")
for file in files[:]:
    if not file.endswith(".fastq.gz"):
        files.remove(file)

# I'd test this with smaller numbers before using up all 20 cores
pool = mp.Pool(processes=10)
output = pool.map(ImportFile,files)
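
To work with the results afterwards, the list of single-entry dicts returned by `pool.map` can be merged into one dict keyed by filename. A minimal sketch (the `counts_by_file` name is just illustrative):

# merge the per-file results from pool.map into one dict keyed by filename
counts_by_file = {}
for result in output:
    counts_by_file.update(result)

# e.g. report how many distinct barcodes were seen in each file
for fname, counts in counts_by_file.items():
    print("%s: %d distinct barcodes" % (fname, len(counts)))
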
sjc
  • thanks. This reads each file separately. However, reading one fastq file will still take 45-80 min. Is there a way to read a single fastq file with multiprocessing as well, to speed things up? – m.i.cosacak Feb 13 '18 at 21:20
  • This reads each file _in a separate process in parallel_, so, provided you have the RAM, it should only take (45-80 min) * (n_files / processes). If you structure it so that each single file is read in some parallel fashion, you still have to do that many times in series. – sjc Feb 13 '18 at 21:22
  • There might be, by using `recs` as the list you pass to `pool.map`, but then you have to deal with message passing between processes so you're not getting duplicate keys in your dict, and it may not be any faster than this for the task of loading information from _all_ of the files. – sjc Feb 13 '18 at 21:31
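
Following up on the comment thread: this is not from the answer above, but if the per-file time itself is the bottleneck, one common alternative is to skip the `SeqIO` record objects and read the gzipped file as plain text, taking the sequence line of every 4-line FASTQ record and counting with `collections.Counter`. A rough sketch, assuming Python 3 and well-formed 4-line FASTQ records (the `count_barcodes` name is illustrative):

import gzip
from collections import Counter
from itertools import islice

def count_barcodes(path):
    # Sketch only: assumes well-formed 4-line FASTQ records and Python 3's
    # gzip text mode ("rt"); counts the first 16 bases of every read.
    counts = Counter()
    with gzip.open(path, "rt") as handle:
        # the sequence is the 2nd line of each 4-line record
        for seq in islice(handle, 1, None, 4):
            counts[seq[:16]] += 1
    return {path: counts}

This could be dropped into the same `pool.map` call in place of `ImportFile`; how much it helps depends on how much of the 45-80 minutes is gzip decompression rather than record parsing.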