
I need to process two large files (> 1 billion lines each) and split each file into small files based on the information in specific lines of one of them. The files record high-throughput sequencing data in blocks (so-called sequencing reads), where each read consists of 4 lines (name, sequence, +, quality). The read records are in the same order in the two files.

to-do

split file1.fq based on the id field in file2.fq.

The two files look like this:

$ head -n 4 file1.fq
@name1_1
ACTGAAGCGCTACGTCAT
+
A#AAFJJJJJJJJFJFFF

$ head -n 4 file2.fq
@name1_2
TCTCCACCAACAACAGTG
+
FJJFJJJJJJJJJJJAJJ

I wrote the following Python function to do this job:

def p7_bc_demx_pe(fn1, fn2, id_dict):
    """Demultiplex PE reads, by p7 index and barcode"""
    # prepare writers for each small files
    fn_writer = {}
    for i in id_dict:
        fn_writer[i] = [open(id_dict[i] + '.1.fq', 'wt'),
            open(id_dict[i] + '.2.fq', 'wt')]

    # go through each record in two files
    with open(fn1, 'rt') as f1, open(fn2, 'rt') as f2:
        while True:
            try:
                s1 = [next(f1), next(f1), next(f1), next(f1)]
                s2 = [next(f2), next(f2), next(f2), next(f2)]
                tag = func(s2) # a function to classify the record
                fn_writer[tag][0].write(''.join(s1))
                fn_writer[tag][1].write(''.join(s2))
            except StopIteration:
                break
    # close writers
    for tag in fn_writer:
        fn_writer[tag][0].close()
        fn_writer[tag][1].close()
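As a rough illustration of one way to cut per-record overhead (an editor's sketch, not the original code): buffer the records for each tag in memory and write them out in large blocks, so the number of small write() calls drops sharply. Here classify is a hypothetical stand-in for the func() used above, and the function name and flush_every parameter are made up for the example.

from itertools import islice

def demx_pe_buffered(fn1, fn2, id_dict, classify, flush_every=100000):
    """Same job as above, but with per-tag output buffers (sketch)."""
    writers = {}
    buffers = {}
    for tag, prefix in id_dict.items():
        writers[tag] = [open(prefix + '.1.fq', 'wt'), open(prefix + '.2.fq', 'wt')]
        buffers[tag] = [[], []]

    def flush(tag):
        # write all accumulated records for one tag in a single call per file
        for k in (0, 1):
            writers[tag][k].write(''.join(buffers[tag][k]))
            buffers[tag][k].clear()

    with open(fn1, 'rt') as f1, open(fn2, 'rt') as f2:
        n = 0
        while True:
            s1 = list(islice(f1, 4))   # one 4-line record from each file
            s2 = list(islice(f2, 4))
            if not s1 or not s2:
                break
            tag = classify(s2)
            buffers[tag][0].append(''.join(s1))
            buffers[tag][1].append(''.join(s2))
            n += 1
            if n % flush_every == 0:
                for t in buffers:
                    flush(t)

    for tag in writers:
        flush(tag)
        writers[tag][0].close()
        writers[tag][1].close()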

Question

Is there any way to speed up this process? (The function above is much too slow.)

How about splitting the large file into chunks at specific line boundaries (e.g., with f.seek()) and running the process in parallel on multiple cores?
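As a sketch of the seek-based idea (an editor's illustration with hypothetical names, not code from the post): one pass over a file can record byte offsets that fall exactly on 4-line record boundaries, and worker processes can later open the file themselves and seek() straight to their chunk. Both files would have to be chunked by the same record counts so the paired reads stay aligned.

def record_offsets(path, records_per_chunk=1000000):
    """Return byte offsets that start every Nth 4-line record (sketch)."""
    offsets = [0]
    pos = 0
    lines = 0
    with open(path, 'rb') as f:      # binary mode so len(line) counts bytes
        for line in f:
            pos += len(line)
            lines += 1
            if lines % (4 * records_per_chunk) == 0:
                offsets.append(pos)
    return offsets

# Consecutive offsets delimit chunks: a worker seeks to the start offset and
# reads 4-line records until it reaches the next offset.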

EDIT-1

There are 500 million reads in each file (~180 GB in size). The bottleneck is reading and writing files. The following is my current solution (it works, but it is definitely not the best):

I first split the big file into smaller files using the shell command `split -l` (takes ~3 hours).

Then, I apply the function to the 8 smaller files in parallel (takes ~1 hour); a sketch of this step and the merge step is given below.

Finally, I merge the results (takes ~2 hours).
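For steps 2 and 3, a minimal sketch using multiprocessing.Pool and shutil (an editor's illustration: the chunk file names, the example id_dict, and the per-chunk output naming are hypothetical placeholders; p7_bc_demx_pe is the function shown above):

import shutil
from multiprocessing import Pool

# hypothetical placeholders: real values come from `split -l` and the real id_dict
id_dict = {'tagA': 'sampleA', 'tagB': 'sampleB'}
chunk_pairs = [('file1.part00.fq', 'file2.part00.fq'),
               ('file1.part01.fq', 'file2.part01.fq')]

def run_chunk(pair):
    fn1, fn2 = pair
    # give every chunk its own output prefix so workers never write to the same file
    local_ids = {tag: fn1 + '.' + prefix for tag, prefix in id_dict.items()}
    p7_bc_demx_pe(fn1, fn2, local_ids)
    return local_ids

if __name__ == '__main__':
    with Pool(processes=8) as pool:
        chunk_outputs = pool.map(run_chunk, chunk_pairs)
    # merge: append each chunk's per-tag output onto the final files
    for local_ids in chunk_outputs:
        for tag in id_dict:
            for end in ('.1.fq', '.2.fq'):
                with open(local_ids[tag] + end, 'rb') as src, \
                        open(id_dict[tag] + end, 'ab') as dst:
                    shutil.copyfileobj(src, dst)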

I have not tried PySpark yet; thanks @John H.

wm3
  • If the bottleneck of your program is in reading and writing to files, parallelizing it will not make it much faster unless your hardware supports parallel operations (e.g. RAID). – jdehesa Nov 05 '18 at 13:50
  • Yes, I think it is about reading and writing. Do you have any suggestions for this task? Right now, I split the large files into small files first, and process each small file separately in parallel. – wm3 Nov 05 '18 at 13:58
  • If you've got the resources, try copying the files to `/dev/shm`, perhaps? – UtahJarhead Nov 05 '18 at 15:08
  • I wrote an example of using the multiprocessing module to count lines in a large file. It may offer you some ideas for splitting up the work. https://stackoverflow.com/questions/845058/how-to-get-line-count-cheaply-in-python/6826326#6826326 – Martlark Nov 06 '18 at 04:38

1 Answer


Look into Spark. You can spread your file across a cluster for much faster processing. There is a Python API: pyspark.

https://spark.apache.org/docs/0.9.0/python-programming-guide.html

This also gives you the advantage of actually executing Java code, which doesn't suffer from the GIL, allowing for true multi-threading.
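For illustration only (an editor's sketch, not part of the original answer, with placeholder names): a minimal PySpark snippet that groups the lines of one FASTQ file back into 4-line records, after which a classify step analogous to func() could be applied. It does not handle keeping the two paired files in sync, which would still need care.

from pyspark import SparkContext

sc = SparkContext(appName='demux-sketch')

lines = sc.textFile('file2.fq')                  # one element per line
records = (lines.zipWithIndex()                  # (line, global line number)
                .map(lambda li: (li[1] // 4, (li[1] % 4, li[0])))
                .groupByKey()                    # group the 4 lines of each read
                .map(lambda kv: [l for _, l in sorted(kv[1])]))

# records is now an RDD of 4-line reads; tagging and per-tag output could be
# built on top with records.map(...).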

John R