How to optimize code for a file with 72 million entries?

Question

I have asked a question on another platform (here) - it would be great to get your input in order to make my Python code run in a very short time. Currently, it has been taking more than 3 hours for a file with millions of entries.

from Bio import SeqIO
import sys


def QIAseq_UMI_correction():
    script=sys.argv[0]
    file_name=sys.argv[1]
    dicts1 = {}
    dicts2 = {}
    lst = []
    with open(file_name, "r") as Fastq:
        for record in SeqIO.parse(Fastq,'fastq'):
            #print(record.id)
            #print(record.seq)
            #looking for the 3 prime adapter
            if "AACTGTAGGCACCATCAAT" in record.seq:
                adapter_pos = record.seq.find('AACTGTAGGCACCATCAAT')
                #Only record is used to be able to save the all atributes like phred score in the fastq file
                miRNAseq = record[:adapter_pos]
                adapter_seq=record[adapter_pos:adapter_pos+19]
                umi_seq = record[adapter_pos+19:adapter_pos+19+12]
                i = record.id
                x = miRNAseq.seq+umi_seq.seq
                #print((miRNAseq+umi_seq).format("fastq"))
                dicts1[i]=x

        #write ids and seq in a dictionary and keep one if there are multiple seqs with miRNA-seq + UMI
        for k,v in dicts1.items():
            if v not in dicts2.values():
                dicts2[k] = v

        #making a list
        for keys in dicts2:
            lst.append(keys)

    #print(dicts1)
    #print(dicts2)
    #print(lst)


    with open(file_name, "r") as Fastq:
        for record in SeqIO.parse(Fastq,'fastq'):
            #based on the saved ids in the list print the entries (miRNA + 3' adapter + UMI)
            if record.id in lst:
                adapter_pos = record.seq.find('AACTGTAGGCACCATCAAT')
                miRNAseq = record[:adapter_pos]
                adapter_seq=record[adapter_pos:adapter_pos+19]
                umi_seq = record[adapter_pos+19:adapter_pos+19+12]
                #print(record.seq)
                #print(miRNAseq.seq)
                #print(adapter_seq.seq)
                #print(umi_seq.seq)
                #print("@"+record.id)
                if len(miRNAseq.seq+adapter_seq.seq+umi_seq.seq) <= 50:
                    print((miRNAseq+adapter_seq+umi_seq).format("fastq"),end='')

                if len(miRNAseq.seq+adapter_seq.seq+umi_seq.seq) > 50:
                    cut = len(miRNAseq.seq+adapter_seq.seq+umi_seq.seq) - 50
                    print((miRNAseq+adapter_seq+umi_seq)[:-cut].format("fastq"), end='')

if __name__ == '__main__':
    QIAseq_UMI_correction()

Ask that question again here. Links to off-site questions can rot and make your question useless to future askers — Pranav Hosangadi, Feb 21 '22 at 20:12
Also, have you profiled your code? Can you narrow down which parts take the longest to run, and would affect your runtime the most if you optimized them? — Pranav Hosangadi, Feb 21 '22 at 20:12
this question is too open-ended as-written for the site - however, some solution shape like a database with schema useful to you or just sorting the file (such that you can look up by-index from a hashmap or directly bisect for it) is probably what you're after — ti7, Feb 21 '22 at 20:15
@PranavHosangadi I think the dicts1 and dicts2 are the ones causing this long time run. Also searching the ```record.id in last``` can cause it. the length of the dicts1 should be around 70 million (keys and values) thus this is a bottle neck. — Apex, Feb 21 '22 at 20:19
Do you need to run the for loop twice? Do you have a lot of ram to spare, possibly have the file in a ram drive, or even an ssd? Can you split the file into smaller files and run your code in parallel? — Roeften, Feb 21 '22 at 20:20
@PranavHosangadi What I generally do is roughly explained in the link provided in the question. — Apex, Feb 21 '22 at 20:20
@Roeften Unfortunately I am not a that advanced coder but yes my system is good - the system that I use is MacBook Pro (2019) core i9 with 32 GB of memory. — Apex, Feb 21 '22 at 20:24
@Apex you can start by removing the second reading/parsing of the file, that will save you 1.5 hours I expect ;-) You are reading and parsing and looping through 2 times. — Roeften, Feb 21 '22 at 20:30
@Roeften I removed it until the if statement but then the output is wrong - maybe I am missing something? — Apex, Feb 21 '22 at 20:31
@Roeften also I have been running the code on a file with 70 million entries for 3 hours now nad there has not been any output yet. I guess that the dictionaries are the bottlenecks — Apex, Feb 21 '22 at 20:32
Eliminate some of the guesswork — see [How can you profile a Python script?](https://stackoverflow.com/questions/582336/how-can-you-profile-a-python-script) — martineau, Feb 21 '22 at 21:21
If you want to squeeze even more speed, the I suggest to get rid of the `SeqIO` (As this creates two heavy objects (`Seq` and `SeqRecord`) for each entry), and use `SimpleFastaParser` from `Bio.SeqIO.FastaIO`. Since you are using only find and join, there should be no problem in altering the code.. — Marek Schwarz, Feb 22 '22 at 07:55
@MarekSchwarz actually I need to keep the fastq format because at the end of the day I need to have a fast format with all four lines per entry and then I can input that file to another tool for downstream analysis. Otherwise, I do agree that Fsta is much faster. — Apex, Feb 22 '22 at 09:22
Oh, sorry, I've misread fasta vs fastq. On the other hand, my point stands, as there is low-level `fastq` parser called `FastqGeneralIterator` (in `SeqIO.QualityIO`) which returns `ID, seq, qual' tuples. — Marek Schwarz, Feb 22 '22 at 12:45

Roeften · Answer 1 · 2022-02-21T21:29:49.067

Why not have one reading, parsing and looping of the file? I have moved the code of the second loop to the first, am I missing something? Why loop twice?

from Bio import SeqIO
import sys


def QIAseq_UMI_correction():
    script=sys.argv[0]
    file_name=sys.argv[1]
    dicts1 = {}
    dicts2 = {}
    lst = []
    sentinel = 100

    with open(file_name, "r") as Fastq:
        for record in SeqIO.parse(Fastq,'fastq'):

            # only for testing
            if sentinel < 0:
                break
            sentinel -= 1

            #print(record.id)
            #print(record.seq)
            #looking for the 3 prime adapter
            if "AACTGTAGGCACCATCAAT" in record.seq:
                adapter_pos = record.seq.find('AACTGTAGGCACCATCAAT')
                #Only record is used to be able to save the all atributes like phred score in the fastq file
                miRNAseq = record[:adapter_pos]
                adapter_seq=record[adapter_pos:adapter_pos+19]
                umi_seq = record[adapter_pos+19:adapter_pos+19+12]
                i = record.id
                x = miRNAseq.seq+umi_seq.seq
                #print((miRNAseq+umi_seq).format("fastq"))

                if x not in dicts2:
                    if len(miRNAseq.seq+adapter_seq.seq+umi_seq.seq) <= 50:
                        print((miRNAseq+adapter_seq+umi_seq).format("fastq"),end='')

                    if len(miRNAseq.seq+adapter_seq.seq+umi_seq.seq) > 50:
                        cut = len(miRNAseq.seq+adapter_seq.seq+umi_seq.seq) - 50
                        print((miRNAseq+adapter_seq+umi_seq)[:-cut].format("fastq"), end='')

                dicts1[i]=x
                dicts2[x]=i


if __name__ == '__main__':
    QIAseq_UMI_correction()

Other suggestions:

As mentioned in comments you could time major steps to see where time can be shortened. Check out timeit for example. My suggestion is to time the SeqIO.parse, if "AACTGTAGGCACCATCAAT" in record.seq:

I suspect that a major chunk of time is spent in parsing with SeqIO.parse so your should use this once ideally.

A final suggestion is to use a smaller set of records until you have your code ready with what you need it to do. I have added a sentinel variable as an example to break out of the loop when 100 matching records have been explored.

thank you for your input but I get this error ```for k,v in dicts1.items(): ^ SyntaxError: invalid syntax``` — Apex, Feb 21 '22 at 20:40
@Apex yeah sorry, I pasted in a wrong way. Fixed it, but check the formatting in any case. I only wanted to show you that if you move the print stuff to the first loop you can get rid of the second and possibly halve the time of execution. — Roeften, Feb 21 '22 at 20:44
now it works but prints all entries. With my own code what I do in the first part is that I do take IDs and sliced sequences for each entry and put them in a dict (dicts1) and with dicts2 I save only one ID for repeated sequences in dicts1 then I append those to the list and based on that list I go through the file again (the second part) and print out the entries in another sliced version. — Apex, Feb 21 '22 at 20:57
@Apex you can solve that in the same loop by checking the dict, check the updated code, if you can functionally do all tasks in one go then this will save you the most time in my opinion. — Roeften, Feb 21 '22 at 21:18
@Apex Now i see what you mean. Would the updated code work? That gets rid of more loops and list. — Roeften, Feb 21 '22 at 21:32
Thank you a lot for your help - I also changed the code by removing the second part as you did but then checking if ```x``` is not present in a dict then writes ```miRNAseq+adapter_seq+umi_seq``` in output. I only use one dict now and the code is already generating output :) -- I am happy to share the code with you if you want. Do you know where I can share it? — Apex, Feb 21 '22 at 21:49
@Apex Glad you got it sorted out. I am not into bio stuff but you can share your project on Github if you want others to use it. — Roeften, Feb 21 '22 at 21:54
I just pasted the code in another platform that I shared the link in the question. you can check it out :) — Apex, Feb 21 '22 at 21:57

kpie · Answer 2 · 2022-02-22T18:06:14.120

The things that I see on a first pass are these:

First where you check

if "AACTGTAGGCACCATCAAT" in record.seq:
    adapter_pos = record.seq.find('AACTGTAGGCACCATCAAT')

you can use the following to avoid searching through the sequence twice.

adapter_pos = record.seq.find('AACTGTAGGCACCATCAAT')
if adapter_pos != -1: # check that sequence is found

The lines:

i = record.id
x = miRNAseq.seq+umi_seq.seq
#print((miRNAseq+umi_seq).format("fastq"))
dicts1[i]=x

can be changed to:

dicts1[record.id]=miRNAseq.seq+umi_seq.seq

next the lines:

for k,v in dicts1.items():
    if v not in dicts2.values():
        dicts2[k] = v

can be changed to

dict2 = {**dict1,**dict2}

However the that change might actually come at a performance cost I'm not sure.

Next is that

for keys in dicts2:
    lst.append(keys)

can be deleted and

lst = list(dicts2.keys())

can be added outside and after the first pass through (where you have the commented out print statements.)

Finally as @Roeften suggests you can put the second chunk of code in with the first to avoid going through the whole file twice.

At least some of these suggestions will help but in reality python just isn't that fast of a language and if you want to do this sort of analysis regularly you might consider using something faster.

what does if adapter_pos != -1: means ? -1 is for not found ? — pippo1980, Feb 22 '22 at 17:49
thanks got it, made a small try by myself. If you add a notation to your code like if adapter_pos != -1: # check that sequence is found or something more correct, I'll be happy to upvote your answer, being a novice to Biopython such a notation could have helped to understand your example — pippo1980, Feb 22 '22 at 18:02

score 1 · Answer 3 · answered Feb 24 '22 at 09:50

1

I would use partition-method here to your own answer.

from Bio import SeqIO
import sys

def QIAseq_UMI_correction():
    script=sys.argv[0]
    file_name=sys.argv[1]
    dicts = {}
    lst = []

    with open(file_name, "r") as Fastq:
        for record in SeqIO.parse(Fastq,'fastq'):
            
            miRNAseq, adapter_seq, umi_seq = str(record.seq).partition("AACTGTAGGCACCATCAAT")

            if adapter_seq:
                x = miRNAseq.seq+umi_seq.seq
                y = miRNAseq.seq+adapter_seq.seq+umi_seq.seq
                #print((miRNAseq+umi_seq).format("fastq"))
                dicts[miRNAseq + umi_seq] = record.id

                    if len(y) <= 50:
                        print((miRNAseq+adapter_seq+umi_seq).format("fastq"),end='')

                    else:
                        cut = len(y) - 50
                        print((miRNAseq+adapter_seq+umi_seq)[:-cut].format("fastq"), end='')
if __name__ == '__main__':
    QIAseq_UMI_correction()

answered Feb 24 '22 at 09:50

Vovin

720
4
16

By the way, if it is not so necessary I would get rid of print because it takes a lot of time. – Vovin Feb 24 '22 at 09:53
Thank you for your answer - but the problem is that I need to save the slices inside ```record[]``` otherwise ```.seq``` will not operate. I tried to run your script and this error popped up ```AttributeError: 'str' object has no attribute 'seq'``` – Apex Feb 24 '22 at 10:39
@Apex Yes, I understand you. Unfortunately, there are a lot of string methods inside Seq class besides partition. So I would recommend to use Seq(sequence_string) before saving. I could answer clearer if you show your code, especially what you want to save specifically, and for what aims. – Vovin Feb 24 '22 at 10:47
the input file is in Fastq format (4 lines per entry) and in the output after slicing the second line I need to have the four lines per entry again. – Apex Feb 24 '22 at 10:49

score 0 · Accepted Answer · answered Feb 22 '22 at 11:04

I solved it by removing the second part - then checking if x is not present in the dicts.keys() (searching for dicts.values() was much slower and was the real bottleneck) then writing y in output. I only use one dict now and the code is already generating output. Here is the updated version.

from Bio import SeqIO
import sys

def QIAseq_UMI_correction():
    script=sys.argv[0]
    file_name=sys.argv[1]
    dicts = {}
    lst = []
    
    with open(file_name, "r") as Fastq:
        for record in SeqIO.parse(Fastq,'fastq'):
            
            if "AACTGTAGGCACCATCAAT" in record.seq:
                adapter_pos = record.seq.find('AACTGTAGGCACCATCAAT')
                miRNAseq = record[:adapter_pos]
                adapter_seq=record[adapter_pos:adapter_pos+19]
                umi_seq = record[adapter_pos+19:adapter_pos+19+12]
                i = record.id
                x = miRNAseq.seq+umi_seq.seq
                y = miRNAseq.seq+adapter_seq.seq+umi_seq.seq
                #print((miRNAseq+umi_seq).format("fastq"))
                if x not in dicts.keys():
                    dicts[x]=i

                    if len(y) <= 50:
                        print((miRNAseq+adapter_seq+umi_seq).format("fastq"),end='')
                    
                    if len(y) > 50:
                        cut = len(y) - 50
                        print((miRNAseq+adapter_seq+umi_seq)[:-cut].format("fastq"), end='')
if __name__ == '__main__':
    QIAseq_UMI_correction()

How to optimize code for a file with 72 million entries?

4 Answers4