
I'm new to Python, and especially new to multiprocessing/multithreading. I have had trouble following the documentation and finding a sufficiently similar example to work from.

The part that I am trying to divide among multiple cores is marked with *** in the code below; the rest is there for context. Three functions are defined elsewhere in the code: NextFunction(), QualFunction(), and PrintFunction(). I don't think what they do is critical to parallelizing this code, so I did not include their definitions.

Can you help me parallelize this?

So far, I've looked at https://docs.python.org/2/library/multiprocessing.html and the question "Python Multiprocessing a for loop". I've also tried the multithreading equivalents, and IPython.parallel as well.

The code is intended to pull data from a file, process it through a few functions and print it, checking for various conditions along the way.

The code looks like:

def main(arg, obj1Name, obj2Name):

    global dblen

    records   = fasta(refName)

    for i,r in enumerate(records):
        s = r.fastasequence
        idnt = s.name.split()[0]
        reference[idnt] = s.seq
        names[i] = idnt
        dblen += len(s.seq)
        if taxNm is None:
            taxid[idnt] = GetTaxId(idnt).strip()
    records.close()
    print >> stderr, "Read it"

    # read the taxids
    if taxNm is not None:
        with open(taxNm, "r") as taxFile:
            for line in taxFile:
                idnt,tax = line.strip().split()
                taxid[idnt] = tax

    File1 = pysam.Samfile(obj1Name, "rb")
    File2 = pysam.Samfile(obj2Name, "rb")

    # *** start of the part I want to parallelize ***
    for obj1s,obj2s in NextFunction(File1, File2):
        qobj1 = []
        qobj2 = []
        lobj1s = list(QualFunction(obj1s))
        lobj2s = list(QualFunction(obj2s))
        # keep every (obj1, obj2) combination that shares the same tid
        for obj1,ftrs1 in lobj1s:
            for obj2,ftrs2 in lobj2s:
                if (obj1.tid == obj2.tid):
                    qobj1.append((obj1,ftrs1))
                    qobj2.append((obj2,ftrs2))
        for obj,ftrs in qobj1:
            PrintFunction(obj, ftrs, "1")
        for obj,ftrs in qobj2:
            PrintFunction(obj, ftrs, "2")
    # *** end of the part I want to parallelize ***

    File1.close()
    File2.close()

It is called by:

if __name__ == "__main__":
    etc
  • Your example is a bit confusing. How many iterations will you be doing in the "***for obj1s, obj2s in NextFunction" line? Which part of the code is the bottleneck? – kezzos Aug 13 '15 at 19:17
  • Sorry, it doesn't look like the italic/bold formatting worked right. The part between the stars is the bottleneck. It iterates a variable number of times. NextFunction() contains a StopIteration that allows it to continue. Does that help? – nietzschemouse Aug 13 '15 at 19:30
  • 1
    I think you need to aim at creating a single list which contains all the data you want to offload to different processes and a target function which you want to run in these processes. You may then be able to use multiprocessing.Pool quite easily. For an example see http://stackoverflow.com/questions/20887555/dead-simple-example-of-using-multiprocessing-queue-pool-and-locking. However, if the data is large there will be a penalty for copying data to and from processes. – kezzos Aug 13 '15 at 19:41
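
A minimal sketch of the approach suggested in the last comment, under stated assumptions. It assumes each (obj1s, obj2s) pair can first be reduced to plain, picklable tuples of the form (name, tid, ftrs); that layout, and the names match_pair and match_all, are hypothetical and not from the question. Whether pysam's read objects can be sent to worker processes directly is something to check first, since everything passed to or returned from a worker must be picklable. Only the tid matching runs in the workers; printing stays in the parent process so the output order matches the serial loop.

```python
from multiprocessing import Pool


def match_pair(pair):
    """Worker: mirror the nested tid-matching loop for one pair of record lists.

    `pair` is (records1, records2); each record is a plain (name, tid, ftrs)
    tuple, a hypothetical picklable stand-in for the (obj, ftrs) entries.
    """
    records1, records2 = pair
    kept1, kept2 = [], []
    for rec1 in records1:
        for rec2 in records2:
            if rec1[1] == rec2[1]:  # same tid on both sides
                kept1.append(rec1)
                kept2.append(rec2)
    return kept1, kept2


def match_all(pairs, processes=2, chunksize=1):
    """Run match_pair over all pairs in a pool of worker processes.

    imap yields results in input order, so the caller can print them
    sequentially, just as the serial loop would.
    """
    pool = Pool(processes=processes)
    try:
        return list(pool.imap(match_pair, pairs, chunksize))
    finally:
        pool.close()
        pool.join()


if __name__ == "__main__":
    # Toy input standing in for what NextFunction/QualFunction would produce.
    pairs = [
        ([("a1", 0, "f"), ("a2", 1, "f")], [("b1", 1, "g")]),
        ([("c1", 2, "f")], [("d1", 3, "g")]),
    ]
    for kept1, kept2 in match_all(pairs):
        print(kept1)
        print(kept2)
```

A larger chunksize sends several pairs to a worker at once, which may reduce the copying overhead mentioned in the comment. If the per-pair work is small compared to the cost of pickling the data, the parallel version may well be slower than the plain loop.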
