2

I'm working on the human genome, which consists of 3.2 billion characters, and I have a list of objects which need to be searched for within this data. Something like this:

result_final=[]
objects=['obj1','obj2','obj3',...]

def function(obj):
    result_1=search_in_genome(obj)
    return(result_1)

for item in objects:
    result_2=function(item)
    result_final.append(result_2)

Each object's search within the data takes nearly 30 seconds, and I have a few thousand objects. I noticed that while doing this serially, just 7% of CPU and 5% of RAM is being used. From what I've read, to reduce the computation time I should do parallel computation using queuing, threading, or multiprocessing, but these seem complicated for non-experts. Could anybody show me how to write Python code that runs 10 simultaneous searches, and is it possible to make Python use the maximum available CPU and RAM for multiprocessing? (I'm using Python 3.3 on Windows 7 with 64 GB RAM and a 3.5 GHz Core i7 CPU.)

Masih
  • 920
  • 2
  • 19
  • 36
  • Use `concurrent.futures`, especially `concurrent.futures.as_completed`; there's an implementation on PyPI if you're using Python 2.7. – Apalala Jul 22 '14 at 01:11
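
A minimal sketch of the `concurrent.futures` approach mentioned in the comment above (assuming `search_in_genome` is defined at module level; the `objects` list is a placeholder):

from concurrent.futures import ProcessPoolExecutor, as_completed

def function(obj):
    # search_in_genome is assumed to exist elsewhere and do the actual work
    return search_in_genome(obj)

if __name__ == "__main__":
    objects = ['obj1', 'obj2', 'obj3']  # placeholder inputs
    result_final = []
    # max_workers=10 runs at most 10 searches simultaneously
    with ProcessPoolExecutor(max_workers=10) as executor:
        futures = [executor.submit(function, obj) for obj in objects]
        # as_completed yields each future as soon as its search finishes
        for future in as_completed(futures):
            result_final.append(future.result())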

2 Answers

3

You can use the multiprocessing module for this:

from multiprocessing import Pool

objects=['obj1','obj2','obj3',...]

def function(obj):
    result_1=search_in_genome(obj)
    return result_1


if __name__ == "__main__":
    pool = Pool()
    result_final = pool.map(function, objects)

This will allow you to scale the work across all available CPUs on your machine, because processes aren't affected by the GIL. You wouldn't want to run too many more tasks than there are CPUs available. Once you do that, you actually start slowing things down, because then the CPUs have to constantly switch between processes, which has a performance penalty.
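
To run exactly 10 simultaneous searches, as the question asks, you can pass a worker count to `Pool`. A minimal variation of the code above (assuming `search_in_genome` is defined at module level so the worker processes can find it):

from multiprocessing import Pool

def function(obj):
    # search_in_genome is assumed to be defined elsewhere in this module
    return search_in_genome(obj)

if __name__ == "__main__":
    objects = ['obj1', 'obj2', 'obj3']  # placeholder inputs
    # Pool() with no argument creates one worker per CPU core;
    # processes=10 caps it at 10 simultaneous searches instead.
    with Pool(processes=10) as pool:
        result_final = pool.map(function, objects)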

dano
  • 91,354
  • 19
  • 222
  • 219
  • By the way, is this method better than creating multiple threads as I do, or is that only the case for bytecode? – katze Jul 21 '14 at 14:24
  • @katze Because of the [Global Interpreter Lock (GIL)](http://stackoverflow.com/q/265687/2073595), only one Python thread can run CPU-bound operations at a time. The only time threads can truly run concurrently is if one of them is doing an I/O operation (reading from disk, listening to a socket, etc.). The OP is doing a search on objects in memory, which means there's no disk I/O going on, and the searching is CPU-bound. That means threads can't really parallelize the task. The `multiprocessing` module distributes the work between processes, so it isn't affected by the GIL. – dano Jul 21 '14 at 15:24
  • 1
    @katze Basically, if you need to parallelize I/O-bound work, threads are a good choice. If you need to parallelize CPU-bound work, you need to use multiple processes. – dano Jul 21 '14 at 15:25
  • @katze @dano Thanks for the answers. @dano since you are an expert, do you have any reference which shows the clear difference between I/O-bound work and CPU-bound work? – Masih Jul 21 '14 at 16:08
  • 1
    @user3015703 If you want to compare how threads perform vs. processes, you can literally change the first line of my example to this: `from multiprocessing.pool import ThreadPool as Pool`, and you'll have a thread pool instead of a process pool. You don't need to change anything else. Feel free to try both ways and see how performance is affected. – dano Jul 21 '14 at 16:13
  • @user3015703 You can try to time it, but usually it's just based on an eye test; if your thread is spending most of its time waiting for some data to come over a socket, does some brief processing of that data, and then waits again, it's I/O-bound. If it reads from a file and then does several seconds of processing on the contents of that file, it's CPU-bound. Generally, I would recommend using `multiprocessing` with Python if you're doing anything but completely trivial CPU operations. The GIL hurts performance, even for I/O-bound threads. – dano Jul 22 '14 at 03:42
  • 2
    @user3015703 If you're interested in learning more about how the GIL affects performance, I highly recommend watching [this video](http://pycon.blip.tv/file/3254256/). It's from a PyCon talk called "Understanding the GIL", and it contains a lot of really useful and interesting information about how the GIL affects the performance of CPU- and I/O-bound threads, and how the effect changes based on the number of threads and CPU cores in use. – dano Jul 22 '14 at 03:44
  • So how can one tell if some code is I/O-bound or CPU-bound work? I.e., how can I determine whether the speed of processing some data is greater than the speed of its I/O? For example, here: http://stackoverflow.com/questions/16199793/python-3-3-simple-threading-event-example , the time was reduced from 2 seconds to 0.5 using threading; how could one say that this is I/O-bound work? – Masih Jul 22 '14 at 03:46
  • @user3015703 The `do_work` method there is a `time.sleep(.1)` call, which releases the GIL - meaning it simulates I/O-bound work. A comment in the code even mentions this: `# With 4 threads should be about .5 seconds (contrived because non-CPU intensive "work") ` – dano Jul 22 '14 at 04:04
  • @dano Awesome to learn new things about Threads, CPU-bound and GIL thank you ! – katze Jul 22 '14 at 08:47
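
To see the distinction described in these comments, you can time a small thread pool on a sleep-based task (which releases the GIL, like I/O) versus a pure-Python CPU-bound loop. A rough sketch; the workload sizes are arbitrary:

import time
from multiprocessing.pool import ThreadPool

def io_like(_):
    # time.sleep releases the GIL, so the four calls overlap almost completely
    time.sleep(0.5)

def cpu_bound(_):
    # pure-Python arithmetic holds the GIL, so threads barely help here
    total = 0
    for i in range(5000000):
        total += i
    return total

if __name__ == "__main__":
    pool = ThreadPool(4)
    for func in (io_like, cpu_bound):
        start = time.time()
        pool.map(func, range(4))
        print(func.__name__, time.time() - start)
    pool.close()
    pool.join()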
0

Ok, I'm not sure I fully understand your question, but I would do this (note that there may be a better solution, because I'm not an expert with the Queue object):

If you want to multithread your searches:

import threading

class myThread(threading.Thread):

    def __init__(self, obj):
        threading.Thread.__init__(self)
        self.result = None
        self.obj = obj

    # Function that is called when you start your Thread
    def run(self):
        # Execute your function here
        self.result = search_in_genome(self.obj)


if __name__ == '__main__':

    result_final=[]
    objects=['obj1','obj2','obj3',...]

    # List of Threads
    listThread = []

    # Count the number of potential threads
    allThread = len(objects)
    allThreadDone = 0

    for item in objects:
        # Create one thread
        thread = myThread(item)
        # Launch that Thread
        thread.start()
        # Store it in the list
        listThread.append(thread)

    while True:
        for thread in listThread:
            # Count the Threads that have finished
            if thread.result != None:
                # If a Thread is finished, count it
                allThreadDone += 1

        # If all threads are finished, stop the program
        if allThreadDone == allThread:
            break
        # Otherwise, reset the counter and check again
        else:
            allThreadDone = 0

If someone can check and validate this code, that would be great. (Sorry for my English, by the way.)

katze
  • 1,273
  • 3
  • 16
  • 24
  • In here, how do you control the number of threads that are running at a time? Does it check the CPU power to calculate the number of threads which could be run at the same time? – Masih Jul 21 '14 at 13:29
  • To know if a thread is running, you can use thread.isAlive(), which returns True or False. About CPU power, I've never used it in my programs, but maybe you can check here: http://stackoverflow.com/questions/276052/how-to-get-current-cpu-and-ram-usage-in-python – katze Jul 21 '14 at 13:42
  • Don't use threads for this. In Python, only one thread can run bytecode at a time, so you won't get any performance boost by parallelizing CPU-bound tasks across threads. – dano Jul 21 '14 at 13:57
  • @katze When I say "bytecode" I just mean executing the instructions in compiled Python code. Unless the OP's program is primarily reading from disk or a DB or doing some other I/O-bound operation, it will be primarily executing bytecode. I guess it's possible I misinterpreted what the OP was saying the workers were doing, but I think they're doing CPU-bound tasks. – dano Jul 21 '14 at 15:29
  • @katze You should use `thread.join()` to wait for each thread to finish, rather than using the `while True:` loop you're using now: `for thread in listThread: thread.join()` – dano Jul 21 '14 at 15:30
  • Ok Thanks, I'll remember that for the future ;) – katze Jul 22 '14 at 08:44
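
For reference, the change dano suggests above amounts to replacing the polling loop with `join()`. A minimal sketch along those lines (assuming `search_in_genome` exists; the `objects` list is a placeholder):

import threading

class myThread(threading.Thread):
    def __init__(self, obj):
        threading.Thread.__init__(self)
        self.result = None
        self.obj = obj

    def run(self):
        # search_in_genome is assumed to be defined elsewhere
        self.result = search_in_genome(self.obj)

if __name__ == '__main__':
    objects = ['obj1', 'obj2', 'obj3']  # placeholder inputs

    listThread = [myThread(item) for item in objects]
    for thread in listThread:
        thread.start()

    # join() blocks until each thread has finished, so no polling loop is needed
    for thread in listThread:
        thread.join()

    result_final = [thread.result for thread in listThread]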