
I am trying to run this Python code on several threads of my processor, but I can't figure out how to allocate multiple threads. I am using Python 2.7 in Jupyter (formerly IPython). The initial code is below (all this part works perfectly). It is a web parser that takes x (a URL from my_list, a list of URLs) and then writes a line (out_string) to a CSV.

Code without multithreading

my_list = ['http://stackoverflow.com/', 'http://google.com']

def main():
    with open('Extract.csv', 'w') as out_file:
        count_loop = 0
        for x in my_list:
            #================  Get title ==================#
            out_string = ""
            campaign = parseCampaign(x)
            out_string += ';' + str(campaign.getTitle())

            #================ Get Profile ==================#
            if campaign.getTitle() != 'NA':
                creator = parseCreator(campaign.getCreatorUrl())
                out_string += ';' + str(creator.getCreatorProfileLinkUrl())
            else:
                pass
            #================ Write ==================#
            out_string += '\n'
            out_file.write(out_string) 
            count_loop +=1
            print '---- %s on %s ------- ' %(count_loop, len(my_list))

Code with multithreading, but not working

from threading import Thread
my_list = ['http://stackoverflow.com/', 'http://google.com']

def main(x):
    with open('Extract.csv', 'w') as out_file:
        count_loop = 0
        for x in my_list:
            #================  Get title ==================#
            out_string = ""
            campaign = parseCampaign(x)
            out_string += ';' + str(campaign.getTitle())

            #================ Get Profile ==================#
            if campaign.getTitle() != 'NA':
                creator = parseCreator(campaign.getCreatorUrl())
                out_string += ';' + str(creator.getCreatorProfileLinkUrl())
            else:
                pass
            #================ Write ==================#
            out_string += '\n'
            out_file.write(out_string) 
            count_loop +=1
            print '---- %s on %s ------- ' %(count_loop, len(my_list))

for x in my_list:
    t = Thread(target=main, args=(x,))
    t.start()
    t2 = Thread(target=main, args=(x,))
    t2.start()

I cannot find a good way to run this piece of code on more than one thread, and I am a bit confused because the documentation is not very easy to understand. With one core, this code takes 2 hours; multithreading would save me a lot of time!

SciPy
  • To run the loop faster. – SciPy Feb 23 '16 at 20:12
  • Here is a shorter version of my code; the real version takes 7 to 10 seconds per loop because there are a lot of requests (external API), so 10s * 12,000 URLs could be significantly improved if each core of my processor were used, i.e., core1 = 10s * 3000 URLs + core2 = 10s * 3000 URLs + core3 = 10s * 3000 URLs + core4 = 10s * 3000 URLs at the same time... – SciPy Feb 23 '16 at 20:16
  • When you say the multithreading version isn't working, what do you mean? Is there an error, or does it just not run any faster? Check this question; I don't think IPython uses multiple cores even with the threading module. http://stackoverflow.com/a/204150/5889975 – steven Feb 23 '16 at 20:26
  • Python is not tuned for multithreading, for a lot of reasons. Try `multiprocessing` or `asyncio`. – B. M. Feb 23 '16 at 20:26
  • Please tell us what "not working" means. – tdelaney Feb 23 '16 at 20:52

2 Answers


Well... if the answer to:

Why would you assign two threads for the same exact task?

is:

to run the loop faster

(see the comments on the original post), then something is pretty wrong here.

Dear OP, both of your threads will do exactly the same thing! Each one runs the whole of main() over the entire list, so the work is duplicated rather than divided.

What you can do is something like the following:

import multiprocessing

nb_cores = 2  # Put the correct amount

# The arguments to process, e.g. your list of URLs
arguments_list = ['http://stackoverflow.com/', 'http://google.com']

def do_my_process(this_argument):
    # Add the actual code that handles one argument
    pass

def main():

    # One worker process per core
    pool = multiprocessing.Pool(processes=nb_cores)

    # apply_async returns immediately; one task is queued per argument
    results_of_processes = [pool.apply_async(
        do_my_process,
        args=(an_argument, ),
        callback=None
    ) for an_argument in arguments_list]

    pool.close()  # no further tasks will be submitted
    pool.join()   # wait for all the workers to finish

if __name__ == '__main__':
    main()

Basically, you can think of each process/thread as having its own "mind". This means that, in your code, the first thread will run the process defined in main() for the argument x (taken from your iteration over your list), and the second one will run the same task, the one in main(), again for the same x.

What you need is to formulate your process as a procedure with a set of input parameters and a set of outputs. Then you can create multiple processes, give each of them one of the desired input parameters, and each process will execute your routine with its own parameter.

Hope it helps. Study the code above as well and I think you will understand it.

Also, see:

multiprocessing's Pool.map and its asynchronous variant Pool.map_async

and

functools.partial
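
For instance, a minimal sketch combining the two (the fetch_title function and its timeout argument are hypothetical stand-ins for your parsing code): functools.partial freezes the extra argument, so Pool.map can call the worker with a single item from the list.

import multiprocessing
from functools import partial

def fetch_title(timeout, url):
    # Hypothetical worker: parse one URL and return one output line.
    # 'timeout' is frozen by functools.partial below; 'url' is supplied
    # by Pool.map, one element of the list at a time.
    return url + '\n'  # placeholder for the real parsing logic

if __name__ == '__main__':
    urls = ['http://stackoverflow.com/', 'http://google.com']
    pool = multiprocessing.Pool(processes=2)
    # partial(fetch_title, 10) fixes timeout=10, leaving a one-argument
    # function, which is exactly what Pool.map expects
    results = pool.map(partial(fetch_title, 10), urls)
    pool.close()
    pool.join()
    print results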

Xxxo

OK, let's break down your problem.

First of all, your main() method processes all the inputs and writes the output to a file. When you run main in two threads, the same work is done by both threads. You need a method that processes only one input and returns the output for that input.

def process_x(x):
    #================  Get title ==================#
    out_string = ""
    campaign = parseCampaign(x)
    out_string += ';' + str(campaign.getTitle())

    #================ Get Profile ==================#
    if campaign.getTitle() != 'NA':
        creator = parseCreator(campaign.getCreatorUrl())
        out_string += ';' + str(creator.getCreatorProfileLinkUrl())
    else:
        pass
    #================ Write ==================#
    out_string += '\n'
    return out_string

Now you can call this method in multiple threads and get the output for each x separately.

from threading import Thread
my_list = ['http://stackoverflow.com/', 'http://google.com']
threads = list()
for x in my_list:
    t = Thread(target=process_x, args=(x,))
    threads.append(t)
    t.start()
for t in threads:
    t.join()  # wait for every thread to finish
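
Note that a bare Thread discards process_x's return value. A minimal sketch of one way to collect the results anyway (using the standard Queue module, the Python 2 spelling of queue):

from threading import Thread
from Queue import Queue  # 'queue' in Python 3

results = Queue()  # thread-safe container for the output lines

def worker(x):
    # Put the result on the queue instead of returning it
    results.put(process_x(x))

threads = [Thread(target=worker, args=(x,)) for x in my_list]
for t in threads:
    t.start()
for t in threads:
    t.join()

result_list = [results.get() for _ in threads]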

But the problem is that this starts n threads, where n is the number of elements in my_list. Using a multiprocessing.Pool is better here, so instead use

from multiprocessing import Pool
pool = Pool(processes=4)              # start 4 worker processes
result_list = pool.map(process_x, my_list)

result_list here will hold the results for the whole list, in the same order, so now you can save it to a file.

with open('Extract.csv', 'w') as out_file:
    out_file.writelines(result_list)
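
Putting it all together as a runnable sketch (process_x as defined above; note that multiprocessing code should sit behind an if __name__ == '__main__' guard, which is mandatory on Windows so that the worker processes do not re-execute the module's top level):

from multiprocessing import Pool

my_list = ['http://stackoverflow.com/', 'http://google.com']

if __name__ == '__main__':
    pool = Pool(processes=4)              # start 4 worker processes
    result_list = pool.map(process_x, my_list)
    pool.close()
    pool.join()
    with open('Extract.csv', 'w') as out_file:
        out_file.writelines(result_list)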
Muhammad Tahir
  • Map blocks execution. The asynchronous map is a better solution, and apply_async is better than both (IMHO). – Xxxo Feb 23 '16 at 21:12
  • @Kostas you are right, `Pool.map` blocks until it has processed the whole list of inputs. But here I don't think the OP will be doing anything between `apply_async` and saving the data to the file, so we don't need to care about the blocking of the `map` method. Moreover, `[pool.apply_async(f, x) for x in my_list]; pool.close(); pool.join()` is basically equal to `Pool.map` in terms of blocking, logically. – Muhammad Tahir Feb 23 '16 at 21:19
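
For reference, a minimal sketch of the non-blocking variant discussed in these comments (process_x and my_list as in the answer above): Pool.map_async returns an AsyncResult immediately, so other work can run before .get() collects the output.

from multiprocessing import Pool

if __name__ == '__main__':
    pool = Pool(processes=4)
    # map_async returns immediately with an AsyncResult handle
    async_result = pool.map_async(process_x, my_list)

    # ... other work can happen here while the workers run ...

    result_list = async_result.get()  # blocks only at collection time
    pool.close()
    pool.join()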