
I want to run two functions in parallel. These functions are executed many times in a loop. Here is my code:

import urllib

from BeautifulSoup import BeautifulSoup  # or: from bs4 import BeautifulSoup

#get the html content of the first rental
previous_url_rental=BeautifulSoup(urllib.urlopen(rentals[0]))

#for each rental on the page
for rental_num in xrange(1, len(rentals)):
    #get the html content of the page
    url_rental=BeautifulSoup(urllib.urlopen(rentals[rental_num]))
    #get and save the rental data in the csv file
    writer.writerow(get_data_rental(previous_url_rental))
    previous_url_rental=url_rental

#save last rental
writer.writerow(get_data_rental(previous_url_rental))

There are two main things:

1/ get the html content of a page: url_rental=BeautifulSoup(urllib.urlopen(rentals[rental_num]))

2/ retrieve and save the data from the html content of the previous page (not the current page, otherwise the two steps would depend on each other): writer.writerow(get_data_rental(previous_url_rental))

I would like to run these two lines in parallel: a first process would get the html content of page n+1 while a second process retrieves and saves the data of page n. I have searched and found this post so far: Python: How can I run python functions in parallel?. But I don't understand how to use it!

Thank you for your time.

rom
  • What you see there is multiprocessing, which is a long story. Besides, why do you have to do it in parallel? You can't write the row before you have retrieved the data. – aIKid Nov 11 '13 at 10:47
  • That's why I want to get the page `#n+1` and at the same time write the data of the page `#n`. Is that possible? – rom Nov 11 '13 at 10:50
  • How big is the data we're talking about? How many pages? – aIKid Nov 11 '13 at 10:54
  • and 50 variables to retrieve per rental (int or string) – rom Nov 11 '13 at 14:41

2 Answers


In order to run functions in parallel (i.e. on multiple CPUs) in Python, you need to use the multiprocessing module.

However, I doubt this is worth the effort for just two instances.

If you can run more than two processes in parallel, use the Pool class from that module; there is an example in the docs.

Each worker in the Pool would retrieve and save the data from one page, then fetch the next job to do. However, this isn't easy, as your writer must be able to handle multiple writes concurrently. So you may also need a Queue to serialize the writes: each worker would just retrieve pages, extract the information and send the result to the queue for the writer to handle.
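A minimal sketch of that approach, reusing the rentals list and get_data_rental() from the question (both assumed to be defined at module level). Here the Pool's internal result handling plays the role of the write queue: imap() hands the extracted rows back to the parent process in order, and the single csv writer there serializes all the writes.

from multiprocessing import Pool
import csv
import urllib

from BeautifulSoup import BeautifulSoup  # or: from bs4 import BeautifulSoup

def fetch_and_extract(url):
    # runs in a worker process: download one page and extract its rental data
    soup = BeautifulSoup(urllib.urlopen(url))
    return get_data_rental(soup)  # get_data_rental() comes from the question

if __name__ == '__main__':
    pool = Pool(processes=4)  # number of pages downloaded in parallel
    output = open('rentals.csv', 'wb')
    writer = csv.writer(output)
    # the parent process writes each row as soon as it comes back,
    # while the workers keep downloading the following pages
    for row in pool.imap(fetch_and_extract, rentals):
        writer.writerow(row)
    pool.close()
    pool.join()
    output.close()

This keeps all file writes in one process, so no locking is needed around the csv writer; the workers only return plain rows (ints and strings), which are cheap to pass between processes.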

Ber
  • I see, it seems complicated and might not save much time. I would like to try it anyway to see for myself. But I don't understand how to adapt the examples from the link you just shared. Can you apply those examples to my specific task or show me other examples? Thank you. – rom Nov 11 '13 at 11:18

Maybe the standard threading module of Python is interesting for you? Using a queue, as Ber says, seems a good idea to me.

This is how I use the threading library (without a Queue); you can expand it with a Queue if you want to:

#!/usr/bin/python

import threading
from threading import Thread
import time

fetch_stop = threading.Event()
process_stop = threading.Event()

def fetch_rental(arg1, stop_event):
    while not stop_event.is_set():
        #fetch content from url and add it to the Queue
        pass

def process_rental(arg1, stop_event):
    while not stop_event.is_set():
        #get item(s) from the Queue, process them, and write to the CSV
        pass


try:
    Thread(target=fetch_rental,   name="Fetch rental",   args=(2, fetch_stop  )).start()
    Thread(target=process_rental, name="Process rental", args=(2, process_stop)).start()
    while True:
        time.sleep(10) #wait here while the threads run
except:
    fetch_stop.set()
    process_stop.set()
    exit()

Now you can coordinate the two threads using Locks and Events (see the docs). When page #n has been downloaded, it can be added to a list or to the Queue; the second thread is then informed that a new page is ready to be processed.
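For instance, here is a possible Queue-based variant of the snippet above, again assuming the rentals list, the csv writer and get_data_rental() from the question are available. A sentinel value replaces the stop events, since the fetcher can simply tell the consumer when there are no more pages:

import threading
import urllib
import Queue  # named "queue" in Python 3

from BeautifulSoup import BeautifulSoup  # or: from bs4 import BeautifulSoup

page_queue = Queue.Queue(maxsize=5)  # bounded, so the fetcher cannot run far ahead

def fetch_rental(urls, queue):
    # producer: download each page and hand the parsed html to the consumer
    for url in urls:
        queue.put(BeautifulSoup(urllib.urlopen(url)))
    queue.put(None)  # sentinel: no more pages

def process_rental(queue, writer):
    # consumer: write the data of page n while page n+1 is being downloaded
    while True:
        page = queue.get()
        if page is None:
            break
        writer.writerow(get_data_rental(page))

fetcher = threading.Thread(target=fetch_rental, args=(rentals, page_queue))
worker = threading.Thread(target=process_rental, args=(page_queue, writer))
fetcher.start()
worker.start()
fetcher.join()
worker.join()

Since the work is mostly waiting on the network and on disk, threads are enough here, and the two functions can share the rentals list and the writer directly without the pickling that multiprocessing would require.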

LdeV