I have to make numerous (thousands) of HTTP GET requests to a great deal of websites. This is pretty slow, for reasons that some websites may not respond (or take long to do so), while others time out. As I need as many responses as I can get, setting a small timeout (3-5 seconds) is not in my favour.
I have yet to do any kind of multiprocessing or multi-threading in Python, and I've been reading the documentation for a good while. Here's what I have so far:
import requests
from bs4 import BeautifulSoup
from multiprocessing import Process, Pool
errors = 0
def get_site_content(site):
try :
# start = time.time()
response = requests.get(site, allow_redirects = True, timeout=5)
response.raise_for_status()
content = response.text
except Exception as e:
global errors
errors += 1
return ''
soup = BeautifulSoup(content)
for script in soup(["script", "style"]):
script.extract()
text = soup.get_text()
return text
sites = ["http://www.example.net", ...]
pool = Pool(processes=5)
results = pool.map(get_site_content, sites)
print results
Now, I want the results that are returned to be joined somehow. This allows two variation:
Each process has a local list/queue that contains the content it has accumulated is joined with the other queues to form a single result, containing all the content for all sites.
Each process writes to a single global queue as it goes along. This would entail some locking mechanism for concurrency checks.
Would multiprocessing or multithreading be the better choice here? How would I accomplish the above with either of the approaches in Python?
Edit:
I did attempt something like the following:
# global
queue = []
with Pool(processes = 5) as pool:
queue.append(pool.map(get_site_contents, sites))
print queue
However, this gives me the following error:
with Pool(processes = 4) as pool:
AttributeError: __exit__
Which I don't quite understand. I'm having a little trouble understanding what exactly pool.map does, past applying the function on every object in the iterable second parameter. Does it return anything? If not, do I append to the global queue from within the function?