
I'm new to Python multiprocessing. I don't quite understand the difference between Pool and Process. Can someone suggest which one I should use for my needs?

I have thousands of HTTP GET requests to send. After sending each one and getting the response, I want to store the response (a simple int) in a (shared) dict. My final goal is to write all the data in the dict to a file.

This is not CPU intensive at all. My only goal is to speed up sending the HTTP GET requests because there are so many of them. The requests are all isolated and do not depend on each other.

Shall I use Pool or Process in this case?

Thanks!

----The code below is added on 8/28---

I programmed with multiprocessing. The key challenges I'm facing are:

1) GET requests can fail sometimes. I have to set 3 retries to minimize the need to rerun my code/all requests. I only want to retry the failed ones. Can I achieve this with async HTTP requests without using Pool?

2) I want to check the response value of every request, and have exception handling

The code below is simplified from my actual code. It is working fine, but I wonder if it's the most efficient way of doing things. Can anyone give any suggestions? Thanks a lot!

import time

import requests
from multiprocessing import Pool

def get_data(endpoint, get_params):
    response = requests.get(endpoint, params=get_params)
    if response.status_code != 200:
        raise Exception("bad response for " + str(get_params))
    return response.json()

def get_currency_data(endpoint, currency, date):
    get_params = {'currency': currency,
                  'date': date
                  }
    for attempt in range(3):
        try:
            output = get_data(endpoint, get_params)
            # additional return value check
            # ......
            return output['value']
        except Exception:
            time.sleep(1)  # I found that sleeping for 1s almost always makes the retry succeed
    return 'error'

def get_all_data(currencies, dates):
    # I have many dates, but not too many currencies
    # endpoint is a module-level constant in my real code
    for currency in currencies:
        results = []
        pool = Pool(processes=20)
        for date in dates:
            results.append(pool.apply_async(get_currency_data, args=(endpoint, currency, date)))
        output = [p.get() for p in results]
        pool.close()
        pool.join()
        time.sleep(10)  # Unfortunately I have to give the server some time to rest; I found it helps reduce failures. I didn't write the server, so this is not something I can control.
abisko
    A Pool contains many processes, but allows you to interact with them as a single entity (e.g., to perform a map across a list). For your use case, however, you should consider asynchronous calls rather than multiprocessing, since you don't need the extra CPU cycles and you would avoid the overhead of launching and communicating with a bunch of processes. Consider the [following article](https://pawelmhm.github.io/asyncio/python/aiohttp/2016/04/22/asyncio-aiohttp.html) about solving exactly this kind of problem – scnerd Aug 23 '17 at 16:37
  • Please accept an answer (check the mark next to it) if it solves your problem – scnerd Aug 23 '17 at 18:59

1 Answer


Neither. Use asynchronous programming. Consider the code below, pulled directly from that article (credit goes to Paweł Miech):

#!/usr/local/bin/python3.5
import asyncio
from aiohttp import ClientSession

async def fetch(url, session):
    async with session.get(url) as response:
        return await response.read()

async def run(r):
    url = "http://localhost:8080/{}"
    tasks = []

    # Fetch all responses within one Client session,
    # keep connection alive for all requests.
    async with ClientSession() as session:
        for i in range(r):
            task = asyncio.ensure_future(fetch(url.format(i), session))
            tasks.append(task)

        responses = await asyncio.gather(*tasks)
        # you now have all response bodies in this variable
        print(responses)

def print_responses(result):
    print(result)

loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run(4))
loop.run_until_complete(future)

Just build an array of your URLs and, instead of the loop in the given code, iterate over that array and issue a fetch for each one.
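
For example, one way that change could look; a minimal sketch where the `urls` list is just a placeholder for your real endpoints:

import asyncio
from aiohttp import ClientSession

async def fetch(url, session):
    async with session.get(url) as response:
        return await response.read()

async def run(urls):
    # one shared session, one task per URL
    async with ClientSession() as session:
        tasks = [asyncio.ensure_future(fetch(u, session)) for u in urls]
        return await asyncio.gather(*tasks)

urls = ['http://localhost:8080/0', 'http://localhost:8080/1']  # placeholder URLs
loop = asyncio.get_event_loop()
responses = loop.run_until_complete(run(urls))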


EDIT: Use requests_futures

As per @roganjosh's comment below, requests_futures is a super-easy way to accomplish this.

from requests_futures.sessions import FuturesSession
sess = FuturesSession()
urls = ['http://google.com', 'https://stackoverflow.com']
responses = {url: sess.get(url) for url in urls}
contents = {url: future.result().content 
            for url, future in responses.items()
            if future.result().status_code == 200}
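
And since the end goal in the question is writing a dict of simple values to a file, here is a rough sketch of that last step; the `results.json` filename and the choice of `status_code` as the stored value are just placeholders:

import json
from requests_futures.sessions import FuturesSession

sess = FuturesSession()
urls = ['http://google.com', 'https://stackoverflow.com']
futures = {url: sess.get(url) for url in urls}

# store one simple value per URL, as described in the question
results = {url: future.result().status_code for url, future in futures.items()}

with open('results.json', 'w') as fh:   # hypothetical output file
    json.dump(results, fh)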

EDIT: Use grequests to support Python 2.7

You can also use grequests, which supports Python 2.7, for performing asynchronous URL calls.

import grequests
urls = ['http://google.com', 'http://stackoverflow.com']
responses = grequests.map(grequests.get(u) for u in urls)
print([len(r.content) for r in responses])
# [10475, 250785]
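
If you also need the retry-only-the-failures behaviour from the question, a rough sketch is below. It assumes grequests.map returns None for requests that raised (its default when no exception_handler is passed), which is worth verifying for your version:

import grequests

urls = ['http://google.com', 'http://stackoverflow.com']
results = {}
pending = list(urls)
for attempt in range(3):                       # up to 3 tries per URL
    responses = grequests.map(grequests.get(u) for u in pending)
    failed = []
    for url, resp in zip(pending, responses):
        if resp is not None and resp.status_code == 200:
            results[url] = len(resp.content)   # store whatever value you need
        else:
            failed.append(url)                 # retry this one next round
    pending = failed
    if not pending:
        break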

EDIT: Using multiprocessing

If you want to do this using multiprocessing, you can. Disclaimer: You're going to have a ton of overhead by doing this, and it won't be anywhere near as efficient as async programming... but it is possible.

It's actually pretty straightforward: you map the URLs through the HTTP GET function:

import requests
urls = ['http://google.com', 'http://stackoverflow.com']
from multiprocessing import Pool
pool = Pool(8)
responses = pool.map(requests.get, urls)

The size of the pool determines the number of simultaneously issued GET requests. Sizing it up should increase your network throughput, but it adds overhead on the local machine for communication and forking.

Again, I don't recommend this, but it certainly is possible, and if you have enough cores it's probably faster than doing the calls synchronously.
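
If you go this route, a small wrapper function keeps one bad URL from failing the whole map and returns something easy to collect into a dict. The wrapper below (safe_get) and its return shape are my own sketch, not part of requests or multiprocessing:

import requests
from multiprocessing import Pool

def safe_get(url):
    # return (url, status_code), or (url, None) if the request raised
    try:
        return url, requests.get(url).status_code
    except requests.RequestException:
        return url, None

if __name__ == '__main__':
    urls = ['http://google.com', 'http://stackoverflow.com']
    pool = Pool(8)                          # 8 simultaneous GET requests
    results = dict(pool.map(safe_get, urls))
    pool.close()
    pool.join()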

scnerd
  • Btw, [`requests_futures`](https://pypi.python.org/pypi/requests-futures) takes this logic and bundles it about as simply as `requests` does with `urllib`. Very handy library where 3 lines of code can get the job done :) – roganjosh Aug 23 '17 at 17:15
  • @roganjosh Oooh, I've seen that in passing, but I've never actually looked into using that module. Great suggestion, thanks! – scnerd Aug 23 '17 at 17:28
  • No worries, I was so happy when I discovered it so have to share! The very bottom of my answer to my own question [here](https://stackoverflow.com/questions/40872671/requests-grequests-is-the-connection-pool-is-full-discarding-connection-w) shows how it's used. I just realised there's a typo in the list comp; `fire_queries` should be `fire_requests`. `queries` is just a list of strings for the URLs. – roganjosh Aug 23 '17 at 17:32
  • I'm using Python 2.7. requests_futures or asyncio doesn't seem available. Are there any alternatives? Thanks guys!! – abisko Aug 23 '17 at 20:18
  • @Feiiiiiiiiiiiii Looks like Py 2.7 uses [Trollius](https://pypi.python.org/pypi/trollius) as an equivalent to asyncio, or [grequests](https://pypi.python.org/pypi/grequests) as an alternative to requests.async. That should solve what you're looking for. See updated answer. – scnerd Aug 23 '17 at 21:15
  • I have never programmed properly in python 3. `requests_futures` is, for sure, compatible with P2.7 without any non-standard imports. – roganjosh Aug 23 '17 at 21:31
  • I don't have requests_futures or grequests... I work in an organization and I can't install things easily. Are there any other alternatives? Thanks a lot guys – abisko Aug 23 '17 at 21:52
  • Async programming is based on the idea of issuing a request, then "yield"ing from that function, then getting the result and returning it. That way, when you call the function, the HTTP request is sent, but the function gives up control of its thread without waiting for the response. Thus, Python sees it as a generator. As a generator, you can call "next" on it, which returns the thread's execution to that function, retrieves the response, and returns it. You might be able to hack together your own minimal async library like this... open a new question if you want help doing that, though. – scnerd Aug 23 '17 at 22:00
  • I've added an example of doing this using multiprocessing. It answers the question, and if you don't have requests, just replace it with whatever HTTP module you do have to make an equivalent GET function. – scnerd Aug 23 '17 at 22:06
  • @scnerd Thank you very much for your answer. I just posted my code. The original question I posted is simplified. The actual code is more complex. Can you take a look and let me know if my code is the right way of doing things? Thanks a lot! – abisko Aug 28 '17 at 17:37
  • @Feiiiiiiiiiiiii You can always use a generic retry loop: make a set of items to be processed; process them; set the todo list to only those items that failed; as long as the todo list is not empty (and you haven't retried too many times), repeat. Your code seems fine to me (I'd pull the Pool out of the loop, so it's re-used between currencies, but that's just me), but stackoverflow isn't a code review forum. If you have further difficulties, please open new questions for them. – scnerd Aug 28 '17 at 17:51
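
A minimal sketch of that generic retry loop applied to the code in the question (get_currency_data, endpoint, currencies, and dates are the names from the question; the rest is an assumption about how you might wire it up):

from multiprocessing import Pool

def get_all_data(endpoint, currencies, dates):
    pool = Pool(processes=20)                  # one pool, reused for every currency
    results = {}
    todo = [(c, d) for c in currencies for d in dates]
    for attempt in range(3):                   # whole-batch retry limit
        async_results = {(c, d): pool.apply_async(get_currency_data, (endpoint, c, d))
                         for c, d in todo}
        failed = []
        for key, res in async_results.items():
            value = res.get()
            if value == 'error':               # get_currency_data signals failure with 'error'
                failed.append(key)
            else:
                results[key] = value
        todo = failed                          # next round only re-runs the failures
        if not todo:
            break
    pool.close()
    pool.join()
    return results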