
I am working with Python (IPython & Canopy) and a RESTful content API, on my local machine (Mac).

I have an array of 3000 unique IDs for which I need to pull data from the API, and I can only call the API with one ID at a time.

I was hoping somehow to make 3 sets of 1000 calls in parallel to speed things up.

What is the best way of doing this?

Thanks in advance for any help!

minrk
user7289
  • Do you consider using threads (a separate thread for each request)? – oleg Jun 07 '13 at 11:28
  • I am okay with that as long as it's the right option – I imagine the whole affair is 'embarrassingly parallelisable'... – user7289 Jun 07 '13 at 19:30

1 Answer


Without more information about what you are doing in particular, it is hard to say for sure, but a simple threaded approach may make sense.

Assuming you have a simple function that processes a single ID:

import requests

url_t = "http://localhost:8000/records/%i"

def process_id(id):
    """process a single ID"""
    # fetch the data
    r = requests.get(url_t % id)
    # parse the JSON reply
    data = r.json()
    # and update some data with PUT
    requests.put(url_t % id, data=data)
    return data

You can expand that into a simple function that processes a range of IDs:

def process_range(id_range, store=None):
    """process a number of ids, storing the results in a dict"""
    if store is None:
        store = {}
    for id in id_range:
        store[id] = process_id(id)
    return store

and finally, you can fairly easily map sub-ranges onto threads to allow some number of requests to be concurrent:

from threading import Thread

def threaded_process_range(nthreads, id_range):
    """process the id range in a specified number of threads"""
    store = {}
    threads = []
    # create the threads
    for i in range(nthreads):
        ids = id_range[i::nthreads]
        t = Thread(target=process_range, args=(ids,store))
        threads.append(t)

    # start the threads
    for t in threads:
        t.start()
    # wait for the threads to finish
    for t in threads:
        t.join()
    return store

A full example in an IPython Notebook: http://nbviewer.ipython.org/5732094

If your individual tasks take a more widely varied amount of time, you may want to use a ThreadPool, which assigns jobs one at a time (often slower if individual tasks are very small, but it guarantees better balance in heterogeneous cases).
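For example, here is a minimal sketch of that pool-based approach using `multiprocessing.pool.ThreadPool`; the doubling body of `process_id` is just a stand-in for the real API round-trip shown earlier, and `pooled_process` is an illustrative name, not part of the answer above:

```python
from multiprocessing.pool import ThreadPool

def process_id(record_id):
    # stand-in for the requests.get / requests.put round-trip above
    return record_id * 2

def pooled_process(nthreads, id_range):
    """process IDs with a pool that hands out jobs one at a time"""
    with ThreadPool(nthreads) as pool:
        # map blocks until every job has finished
        results = pool.map(process_id, id_range)
    return dict(zip(id_range, results))
```

Because the pool pulls the next ID as soon as a worker is free, a few slow requests no longer hold up an entire pre-assigned sub-range.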

minrk
  • Quick one: what does the :: do above? Why not just a single :? – user7289 Jun 08 '13 at 08:05
  • It means stride. When you specify a slice, there are three numbers: `start:stop:stride`. So `1::3` means every third element, starting with 1, i.e. `[1,4,7,...]`. This is just a simple way to equally partition a list. – minrk Jun 09 '13 at 09:35
  • So the double colon just means the stop is unspecified, and defaults to "the end". – minrk Jun 09 '13 at 09:36
  • @minrk – If I am not passing the id as a separate parameter and it's part of the JSON, how can I do that? – Ashu Jul 14 '19 at 15:46
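The `i::nthreads` slicing discussed in the comments can be checked directly; this short snippet (illustrative, not from the answer itself) shows that the stride slices partition a list into disjoint interleaved subsets, so each ID is handled by exactly one thread:

```python
ids = list(range(10))
nthreads = 3

# one interleaved slice per thread: start at i, step by nthreads
partitions = [ids[i::nthreads] for i in range(nthreads)]
print(partitions)  # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```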