
I am writing a `get` method that receives an array of ids and then makes a request for each id. The array can contain 500+ ids, and right now the requests take 20+ minutes. I have tried several different async approaches, such as aiohttp and asyncio, and neither of them has made the requests faster. Here is my code:

async def get(self):
    self.set_header("Access-Control-Allow-Origin", "*")
    story_list = []
    duplicates = []
    loop = asyncio.get_event_loop()
    ids = loop.run_in_executor(None, requests.get, 'https://hacker-news.firebaseio.com/v0/newstories.json?print=pretty')
    response = await ids
    response_data = response.json()
    print(response.text)
    for url in response_data:
        if url not in duplicates:
            duplicates.append(url)
            stories = loop.run_in_executor(
                None, requests.get,
                "https://hacker-news.firebaseio.com/v0/item/{}.json?print=pretty".format(url)
            )
            data = await stories
            if data.status_code == 200 and len(data.text) > 5:
                print(data.status_code)
                print(data.text)
                story_list.append(data.json())

Is there a way I can use multithreading to make the requests faster?

Demetrius

1 Answer


The main issue here is that the code isn't really async.

After getting your list of URLs, you fetch them one at a time, awaiting each response before requesting the next.

A better idea is to filter out the duplicates first (use a set), then queue all of the URLs in the executor and await them all at once, e.g.:

import asyncio
from concurrent.futures import ThreadPoolExecutor

import requests

async def get(self):
    self.set_header("Access-Control-Allow-Origin", "*")
    stories = []
    loop = asyncio.get_event_loop()
    # Single executor to share resources
    executor = ThreadPoolExecutor()

    # Get the initial set of ids
    response = await loop.run_in_executor(executor, requests.get, 'https://hacker-news.firebaseio.com/v0/newstories.json?print=pretty')
    response_data = response.json()
    print(response.text)

    # Putting them in a set will remove duplicates
    urls = set(response_data)

    # Build the set of futures (returned by run_in_executor) and wait for them all to complete
    responses = await asyncio.gather(*[
        loop.run_in_executor(
            executor, requests.get, 
            "https://hacker-news.firebaseio.com/v0/item/{}.json?print=pretty".format(url)
        ) for url in urls
    ])

    # Process the responses
    for response in responses:
        if response.status_code == 200 and len(response.text) > 5:
            print(response.status_code)
            print(response.text)
            stories.append(response.json())

    return stories
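
Alternatively, since you mentioned trying aiohttp: with a single aiohttp.ClientSession you can do the whole thing fully async, with no threads at all. This is a minimal sketch under that assumption (aiohttp installed, Python 3.7+); `fetch_json` and `get_stories` are illustrative names, and only the URLs come from the code above:

import asyncio
import aiohttp

BASE = "https://hacker-news.firebaseio.com/v0"

async def fetch_json(session, url):
    # Each request yields to the event loop while waiting on the network
    async with session.get(url) as response:
        if response.status == 200:
            return await response.json()
        return None

async def get_stories():
    async with aiohttp.ClientSession() as session:
        ids = await fetch_json(session, "{}/newstories.json".format(BASE))
        # A set removes duplicate ids before fanning out
        urls = {"{}/item/{}.json".format(BASE, i) for i in ids}
        responses = await asyncio.gather(*(fetch_json(session, u) for u in urls))
    # Drop failed or empty responses
    return [r for r in responses if r]

# stories = asyncio.run(get_stories())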
Tim
  • @Jack in which array? urls? you could change `for url in urls` to `for url in urls if url is not None` inside the comprehension to filter out None (null) entries (see the sketch after these comments). – Tim Jun 03 '19 at 09:42
  • I believe the default executor is a `ThreadPoolExecutor`, so you don't need to construct your own. - – martineau Jun 03 '19 at 09:46
  • 1
    @martineau by default, yes it is a `ThreadPoolExecutor`, however, as it is used twice constructing it outside of `run_in_executor` removes the delay of creating a second instance as well as the starting and shutting down of `os.cpu_count() * 5` threads. – Tim Jun 03 '19 at 09:53