
I know this question has been asked quite a few times, but please read on.

A little background on my problem:
I want to get the response for 10,000 URLs. The big problem, however, is the load time of each page: the server runs a heavy process for every request (~20-30 minutes). Of course I could do this synchronously and tell you next year how long it took ;) But that's not really an option, so I searched Stack Overflow and the internet and came up with some "answers".

What is the fastest way to send 100,000 HTTP requests in Python?

lib: None
approach: multi-threading

Even with only 200 requests my CPU reached 100% (maybe because of the long wait time?), so I concluded multithreading was not the way to go.
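For reference, this is roughly what a thread-based attempt looks like, as a minimal sketch using the stdlib ThreadPoolExecutor. The fetch function here is a stand-in for the real HTTP call, and the URLs are made up for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for the real HTTP call (e.g. requests.get(url).text);
    # each real call would block ~20-30 minutes waiting on the server.
    return "response for " + url

urls = ["http://example.com/{}".format(i) for i in range(5)]

# Threads spend most of their time blocked on I/O, so the GIL is not
# the limiting factor here; 100% CPU in practice points elsewhere.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch, urls))
```

Since the threads are almost always blocked waiting on the server, a thread pool should not by itself max out the CPU.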


Ideal method for sending multiple HTTP requests over Python? [duplicate]

lib: grequests
approach: async

"If a lot is really a lot then you probably want to use asynchronous IO, not threads. requests + gevent = grequests"

That sounded good to me, but grequests seems to be somewhat deprecated (read the comments on that answer if you are interested). At least I was now fairly sure I should go async instead of multi-threaded, so I kept searching.


Making 1 million requests with python-aiohttp

lib: asyncio + aiohttp
approach: async

This seemed like a good option, so I adjusted the code and tested it. The problems I ran into:

Slow
Up to ~200 requests I see prints like:

201. Thu, 28 Sep 2017 07:32:18 GMT:http://api.metagenomics.anl.gov/annotation/sequence/mgm4484960.3?source=RDP with delay None

but after that it takes very long for the rest to come in.

Not proportional
200 requests take ~2 hours, whereas a single request should only take 20-30 minutes.
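To check whether the client-side machinery (asyncio plus a semaphore) is the bottleneck, the pattern can be timed in isolation with simulated delays: with a concurrency limit of k, n tasks of duration d should finish in roughly ceil(n / k) * d, not n * d. A sketch with no network involved (all numbers here are illustrative):

```python
import asyncio
import time

async def fake_fetch(sem, delay):
    # Simulates one slow request without any network traffic.
    async with sem:
        await asyncio.sleep(delay)

async def run(n, limit, delay):
    sem = asyncio.Semaphore(limit)
    await asyncio.gather(*(fake_fetch(sem, delay) for _ in range(n)))

n, limit, delay = 20, 5, 0.05
start = time.perf_counter()
asyncio.run(run(n, limit, delay))  # Python 3.7+; use loop.run_until_complete on 3.6
elapsed = time.perf_counter() - start

# Expect roughly ceil(20 / 5) * 0.05 = 0.2 s, far below 20 * 0.05 = 1 s.
# If real requests scale much worse than this, the server is the bottleneck.
```

If the simulated version scales as expected but the real requests do not, the disproportionate runtime is coming from the server's capacity, not from the client code.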



Current code: it looks long judging by the scrollbar, but most of it is URLs ;)

import asyncio
import time
from aiohttp import ClientPayloadError
from aiohttp import ClientSession

COUNTER = 1

async def fetch(url, session):

    async with session.get(url) as response:
        delay = response.headers.get("DELAY")
        date = response.headers.get("DATE")
        global COUNTER
        COUNTER +=1
        print("{}. {}:{} with delay {}".format(str(COUNTER), date, response.url, delay))
        try:
            return await response.text()
        except ClientPayloadError:
            print("ERROR: {}".format(url))


async def bound_fetch(sem, url, session):
    # Getter function with semaphore.
    async with sem:
        await fetch(url, session)


async def run():
    urls = [build_url(id) for id in load_ids()]
    tasks = []
    # create instance of Semaphore
    sem = asyncio.Semaphore(1000)

    # Create client session that will ensure we dont open new connection
    # per each request.
    async with ClientSession(conn_timeout=10000, read_timeout=10000) as session:
        for url in urls:
            # pass Semaphore and session to every GET request
            task = asyncio.ensure_future(bound_fetch(sem, url, session))
            tasks.append(task)

        await asyncio.gather(*tasks)

def build_url(id):
    url = 'http://api.metagenomics.anl.gov/annotation/sequence/{}?source=RDP'.format(id)
    return url

def load_ids():
    #I will read a file here but added the important content to make it reproducible.
    return """
mgm4558908.3
mgm4484962.3
mgm4734169.3
mgm4558911.3
mgm4484983.3
mgm4558918.3
mgm4735453.3
mgm4735450.3
mgm4558931.3
mgm4734170.3
mgm4524851.3
mgm4485066.3
mgm4484952.3
mgm4484985.3
mgm4485062.3
mgm4484969.3
mgm4485024.3
mgm4485021.3
mgm4485061.3
mgm4485070.3
mgm4449604.3
mgm4510361.3
mgm4558912.3
mgm4485034.3
mgm4484996.3
mgm4524848.3
mgm4485035.3
mgm4734167.3
mgm4485001.3
mgm4735458.3
mgm4484948.3
mgm4508946.3
mgm4484973.3
mgm4485009.3
mgm4453148.3
mgm4485064.3
mgm4510364.3
mgm4547279.3
mgm4510360.3
mgm4484964.3
mgm4453150.3
mgm4548349.3
mgm4484975.3
mgm4558909.3
mgm4484945.3
mgm4734166.3
mgm4572199.3
mgm4485053.3
mgm4558915.3
mgm4485020.3
mgm4445996.3
mgm4484990.3
mgm4485052.3
mgm4485055.3
mgm4485029.3
mgm4558903.3
mgm4558924.3
mgm4735452.3
mgm4485392.3
mgm4484984.3
mgm4558937.3
mgm4485390.3
mgm4523756.3
mgm4485397.3
mgm4485044.3
mgm4484994.3
mgm4485399.3
mgm4485040.3
mgm4558904.3
mgm4558907.3
mgm4485037.3
mgm4485073.3
mgm4484981.3
mgm4525982.3
mgm4558897.3
mgm4547780.3
mgm4485022.3
mgm4524847.3
mgm4484959.3
mgm4558921.3
mgm4485067.3
mgm4484956.3
mgm4558902.3
mgm4558914.3
mgm4735454.3
mgm4453147.3
mgm4484991.3
mgm4484999.3
mgm4558935.3
mgm4485010.3
mgm4485393.3
mgm4558922.3
mgm4558938.3
mgm4508944.3
mgm4544122.3
mgm4485403.3
mgm4734172.3
mgm4735455.3
mgm4558926.3
mgm4558901.3
mgm4484982.3
mgm4485389.3
mgm4485018.3
mgm4558913.3
mgm4485398.3
mgm4734171.3
mgm4558939.3
mgm4485045.3
mgm4485058.3
mgm4572197.3
mgm4523758.3
mgm4484997.3
mgm4445993.3
mgm4484963.3
mgm4484989.3
mgm4485017.3
mgm4735448.3
mgm4485008.3
mgm4485395.3
mgm4547281.3
mgm4510366.3
mgm4485391.3
mgm4558900.3
mgm4558910.3
mgm4484957.3
mgm4558929.3
mgm4485014.3
mgm4485059.3
mgm4735459.3
mgm4525983.3
mgm4485032.3
mgm4485075.3
mgm4547285.3
mgm4547282.3
mgm4485016.3
mgm4484968.3
mgm4485030.3
mgm4484960.3
mgm4763691.3
mgm4558928.3
mgm4484987.3
mgm4734173.3
mgm4485012.3
mgm4484967.3
mgm4485054.3
mgm4485063.3
mgm4523757.3
mgm4485041.3
mgm4558919.3
mgm4734165.3
mgm4485069.3
mgm4484965.3
mgm4547280.3
mgm4484986.3
mgm4508945.3
mgm4523769.3
mgm4485004.3
mgm4524852.3
mgm4558923.3
mgm4485048.3
mgm4485049.3
mgm4485056.3
mgm4484980.3
mgm4734168.3
mgm4485031.3
mgm4558917.3
mgm4558898.3
mgm4735451.3
mgm4508948.3
mgm4484992.3
mgm4484950.3
mgm4485028.3
mgm4510365.3
mgm4485038.3
mgm4558936.3
mgm4485396.3
mgm4485050.3
mgm4510359.3
mgm4485036.3
mgm4558916.3
mgm4484966.3
mgm4485039.3
mgm4558934.3
mgm4445994.3
mgm4523754.3
mgm4484947.3
mgm4524849.3
mgm4484946.3
mgm4485015.3
mgm4524846.3
mgm4485043.3
mgm4476876.3
mgm4485033.3
mgm4524853.3
mgm4523770.3
mgm4485000.3
mgm4485025.3
mgm4453149.3
mgm4558940.3
mgm4484958.3
mgm4484988.3
mgm4485071.3
mgm4484995.3
mgm4485007.3
mgm4485005.3
mgm4735456.3
mgm4485013.3
mgm4484955.3
mgm4485065.3
mgm4558941.3
mgm4510367.3
mgm4485400.3
mgm4484970.3
mgm4558925.3
mgm4558932.3
mgm4485074.3
mgm4484961.3
mgm4484976.3
mgm4572123.3
mgm4558920.3
mgm4548350.3
mgm4485047.3
mgm4558927.3
mgm4485401.3
mgm4484978.3
mgm4485027.3
mgm4547284.3
mgm4559623.3
mgm4503838.3
mgm4485019.3
mgm4484951.3
mgm4485002.3
mgm4484972.3
mgm4485394.3
mgm4572198.3
mgm4558933.3
mgm4523755.3
mgm4484949.3
mgm4558905.3
mgm4524850.3
mgm4558930.3
mgm4510363.3
mgm4485003.3
mgm4484979.3
mgm4485072.3
mgm4484971.3
mgm4485046.3
mgm4485068.3
mgm4485060.3
mgm4476877.3
mgm4484974.3
mgm4735449.3
mgm4485051.3
mgm4558906.3
mgm4735457.3
mgm4525984.3
mgm4503837.3
mgm4445990.3
mgm4508947.3
mgm4485026.3
mgm4485057.3
mgm4484993.3
mgm4510362.3
mgm4485023.3
mgm4485042.3
mgm4484954.3
mgm4485402.3
mgm4484953.3
mgm4523771.3
mgm4485006.3
mgm4572122.3
mgm4547283.3
mgm4558899.3
mgm4524854.3
""".strip().split()



start = time.time()  # time.clock() measures CPU time, not wall time
loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run())
loop.run_until_complete(future)
run_time = (time.time() - start) / 60  # end minus start, otherwise the result is negative
print("this took: {} minutes".format(run_time))
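One robustness tweak for the code above: with 10,000 tasks, a single raised exception will propagate out of asyncio.gather and abort the whole batch. Passing return_exceptions=True collects failures alongside results. A minimal sketch with stand-in coroutines (no network; the URLs and failure condition are illustrative):

```python
import asyncio

async def fetch(url):
    # Stand-in for the real aiohttp call; fails for one URL to
    # demonstrate error collection.
    if url.endswith("bad"):
        raise ValueError("simulated failure for " + url)
    return "ok: " + url

async def run(urls):
    tasks = [fetch(u) for u in urls]
    # return_exceptions=True keeps one failure from cancelling the rest.
    return await asyncio.gather(*tasks, return_exceptions=True)

urls = ["http://example.com/a", "http://example.com/bad", "http://example.com/c"]
results = asyncio.run(run(urls))
failures = [r for r in results if isinstance(r, Exception)]
```

The results come back in the same order as the input URLs, with exceptions appearing in place of the missing responses.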

Finally, the question

What is the best/most efficient way (in 2017) to send a large number of parallel HTTP requests in Python?

CodeNoob
  • Slight aside, if a single request is that heavy that it takes the server 20-30 minutes to respond, how will it cope with thousands of async requests? – roganjosh Sep 28 '17 at 07:56
  • Well, I think that's the question hahah @roganjosh. But I don't understand what you are asking; it doesn't really matter how long it takes on the server side, does it? Besides the limit of requests the server will allow, of course – CodeNoob Sep 28 '17 at 08:06
  • If it takes that amount of time to process a request, you assume that it has either a CPU or memory-bound process running on the server to calculate a response. A server still has finite resources so even if you could send loads of requests, there's no guarantee it can respond to all in the same time; this is why multiple requests take 2 hours to respond to - it's not a failure of the async library, it's a limitation of how fast the server can actually process them. – roganjosh Sep 28 '17 at 08:13
  • Ah, thank you for explaining that. Is there any way to verify that it is caused on the server side? @roganjosh – CodeNoob Sep 28 '17 at 08:16
  • That I'm not sure about. Is it an internal server? – roganjosh Sep 28 '17 at 09:39

1 Answer


There are several libraries in Python that allow for concurrency; however, they generally involve writing a fair amount of code. The following makes all 10,000 requests concurrently in one line. You can also set the time_rest parameter to pause between requests (in case you're worried about overloading the server). The function returns a dictionary with the URLs as keys and their respective responses as values.

import pygrab

# list_of_urls: your list of the 10,000 URLs
res_dict = pygrab.get_async(list_of_urls, time_rest=0)
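Note that pygrab is a third-party package; if an extra dependency is unwanted, a similar URL-to-response dictionary can be sketched with the standard library alone (fetch here is a placeholder for the real request, and the URLs are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(url):
    # Placeholder for the real HTTP call (e.g. requests.get(url).text).
    return "body of " + url

def get_all(urls, max_workers=10):
    # Maps each URL to its response, running the requests concurrently.
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, u): u for u in urls}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    return results

res_dict = get_all(["http://example.com/1", "http://example.com/2"])
```

Using as_completed means slow responses don't block the collection of fast ones; the dictionary fills in completion order but ends up keyed by URL either way.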