
I need to get the HTTP GET response from the top 1 million domains, and I want to open as many concurrent threads as possible so I can finish faster. The only relevant post I found is What is the fastest way to send 100,000 HTTP requests in Python?, and the solution there, which uses concurrent.futures, works as expected.

However, the problem is that as I set the number of workers higher, the performance gain seems to stagnate, i.e., I see no difference between setting the number of workers to 1,000 or to 10,000. I run it on a paid EC2 instance and I can see I am only using a fraction of the available CPU and memory. Not sure what is happening: is there a limit on how many concurrent threads I can create? Can I override that limit?
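
Roughly, the approach I am using follows that solution; here is a simplified sketch (the URL sample, worker count, and timeout below are placeholders, not my real values):

from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

# placeholder sample; the real run iterates over the top 1 million domains
urls = ['https://www.google.com', 'https://www.example.com']

def fetch(url):
    try:
        return url, requests.get(url, timeout=5).status_code
    except requests.RequestException as exc:
        return url, exc

with ThreadPoolExecutor(max_workers=1000) as pool:
    futures = [pool.submit(fetch, url) for url in urls]
    for future in as_completed(futures):
        print(future.result())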

SamTest

1 Answer


I find there isn't much difference between urllib3 and requests (requests might be a shade faster). I would use an async library since this is a prime use case.

from gevent import monkey, spawn, joinall
monkey.patch_all()  # patch sockets before urllib3 is imported so requests cooperate with gevent
import urllib3, certifi
from time import time

threads = []
url = 'https://www.google.com'
upool = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where(),
                            num_pools=20, block=False)

t0 = time()
# spawn one greenlet per request; each finished greenlet's .value holds the HTTPResponse
for i in range(10000):
    threads.append(spawn(upool.request, 'GET', url))

x = joinall(threads)

print(len(x))
print(time() - t0)

Notice that you can cap the number of connections used at once by setting block to True on the PoolManager (together with maxsize).
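
For example, a capped pool might look like this (the maxsize value is just illustrative):

capped_pool = urllib3.PoolManager(
    cert_reqs='CERT_REQUIRED',
    ca_certs=certifi.where(),
    num_pools=20,
    maxsize=100,   # illustrative cap; tune for your workload
    block=True,    # greenlets wait for a free connection instead of opening new ones
)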

* UPDATE FOR MULTIPROCESSING *

from gevent import monkey, spawn, joinall
monkey.patch_all()  # patch sockets before urllib3 is imported
import urllib3, certifi
from time import time
import gipc

worker = {}
num_threads = 1000

def fetch(num_threads, url, cpu):
    # runs in a child process: spawn num_threads greenlets and wait for them all
    print('starting {}'.format(cpu))
    threads = []
    upool = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where(),
                                num_pools=20, block=False)
    t0 = time()
    for i in range(num_threads):
        threads.append(spawn(upool.request, 'GET', url))
    x = joinall(threads)
    return x, time() - t0

def count_cpus():
    import multiprocessing
    cpus = multiprocessing.cpu_count()
    print(cpus)
    return cpus

def multicore(url):
    # start one gevent-aware child process per CPU, then wait for all of them
    global worker
    with gipc.pipe() as (r, w):  # the pipe is created here but not used to pass results back
        for cpu in range(count_cpus()):
            worker[str(cpu)] = gipc.start_process(target=fetch, args=(num_threads, url, cpu))
    for work in worker:
        worker[work].join()
    return worker

if __name__ == '__main__':
    multicore('https://www.google.com')

    for work in worker:
        print(worker[work])
eatmeimadanish
  • 10000 connections were done in 128.472000122 seconds. – eatmeimadanish Jul 18 '19 at 17:19
  • I am looking at this blog: https://pawelmhm.github.io/asyncio/python/aiohttp/2016/04/22/asyncio-aiohttp.html. It seems much faster than urllib, but less intuitive to comprehend (a rough sketch follows these comments). Is it at all possible to open 10,000 threads, let each of them send a request, and get 10,000 requests done instantly? – SamTest Jul 18 '19 at 17:52
  • Also, I ran your exact code on an EC2 t3.2xlarge with 8 vCPUs and 32G memory, and it throws a 'Device or resource busy' error. I can only do 5,000 and finish that in about 50 seconds. What monster machine are you using to run 10,000 at one time? – SamTest Jul 18 '19 at 17:56
  • Keep in mind this is still single-threaded. You could use multiprocessing on top of this to leverage more CPUs. – eatmeimadanish Jul 18 '19 at 18:23
  • Can you give me a hint how? I am not an expert multi-thread or multi-process programmer; I just want the job done. I know the clumsy way is to manually type and run the command twice in Linux, and then it indeed runs 2 times faster. But is there an intuitive and easy way to accomplish it with Python? – SamTest Jul 18 '19 at 18:31
  • I actually use this in production: gipc (https://github.com/jgehrcke/gipc) – eatmeimadanish Jul 18 '19 at 18:32
  • I gave an example – eatmeimadanish Jul 18 '19 at 19:19
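
Regarding the aiohttp approach linked in the comments above, a rough sketch along the lines of that blog post (the URL list and concurrency limit below are placeholders) could look like this:

import asyncio
import aiohttp

URLS = ['https://www.google.com'] * 100   # placeholder list of targets
LIMIT = 1000                              # cap on in-flight requests

async def fetch(session, url, sem):
    # the semaphore keeps at most LIMIT requests in flight at any moment
    async with sem:
        try:
            async with session.get(url) as resp:
                return url, resp.status
        except (aiohttp.ClientError, asyncio.TimeoutError) as exc:
            return url, exc

async def main():
    sem = asyncio.Semaphore(LIMIT)
    timeout = aiohttp.ClientTimeout(total=30)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        results = await asyncio.gather(*(fetch(session, url, sem) for url in URLS))
    print(len(results))

asyncio.run(main())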