
I have a list of about 100,000 URLs. They are all on the same domain, just under different subdirectories. What is the fastest way to check the status codes for these 100,000 URLs? I am currently making requests with threading and PycURL as shown below. How can I create threads more efficiently and make the web requests faster?

import pycurl
import certifi

from threading import Thread

def req(url, counter):
    try:
        curl = pycurl.Curl()
        curl.setopt(pycurl.CAINFO, certifi.where())
        curl.setopt(pycurl.WRITEFUNCTION, lambda x: None)  # discard the response body
        curl.setopt(pycurl.CONNECTTIMEOUT, 5)
        curl.setopt(pycurl.URL, url)
        curl.perform()
        print(f"Requests: {counter} | URL: {url} | Status Code: {curl.getinfo(pycurl.HTTP_CODE)}")
        curl.close()

    except pycurl.error:
        pass

with open("urllist.txt") as f:
    urls = f.read().splitlines()

counter = 0

for url in urls:
    counter += 1
    Thread(target=req, args=(url, counter)).start()

Additional note: a similar question was suggested, so I have linked it.

I actually tried this and it was very fast. It may be the fastest for the request part itself, but the initial preparation step takes a lot of time.

  • Why does it need to be the fastest way? That would require extensive research and testing. What would you consider fast enough for your purposes? – Peter Wood Dec 19 '20 at 10:26

1 Answer


You want to look into curl's multi interface, which does concurrent transfers on a single thread. Even at 100k requests you are I/O bound. Once you are using the multi interface, you can split the workload further, either across more threads as above or across separate processes (if you are on Linux, see xargs -P or GNU Parallel).
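
A minimal sketch of what that could look like with pycurl's CurlMulti object, reading the same urllist.txt as in the question. The MAX_CONCURRENT value of 200 is an assumption you would tune for your network and the target server:

import pycurl
import certifi

with open("urllist.txt") as f:
    urls = f.read().splitlines()

MAX_CONCURRENT = 200  # assumption: tune this to your network and the target server

multi = pycurl.CurlMulti()
queue = list(urls)
num_active = 0
counter = 0

while queue or num_active:
    # Top up the multi handle until MAX_CONCURRENT transfers are in flight.
    while queue and num_active < MAX_CONCURRENT:
        url = queue.pop()
        c = pycurl.Curl()
        c.setopt(pycurl.CAINFO, certifi.where())
        c.setopt(pycurl.WRITEFUNCTION, lambda x: None)  # discard the response body
        c.setopt(pycurl.CONNECTTIMEOUT, 5)
        c.setopt(pycurl.URL, url)
        c.url = url  # stash the URL on the handle for reporting later
        multi.add_handle(c)
        num_active += 1

    # Drive all active transfers without blocking.
    while True:
        ret, _ = multi.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break

    # Harvest finished transfers and free their handles.
    while True:
        num_msgs, ok_list, err_list = multi.info_read()
        for c in ok_list:
            counter += 1
            print(f"Requests: {counter} | URL: {c.url} | Status Code: {c.getinfo(pycurl.HTTP_CODE)}")
            multi.remove_handle(c)
            c.close()
            num_active -= 1
        for c, errno, errmsg in err_list:
            counter += 1
            print(f"Requests: {counter} | URL: {c.url} | Error: {errmsg}")
            multi.remove_handle(c)
            c.close()
            num_active -= 1
        if num_msgs == 0:
            break

    # Wait for network activity instead of busy-looping.
    multi.select(1.0)

This keeps a few hundred transfers in flight on one thread; if a single process cannot saturate your link, the same script can be run over shards of the URL list in separate processes as suggested above.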

Allan Wind