
I have a list of about 100,000 URLs. They are all on the same domain, just under different subdirectories. What is the fastest way to check the status codes for these 100,000 URLs? I am currently making requests with threading and PycURL as shown below. How can I create threads more efficiently and make the web requests faster?

import pycurl
import certifi

from threading import Thread

def req(url, counter):
    try:
        curl = pycurl.Curl()
        curl.setopt(pycurl.CAINFO, certifi.where())
        curl.setopt(pycurl.WRITEFUNCTION, lambda x: None)  # discard the response body
        curl.setopt(pycurl.CONNECTTIMEOUT, 5)
        curl.setopt(pycurl.URL, url)
        curl.perform()
        print(f"Requests: {counter} | URL: {url} | Status Code: {curl.getinfo(pycurl.HTTP_CODE)}")
        curl.close()

    except pycurl.error:
        pass

with open("urllist.txt") as f:
    urls = f.read().splitlines()

counter = 0

for url in urls:
    counter += 1
    Thread(target=req, args=(url, counter)).start()

Additional note: a similar question was suggested, so I have linked it.

I actually tried this and it was very fast. It may be the fastest for the request part itself, but the initial preparation step takes a lot of time.

  • Why does it need to be the fastest way? That would require extensive research and testing. What would you consider fast enough for your purposes? – Peter Wood Dec 19 '20 at 10:26

1 Answer


You want to look into curl's multi interface, which does concurrent transfers on a single thread. Even at 100k requests you are I/O bound. Once you are using the multi interface, you can split the workload further, either across more threads as above or across separate processes (if you are on Linux, see xargs -P or GNU Parallel).
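
A minimal sketch of what that could look like with pycurl's CurlMulti object, reading the same urllist.txt as in the question. The MAX_CONCURRENT value of 200 is an assumption you would tune for your network and the target server:

import pycurl
import certifi

with open("urllist.txt") as f:
    urls = f.read().splitlines()

MAX_CONCURRENT = 200  # assumption: tune this to your network and the target server

multi = pycurl.CurlMulti()
queue = list(urls)
num_active = 0
counter = 0

while queue or num_active:
    # Top up the multi handle until MAX_CONCURRENT transfers are in flight.
    while queue and num_active < MAX_CONCURRENT:
        url = queue.pop()
        c = pycurl.Curl()
        c.setopt(pycurl.CAINFO, certifi.where())
        c.setopt(pycurl.WRITEFUNCTION, lambda x: None)  # discard the response body
        c.setopt(pycurl.CONNECTTIMEOUT, 5)
        c.setopt(pycurl.URL, url)
        c.url = url  # stash the URL on the handle for reporting later
        multi.add_handle(c)
        num_active += 1

    # Drive all active transfers without blocking.
    while True:
        ret, _ = multi.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break

    # Harvest finished transfers and free their handles.
    while True:
        num_msgs, ok_list, err_list = multi.info_read()
        for c in ok_list:
            counter += 1
            print(f"Requests: {counter} | URL: {c.url} | Status Code: {c.getinfo(pycurl.HTTP_CODE)}")
            multi.remove_handle(c)
            c.close()
            num_active -= 1
        for c, errno, errmsg in err_list:
            counter += 1
            print(f"Requests: {counter} | URL: {c.url} | Error: {errmsg}")
            multi.remove_handle(c)
            c.close()
            num_active -= 1
        if num_msgs == 0:
            break

    # Wait for network activity instead of busy-looping.
    multi.select(1.0)

This keeps a few hundred transfers in flight on one thread; if a single process cannot saturate your link, the same script can be run over shards of the URL list in separate processes as suggested above.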

Allan Wind