I have a list of about 100,000 URLs. All are on the same domain, just with different subdirectories. What is the fastest way to check the status codes of all 100,000 URLs? I am currently making the requests with threading and PycURL as shown below. How can I create threads more efficiently and make the web requests faster?
import pycurl
import certifi
from threading import Thread

def req(url, counter):
    try:
        curl = pycurl.Curl()
        curl.setopt(pycurl.CAINFO, certifi.where())
        curl.setopt(pycurl.WRITEFUNCTION, lambda x: None)  # discard the response body
        curl.setopt(pycurl.CONNECTTIMEOUT, 5)
        curl.setopt(pycurl.URL, url)
        curl.perform()
        print(f"Requests: {counter} | URL: {url} | Status Code: {curl.getinfo(pycurl.HTTP_CODE)}")
        curl.close()
    except pycurl.error:
        pass

with open("urllist.txt") as f:
    urls = f.read().splitlines()

counter = 0
while True:  # re-check the whole list indefinitely
    for url in urls:
        counter += 1
        # One new thread per URL, with no upper bound on how many run at once
        Thread(target=req, args=(url, counter)).start()
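For comparison, I have been thinking about a bounded worker pool instead of one thread per URL. This is only a minimal sketch, assuming concurrent.futures.ThreadPoolExecutor with an arbitrary, untuned max_workers=50, and using pycurl.NOBODY to skip downloading response bodies:

import concurrent.futures
import pycurl
import certifi

def check(url):
    curl = pycurl.Curl()
    curl.setopt(pycurl.CAINFO, certifi.where())
    # HEAD-style request: fetch only the status line and headers, no body.
    # Some servers answer HEAD differently than GET; drop this if that matters.
    curl.setopt(pycurl.NOBODY, True)
    curl.setopt(pycurl.CONNECTTIMEOUT, 5)
    curl.setopt(pycurl.URL, url)
    try:
        curl.perform()
        return url, curl.getinfo(pycurl.HTTP_CODE)
    except pycurl.error:
        return url, None  # DNS failure, connection error, timeout, ...
    finally:
        curl.close()

with open("urllist.txt") as f:
    urls = f.read().splitlines()

# Fixed-size pool: at most 50 requests in flight instead of 100,000 threads.
with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
    for url, status in pool.map(check, urls):
        print(f"URL: {url} | Status Code: {status}")

The pool keeps the thread-creation cost constant and bounds memory use; the value 50 is just a guess and would need tuning against the server.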
Additional note: a similar question was suggested to me, so I have linked it.
I actually tried the approach from that question and it was very fast. It may well be the fastest in the request phase itself, but the preparatory step before the requests takes a lot of time.
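If the expensive part is the setup done before each request (creating a handle, loading the CA bundle, the TLS handshake), one idea might be to reuse one Curl handle per worker thread, so that cost is paid once per thread rather than once per URL. Since all the URLs are on the same domain, a reused handle should also let libcurl keep the connection alive between requests. A rough sketch of such a worker, meant as a drop-in replacement for check() in the pool above (threading.local and the 5-second timeout are my own choices here):

import threading
import pycurl
import certifi

local = threading.local()  # one private Curl handle per worker thread

def get_handle():
    curl = getattr(local, "curl", None)
    if curl is None:
        # Paid once per thread: handle creation and CA bundle setup.
        curl = pycurl.Curl()
        curl.setopt(pycurl.CAINFO, certifi.where())
        curl.setopt(pycurl.NOBODY, True)
        curl.setopt(pycurl.CONNECTTIMEOUT, 5)
        local.curl = curl
    return curl

def check(url):
    curl = get_handle()
    curl.setopt(pycurl.URL, url)  # only the URL changes between requests
    try:
        curl.perform()
        return url, curl.getinfo(pycurl.HTTP_CODE)
    except pycurl.error:
        return url, None

Is this kind of handle reuse the right way to cut down the preparatory time, or is there a better approach?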