
I'm building a script that scrapes a website for some information. I expect to be making around 10k GET requests, and I'm speeding it up using multithreading. Something like this:

import requests
from concurrent.futures import ThreadPoolExecutor

# urls is my list of ~10k URLs to fetch; N is the worker count I'm asking about
with ThreadPoolExecutor(max_workers=N) as pool:
    pool.map(requests.get, urls)

While I want my script to be fast, I also want to be polite to the website's servers (and not be mistaken for a DoS attempt). I found plenty of questions about how to multithread requests, but none on what a reasonable request rate actually is.
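For what it's worth, here's a rough sketch of how I'd throttle the pool if I had a target rate. The MAX_WORKERS and MIN_INTERVAL values and the polite_get helper are placeholders I made up; the right numbers are exactly what I'm unsure about:

import time
import threading
import requests
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 8      # placeholder; this is the number I'm asking about
MIN_INTERVAL = 0.1   # placeholder; at most ~10 request starts per second

lock = threading.Lock()
last_start = 0.0

def polite_get(url):
    # Space request starts at least MIN_INTERVAL apart across all workers.
    global last_start
    with lock:
        wait = last_start + MIN_INTERVAL - time.monotonic()
        if wait > 0:
            time.sleep(wait)
        last_start = time.monotonic()
    return requests.get(url, timeout=10)

urls = ["https://example.com/page"]  # placeholder for my ~10k URLs

with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    results = list(pool.map(polite_get, urls))

Sleeping while holding the lock deliberately serializes the throttle, so the spacing between requests holds globally no matter how many workers are running.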

In this situation, what's a typical limit for the number of concurrent requests (max_workers)?

user3932000
  • The relevant RFC (https://www.w3.org/Protocols/rfc2616/rfc2616-sec8.html) says not to use more than 2. – Ian McLaird Dec 28 '20 at 19:25
  • @IanMcLaird RFC 2616 was written in 1999, so I'm not sure how much weight that guideline holds today. – user3932000 Dec 28 '20 at 19:27
  • You are correct, and the RFC seems to have been revised since. This related question asks what limits browsers use, which might be a useful starting point: https://stackoverflow.com/questions/985431/max-parallel-http-connections-in-a-browser – Ian McLaird Dec 28 '20 at 19:42
  • @IanMcLaird Thank you! That is definitely useful. – user3932000 Dec 28 '20 at 21:50

0 Answers