
I'm building a script that scrapes a website for some information. I expect to be making around 10k GET requests, and I'm speeding it up using multithreading. Something like this:

import requests
from concurrent.futures import ThreadPoolExecutor

# urls is my list of ~10k URLs to fetch; N is the worker count I'm asking about
with ThreadPoolExecutor(max_workers=N) as pool:
    pool.map(requests.get, urls)

While I want my script to be fast, I also want to be polite to the website's servers (and not be mistaken for a DoS attempt). I found plenty of questions about how to multithread requests, but none on what a reasonable request rate actually is.
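For what it's worth, here's a rough sketch of how I'd throttle the pool if I had a target rate. The MAX_WORKERS and MIN_INTERVAL values and the polite_get helper are placeholders I made up; the right numbers are exactly what I'm unsure about:

import time
import threading
import requests
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 8      # placeholder; this is the number I'm asking about
MIN_INTERVAL = 0.1   # placeholder; at most ~10 request starts per second

lock = threading.Lock()
last_start = 0.0

def polite_get(url):
    # Space request starts at least MIN_INTERVAL apart across all workers.
    global last_start
    with lock:
        wait = last_start + MIN_INTERVAL - time.monotonic()
        if wait > 0:
            time.sleep(wait)
        last_start = time.monotonic()
    return requests.get(url, timeout=10)

urls = ["https://example.com/page"]  # placeholder for my ~10k URLs

with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    results = list(pool.map(polite_get, urls))

Sleeping while holding the lock deliberately serializes the throttle, so the spacing between requests holds globally no matter how many workers are running.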

In this situation, what's a typical limit for the number of concurrent requests (max_workers)?

user3932000
  • The relevant RFC (https://www.w3.org/Protocols/rfc2616/rfc2616-sec8.html) says not to use more than 2. – Ian McLaird Dec 28 '20 at 19:25
  • @IanMcLaird RFC 2616 was written in 1999, so I'm not sure how much weight that guideline holds today. – user3932000 Dec 28 '20 at 19:27
  • You are correct, and the RFC seems to have been revised since. This related question asks what limits browsers use, which might be a useful starting point: https://stackoverflow.com/questions/985431/max-parallel-http-connections-in-a-browser – Ian McLaird Dec 28 '20 at 19:42
  • @IanMcLaird Thank you! That is definitely useful. – user3932000 Dec 28 '20 at 21:50

0 Answers