Ending Requests Python

Question

I'm using a proxy service to cycle requests through different proxy IPs for web scraping. Do I need to build in functionality to end requests so as not to overload the web server I'm scraping?

import requests
from bs4 import BeautifulSoup
from urllib.parse import urlencode
import concurrent.futures

list_of_urls = ['https://www.example']
NUM_RETRIES = 3
NUM_THREADS = 5
def scrape_url(url):
    
    params = {'api_key': 'API_KEY', 'url': url}
   
    # send request to scraperapi, and automatically retry failed requests
    for _ in range(NUM_RETRIES):
        try:
            response = requests.get('http://api.scraperapi.com/', params=urlencode(params))
            if response.status_code in [200, 404]:
                ## escape for loop if the API returns a successful response
                break
        except requests.exceptions.ConnectionError:
            response = None
    ## parse data if 200 status code (successful response)
    if response is not None and response.status_code == 200:
        pass  ## do stuff

with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    executor.map(scrape_url, list_of_urls)

The question is not clear. – balderman, Sep 12 '21 at 09:36

t.abraham · Answer 1 · 2021-09-12T09:49:51.453

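Hi. If you are using a Requests session, it is most probably keeping the TCP connection alive after each request so that it can be reused. What you can do is create a session, set it up not to keep connections alive, and then proceed normally with your code. Very old versions of Requests (before 1.0) did this with s.config['keep_alive'] = False, but the config dictionary has since been removed, so in current versions you send a Connection: close header instead:

s = requests.Session()
s.headers['Connection'] = 'close'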

As discussed here, there really isn't such a thing as an HTTP connection. What httplib (http.client in Python 3) refers to as an HTTPConnection is really the underlying TCP connection, which doesn't know much about your requests at all. Requests abstracts that away, so you never see it.

A Requests session does in fact keep the TCP connection alive after your request. If you do want your TCP connections to close, you can configure the session not to use keep-alive.

Alternatively, for a one-off request, pass the header directly:

response = requests.get(url, headers={'Connection': 'close'})
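
To make the keep-alive point concrete, here is a minimal sketch that uses only the standard library (http.client is the Python 3 name for httplib) against a throwaway local server, so it does not touch any real site. The names CountingHandler and count_connections are made up for this illustration and are not part of Requests or of the original code. It sends two GET requests through one HTTPConnection object and counts how many TCP connections the server actually accepted.

```python
import http.client
import http.server
import threading


class CountingHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical handler that counts how many TCP connections it accepts."""

    protocol_version = "HTTP/1.1"  # HTTP/1.1 connections are persistent by default

    def setup(self):
        # setup() runs once per accepted TCP connection, not once per request
        super().setup()
        self.server.connection_count += 1

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        if self.close_connection:
            # The client sent "Connection: close"; confirm it in the response
            self.send_header("Connection", "close")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging


def count_connections(headers, num_requests=2):
    """Send num_requests GETs through one HTTPConnection; return TCP connections used."""
    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), CountingHandler)
    server.connection_count = 0
    threading.Thread(target=server.serve_forever, daemon=True).start()
    conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
    try:
        for _ in range(num_requests):
            conn.request("GET", "/", headers=headers)
            conn.getresponse().read()  # drain the body so the socket can be reused
        return server.connection_count
    finally:
        conn.close()
        server.shutdown()
        server.server_close()
```

With no extra headers, count_connections({}) should report a single TCP connection shared by both requests. With count_connections({'Connection': 'close'}) it should report two, because the connection is dropped after every response and a new one is opened for the next request.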

Updated version of your code

import concurrent.futures

import requests
from bs4 import BeautifulSoup

list_of_urls = ['https://www.example']
NUM_RETRIES = 3
NUM_THREADS = 5


def scrape_url(url):
    params = {'api_key': 'API_KEY', 'url': url}
    response = None
    # leaving the with-block closes the session and its pooled TCP connections
    with requests.Session() as s:
        # ask the server to close the TCP connection after each response
        s.headers['Connection'] = 'close'
        # send request to scraperapi, and automatically retry failed requests
        for _ in range(NUM_RETRIES):
            try:
                # requests URL-encodes the params dict itself
                response = s.get('http://api.scraperapi.com/', params=params)
            except requests.exceptions.ConnectionError:
                response = None
                continue
            if response.status_code in [200, 404]:
                ## escape for loop if the API returns a successful response
                break
    ## parse data if 200 status code (successful response)
    if response is not None and response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        ## do stuff with soup


with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    executor.map(scrape_url, list_of_urls)
