
My current code creates a separate Session object for every request, because each request goes through the module-level requests.get() function:

content_getters.py (the relevant part):

import requests


def get_page_content(link: str) -> bytes:
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; "
                             "Intel Mac OS X 10_11_6) "
                             "AppleWebKit/537.36 (KHTML, like Gecko) "
                             "Chrome/61.0.3163.100 Safari/537.36"}

    # requests.get() opens a new Session (and connection) on every call
    response = requests.get(link, headers=headers)

    if response.status_code != requests.codes.ok:
        raise ConnectionError(f"Page {link} returned status code "
                              f"{response.status_code}")

    # return the raw bytes; the caller decodes them if needed
    return response.content


def parse_single_page(link):
    content = get_page_content(link)
    # rest of very long function

main.py:

from concurrent.futures.thread import ThreadPoolExecutor

from content_getters import get_page_content, extract_links, parse_single_page

if __name__ == "__main__":
    MAX_THREADS = 30

    # get links
    html: str = get_page_content(
        "https://www.d20pfsrd.com/bestiary/bestiary-hub/monsters-by-cr/") \
        .decode("utf-8")

    links = extract_links(html)

    num_threads = min(MAX_THREADS, len(links))
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        # pages are fetched concurrently; executor.map() yields the results
        # in the same order as the input links
        results = list(executor.map(parse_single_page, links))

The requests docs (link) state that "if you’re making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase". I suppose that each of my separate calls to the .get() method creates its own Session object, so no connection is reused, and that sharing a Session could be faster.

Question: Is the Session object synchronous (sequential) for all requests made with it? Will I still get asynchronous requests if I use the same Session object for all threads in concurrent.futures.thread.ThreadPoolExecutor, instead of 1 Session per thread as I'm doing now?

qalis
  • This might help https://stackoverflow.com/questions/18188044/is-the-session-object-from-pythons-requests-library-thread-safe – Prajwal May 28 '21 at 08:34

1 Answer

In short, Session is not thread-safe; you can check the issue discussion on GitHub.
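If you want to stay with threads, one common workaround is to give each worker thread its own Session, so that connections are reused within a thread but no Session is ever shared between threads. A minimal sketch, assuming the hypothetical helper names get_session, fetch and fetch_all (none of them come from your code) and an arbitrary 10-second timeout:

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List

import requests

# Per-thread storage: every worker thread sees its own "session" attribute.
_thread_local = threading.local()


def get_session() -> requests.Session:
    """Return the Session owned by the calling thread, creating it on first use."""
    session = getattr(_thread_local, "session", None)
    if session is None:
        session = requests.Session()
        _thread_local.session = session
    return session


def fetch(link: str) -> bytes:
    """Download one page with the calling thread's own Session."""
    response = get_session().get(link, timeout=10)  # timeout value is an assumption
    response.raise_for_status()
    return response.content


def fetch_all(links: List[str], max_threads: int = 30) -> List[bytes]:
    """Fetch all links concurrently; results come back in input order."""
    if not links:
        return []
    num_threads = min(max_threads, len(links))
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        return list(executor.map(fetch, links))
```

With this, repeated requests from the same worker thread to the same host can reuse that thread's connection, while the requests still run concurrently across threads.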

For your case, I would highly recommend looking at asyncio and the aiohttp module, where you are free to pass a session around since everything runs in one thread. It also won't incur as much overhead as multithreading. As they say:

Use asyncio when you can, use threads when you must

The documentation on aiohttp.
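To give a rough idea of what that could look like, here is a minimal sketch that shares a single aiohttp.ClientSession across all requests. The function names fetch and fetch_all and the max_concurrent limit of 30 are my own assumptions (the limit mirrors your MAX_THREADS); parsing is left out:

```python
import asyncio
from typing import List

import aiohttp


async def fetch(session: aiohttp.ClientSession, link: str) -> bytes:
    """Download one page using the shared session."""
    async with session.get(link) as response:
        response.raise_for_status()
        return await response.read()


async def fetch_all(links: List[str], max_concurrent: int = 30) -> List[bytes]:
    """Fetch all links in one thread; results come back in input order."""
    # Cap the number of requests in flight (the limit is an assumption).
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded_fetch(session: aiohttp.ClientSession, link: str) -> bytes:
        async with semaphore:
            return await fetch(session, link)

    # One session for everything: safe here because all coroutines
    # run in the same thread on the same event loop.
    async with aiohttp.ClientSession() as session:
        tasks = (bounded_fetch(session, link) for link in links)
        return await asyncio.gather(*tasks)
```

You would call it from main.py with something like pages = asyncio.run(fetch_all(links)) and then parse each page as before.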

rawrex
  • Very interesting, I completely forgot about the asyncio, thanks! – qalis May 28 '21 at 08:47
  • @qalis it is awesome! It may have a bit of a learning curve, but it is totally worth it. I would suggest reading this [article](https://realpython.com/async-io-python/) before the official documentation, which is quite verbose. – rawrex May 28 '21 at 08:50