I'm trying to delete a lot of files in s3. I am planning on using a multiprocessing.Pool for doing all these deletes, but I'm not sure how to keep the s3 client alive between jobs. I want to do something like this:
import boto3
import multiprocessing as mp

def work(key):
    s3_client = boto3.client('s3')
    s3_client.delete_object(Bucket='bucket', Key=key)

with mp.Pool() as pool:
    pool.map(work, lazy_iterator_of_billion_keys)
But the problem with this is that a significant amount of time is spent on s3_client = boto3.client('s3') at the start of each job. The documentation says to make a new resource instance for each process, so I need a way to make an s3 client for each process.
Is there any way to make a persistent s3 client for each process in the pool or cache the clients?
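To be concrete, something like the following is what I mean by a persistent client per process (just a rough sketch of one pattern I've seen, using a Pool initializer and a module-level global; init_worker is only an illustrative name):

import boto3
import multiprocessing as mp

# per-process global, set once in each worker by the initializer
s3_client = None

def init_worker():
    global s3_client
    s3_client = boto3.client('s3')

def work(key):
    # reuse the client created for this worker process
    s3_client.delete_object(Bucket='bucket', Key=key)

with mp.Pool(initializer=init_worker) as pool:
    pool.map(work, lazy_iterator_of_billion_keys)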
Also, I am planning on optimizing the deletes by sending batches of keys and using s3_client.delete_objects, but I showed s3_client.delete_object in my example for simplicity.
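For reference, the batched version I have in mind looks roughly like this (a sketch that reuses the per-process client from the snippet above; chunked is just an illustrative helper, and delete_objects takes at most 1000 keys per request):

from itertools import islice

def chunked(iterable, size=1000):
    # yield lists of up to `size` keys from the lazy iterator
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def work_batch(keys):
    # one request deletes up to 1000 objects
    s3_client.delete_objects(
        Bucket='bucket',
        Delete={'Objects': [{'Key': key} for key in keys]}
    )

with mp.Pool(initializer=init_worker) as pool:
    pool.map(work_batch, chunked(lazy_iterator_of_billion_keys))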