I'm trying to delete a lot of files in s3. I am planning on using a multiprocessing.Pool for doing all these deletes, but I'm not sure how to keep the s3 client alive between jobs. I want to do something like this:
import boto3
import multiprocessing as mp

def work(key):
    s3_client = boto3.client('s3')
    s3_client.delete_object(Bucket='bucket', Key=key)

with mp.Pool() as pool:
    pool.map(work, lazy_iterator_of_billion_keys)
But the problem with this is that a significant amount of time is spent on s3_client = boto3.client('s3') at the start of each job. The documentation says to make a new resource instance for each process, so I need a way to make an s3 client for each process.
Is there any way to make a persistent s3 client for each process in the pool or cache the clients?
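To be concrete, something like the following is what I mean by a persistent client per process (just a rough sketch of one pattern I've seen, using a Pool initializer and a module-level global; init_worker is only an illustrative name):

import boto3
import multiprocessing as mp

# per-process global, set once in each worker by the initializer
s3_client = None

def init_worker():
    global s3_client
    s3_client = boto3.client('s3')

def work(key):
    # reuse the client created for this worker process
    s3_client.delete_object(Bucket='bucket', Key=key)

with mp.Pool(initializer=init_worker) as pool:
    pool.map(work, lazy_iterator_of_billion_keys)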
Also, I am planning on optimizing the deletes by sending batches of keys and using s3_client.delete_objects, but I showed s3_client.delete_object in my example for simplicity.
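For reference, the batched version I have in mind looks roughly like this (a sketch that reuses the per-process client from the snippet above; chunked is just an illustrative helper, and delete_objects takes at most 1000 keys per request):

from itertools import islice

def chunked(iterable, size=1000):
    # yield lists of up to `size` keys from the lazy iterator
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def work_batch(keys):
    # one request deletes up to 1000 objects
    s3_client.delete_objects(
        Bucket='bucket',
        Delete={'Objects': [{'Key': key} for key in keys]}
    )

with mp.Pool(initializer=init_worker) as pool:
    pool.map(work_batch, chunked(lazy_iterator_of_billion_keys))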