
I am not able to find any solution for recursively copying contents from one key to another in S3 buckets using boto in Python.

Suppose a bucket B1 has a key structure like B1/x/*. I want to copy all the objects recursively from keys like B1/x/* to B1/y/*.

Brendan Abel
Nitish Agarwal

4 Answers


There is not "directory" in S3. Those "/" separator is just part of object name, that's why boto doesn't have such features. Either write a script to deal with it or use third party tools.

The AWS customer apps page lists s3browser, which provides such arbitrary directory-copying functionality. The typical free version only spawns two threads to move files; the paid version lets you specify more threads and run faster.

Or you can just write a script that uses s3.client.copy_object to copy each file to another name, then delete the originals afterwards, e.g.

import boto3
s3 = boto3.client("s3")
# list_objects_v2() gives more info than list_objects()

more_objects = True
found_token = None
while more_objects:
  if found_token:
    response = s3.list_objects_v2(
      Bucket="mybucket",
      Prefix="B1/x/",
      Delimiter="/",
      ContinuationToken=found_token)
  else:
    response = s3.list_objects_v2(
      Bucket="mybucket",
      Prefix="B1/x/",
      Delimiter="/")
  # use copy_object (or copy_from) on every key that was listed
  for source in response["Contents"]:
    raw_name = source["Key"].split("/")[-1]
    new_name = "new_structure/{}".format(raw_name)
    s3.copy_object(
      CopySource={"Bucket": "mybucket", "Key": source["Key"]},
      Bucket="mybucket",
      Key=new_name)
  # now check whether there are more objects to list
  if "NextContinuationToken" in response:
    found_token = response["NextContinuationToken"]
    more_objects = True
  else:
    more_objects = False

** IMPORTANT NOTES ** : list_objects only returns a maximum of 1000 keys per call, and MaxKeys will not raise that limit. So you must use list_objects_v2 and check whether NextContinuationToken is returned to know whether there are more objects, repeating until the listing is exhausted.
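
Alternatively, boto3's built-in paginator does the ContinuationToken bookkeeping for you and, without a Delimiter, lists everything under the prefix. A minimal sketch, reusing the bucket and prefixes from the question:

import boto3

s3 = boto3.client("s3")

# the paginator issues repeated list_objects_v2 calls until the listing is exhausted
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="mybucket", Prefix="B1/x/"):
    for obj in page.get("Contents", []):
        # keep the hierarchy below the prefix by swapping only the leading prefix
        new_key = obj["Key"].replace("B1/x/", "B1/y/", 1)
        s3.copy_object(
            CopySource={"Bucket": "mybucket", "Key": obj["Key"]},
            Bucket="mybucket",
            Key=new_key)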

mootmoot
  • 12,845
  • 5
  • 47
  • 44
  • I know everything in S3 is like key-value things. What I want to do is copy object contents recursively from one part of a key to another. – Nitish Agarwal May 12 '16 at 08:25
  • Can you suggest any other Python package for achieving it? – Nitish Agarwal May 12 '16 at 08:26
  • Just give s3browser a try. Otherwise you need to write your own script. – mootmoot May 12 '16 at 08:30
  • A new user claims (in an answer, which I've flagged for deletion) that "Since NextContinuationToken always contain a value this loop will never end.". I assume that was meant as a reply to this answer. – Mark Amery Jan 07 '17 at 14:17
  • @MarkAmery : it will end. The last loop doesn't contain a `NextContinuationToken` element. – mootmoot Jan 08 '17 at 14:14

Just trying to build on the previous answer:

import os

import boto3

s3 = boto3.client('s3')


def copyFolderFromS3(pathFrom, bucketTo, locationTo):
    # pathFrom is an s3:// URL such as s3://bucket/prefix/
    if pathFrom.startswith('s3://'):
        getBucket = pathFrom.split('/')[2]
        location = '/'.join(pathFrom.split('/')[3:])
        copy_source = { 'Bucket': getBucket, 'Key': location }
        uploadKey = locationTo
        recursiveCopyFolderToS3(copy_source, bucketTo, uploadKey)


def recursiveCopyFolderToS3(src, uplB, uplK):
    more_objects = True
    found_token = None
    while more_objects:
        if found_token:
            response = s3.list_objects_v2(
                Bucket=src['Bucket'],
                Prefix=src['Key'],
                Delimiter="/",
                ContinuationToken=found_token)
        else:
            response = s3.list_objects_v2(
                Bucket=src['Bucket'],
                Prefix=src['Key'],
                Delimiter="/")
        for source in response["Contents"]:
            raw_name = source["Key"].split("/")[-1]
            new_name = os.path.join(uplK, raw_name)
            if raw_name.endswith('_$folder$'):
                # recurse into "_$folder$" placeholder objects (created by some tools for empty folders)
                src["Key"] = source["Key"].replace('_$folder$', '/')
                new_name = new_name.replace('_$folder$', '')
                recursiveCopyFolderToS3(src, uplB, new_name)
            else:
                src['Key'] = source["Key"]
                s3.copy_object(CopySource=src, Bucket=uplB, Key=new_name)
        # check whether there are more objects to list
        if "NextContinuationToken" in response:
            found_token = response["NextContinuationToken"]
            more_objects = True
        else:
            more_objects = False

Or you can also use the aws CLI, which is installed by default on EC2/EMR machines.

import subprocess

# path and uploadUrl hold the source and destination, e.g. "s3://B1/x/" and "s3://B1/y/"
cmd = 'aws s3 cp ' + path + ' ' + uploadUrl + ' --recursive'
p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
p.communicate()
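
If you prefer to avoid shell=True, a roughly equivalent sketch passes the arguments as a list (using the same hypothetical path and uploadUrl variables):

import subprocess

# passing a list avoids shell quoting issues; check=True raises if the copy fails
subprocess.run(["aws", "s3", "cp", path, uploadUrl, "--recursive"], check=True)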
piggs_boson
Suyash Rathi
  • Funny that the aws cli is actually written in Python, yet the official Python library has such limitations! I found this answer, which may also be helpful: https://stackoverflow.com/a/25327704/1937263 – makasprzak Apr 18 '18 at 11:07
  • In `recursiveCopyFolderToS3`, if else loop seems reversed. If found_token, then it should set ContinuationToken. Also, the if loop (if "NextContinuationToken" in response) can be outside the for loop – Dhawal Aug 15 '18 at 01:04

Instead of using boto3, I opted for the aws CLI and sh. See the aws s3 cp docs for the full list of arguments, which you can pass as kwargs to the following function (reworked from my own code); it can be used to copy to / from / between S3 buckets and / or local targets:

import sh  #  also assumes aws-cli has been installed

def s3_cp(source, target, **kwargs):
    """
    Copy data from source to target. Include flags as kwargs
    such as recursive=True and include=xyz
    """
    args = []
    for flag_name, flag_value in kwargs.items():
        if flag_value is False:  # i.e. quiet=False means omit --quiet entirely
            continue
        args.append(f"--{flag_name}")
        if flag_value is not True:  # i.e. quiet=True means a bare --quiet
            args.append(flag_value)
    args += [source, target]
    sh.aws("s3", "cp", *args)

bucket to bucket (as per the OP's question):

s3_cp("s3://B1/x/", "s3://B1/y/", quiet=True, recursive=True)

or bucket to local:

s3_cp("s3://B1/x/", "my-local-dir/", quiet=True, recursive=True)

Personally I found that this method improved the transfer time of a few GB spread across 20k small files from a couple of hours to a few minutes compared to boto3. Perhaps under the hood it's doing some threading or simply opening a few connections at once, but that's just speculation.

Warning: it won't work on Windows.

Related: https://stackoverflow.com/a/46680575/1571593

kd88

Another boto3 alternative, using the higher-level resource API rather than the client:

import os

import boto3


def copy_prefix_within_s3_bucket(
    endpoint_url: str,
    bucket_name: str,
    old_prefix: str,
    new_prefix: str,
) -> None:
    bucket = boto3.resource(
        "s3",
        endpoint_url=endpoint_url,
        aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
        aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    ).Bucket(bucket_name)
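    # objects.filter() paginates automatically, so no ContinuationToken handling is needed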
    for obj in bucket.objects.filter(Prefix=old_prefix):
        old_key = obj.key
        new_key = old_key.replace(old_prefix, new_prefix)
        copy_source = {"Bucket": bucket_name, "Key": old_key}
        bucket.copy(copy_source, new_key)


if __name__ == "__main__":
    copy_prefix_within_s3_bucket(
        endpoint_url="my_endpoint_url",
        bucket_name="my_bucket_name",
        old_prefix="my_old_prefix",
        new_prefix="my_new_prefix",
    )
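
If you also want to remove the source objects after copying (turning the copy into a move), the loop inside copy_prefix_within_s3_bucket could be extended along these lines (a sketch, not tested; names reuse those from the function above):

    for obj in bucket.objects.filter(Prefix=old_prefix):
        old_key = obj.key
        new_key = old_key.replace(old_prefix, new_prefix)
        bucket.copy({"Bucket": bucket_name, "Key": old_key}, new_key)
        obj.delete()  # delete the source object once its copy exists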
swimmer