22

Tried this:

import boto3
from boto3.s3.transfer import TransferConfig, S3Transfer
path = "/temp/"
fileName = "bigFile.gz" # this happens to be a 5.9 Gig file
client = boto3.client('s3', region)
config = TransferConfig(
    multipart_threshold=4*1024, # number of bytes
    max_concurrency=10,
    num_download_attempts=10,
)
transfer = S3Transfer(client, config)
transfer.upload_file(path+fileName, 'bucket', 'key')

Result: 5.9 gig file on s3. Doesn't seem to contain multiple parts.

I found this example, but part is not defined.

import boto3

bucket = 'bucket'
path = "/temp/"
fileName = "bigFile.gz"
key = 'key'

s3 = boto3.client('s3')

# Initiate the multipart upload and send the part(s)
mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
with open(path+fileName,'rb') as data:
    part1 = s3.upload_part(Bucket=bucket
                           , Key=key
                           , PartNumber=1
                           , UploadId=mpu['UploadId']
                           , Body=data)

# Next, we need to gather information about each part to complete
# the upload. Needed are the part number and ETag.
part_info = {
    'Parts': [
        {
            'PartNumber': 1,
            'ETag': part['ETag']
        }
    ]
}

# Now the upload works!
s3.complete_multipart_upload(Bucket=bucket
                             , Key=key
                             , UploadId=mpu['UploadId']
                             , MultipartUpload=part_info)

Question: Does anyone know how to use the multipart upload with boto3?

John Rotenstein
  • 241,921
  • 22
  • 380
  • 470
blehman
  • 1,870
  • 7
  • 28
  • 39
  • Just saw your question when looking for another topic; you may want to have a look at s3.transfer, which seems to handle multipart automatically: http://boto3.readthedocs.org/en/latest/_modules/boto3/s3/transfer.html (never tested it, though). Also note that when doing multipart you will not see multiple parts on S3, but one single file. As per the AWS documentation: after all parts of your object are uploaded, Amazon S3 assembles these parts and creates the object. – Tom Feb 08 '16 at 10:12
  • 1
    @Tom Earlier, using boto 2.x, we were able to define chunk_size, but with boto3 we don't have any option to set the chunk_size. I think he is talking about that. http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.create_multipart_upload – Leo Prince Jul 12 '16 at 11:40
  • Good example: https://gist.github.com/teasherm/bb73f21ed2f3b46bc1c2ca48ec2c1cf5 – Ualter Jr. Jun 13 '20 at 15:46
  • part should be part1 – jjbskir Jan 11 '23 at 19:15

7 Answers

14

Your code was already correct. Indeed, a minimal example of a multipart upload just looks like this:

import boto3
s3 = boto3.client('s3')
s3.upload_file('my_big_local_file.txt', 'some_bucket', 'some_key')

You don't need to explicitly ask for a multipart upload, or use any of the lower-level functions in boto3 that relate to multipart uploads. Just call upload_file, and boto3 will automatically use a multipart upload if your file size is above a certain threshold (which defaults to 8MB).
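
If you do want more control over how the transfer is split up, you can still pass a TransferConfig to upload_file; in particular, multipart_chunksize controls the size of each part. A minimal sketch (the file and bucket names are the same placeholders as above):

import boto3
from boto3.s3.transfer import TransferConfig

MB = 1024 ** 2
config = TransferConfig(
    multipart_threshold=8 * MB,   # files larger than this are sent as multipart uploads
    multipart_chunksize=64 * MB,  # size of each part
    max_concurrency=10,           # number of parts uploaded in parallel
)

s3 = boto3.client('s3')
s3.upload_file('my_big_local_file.txt', 'some_bucket', 'some_key', Config=config)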

You seem to have been confused by the fact that the end result in S3 wasn't visibly made up of multiple parts:

Result: 5.9 gig file on s3. Doesn't seem to contain multiple parts.

... but this is the expected outcome. The whole point of the multipart upload API is to let you upload a single file over multiple HTTP requests and end up with a single object in S3.
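
If you want to double-check that a multipart upload actually took place, one rough way (a sketch; bucket and key are the same placeholders as above) is to look at the object's ETag, which for multipart uploads usually has the form "<hash>-<number of parts>", or to ask S3 about a single part:

import boto3

s3 = boto3.client('s3')

head = s3.head_object(Bucket='some_bucket', Key='some_key')
print(head['ETag'])  # ends in "-<n>" for a multipart-uploaded object

# Requesting a specific part returns a PartsCount field for multipart objects
head_part = s3.head_object(Bucket='some_bucket', Key='some_key', PartNumber=1)
print(head_part.get('PartsCount'))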

Mark Amery
  • 143,130
  • 81
  • 406
  • 459
11

As described in the official boto3 documentation:

The AWS SDK for Python automatically manages retries and multipart and non-multipart transfers.

The management operations are performed by using reasonable default settings that are well-suited for most scenarios.

So all you need to do is set the desired multipart threshold value, which indicates the minimum file size above which the Python SDK automatically handles the upload as a multipart upload:

import boto3
from boto3.s3.transfer import TransferConfig

# Set the desired multipart threshold value (5GB)
GB = 1024 ** 3
config = TransferConfig(multipart_threshold=5*GB)

# Perform the transfer
s3 = boto3.client('s3')
s3.upload_file('FILE_NAME', 'BUCKET_NAME', 'OBJECT_NAME', Config=config)

Moreover, you can also control the multithreading used for multipart transfers by setting max_concurrency:

# To consume less downstream bandwidth, decrease the maximum concurrency
config = TransferConfig(max_concurrency=5)

# Download an S3 object
s3 = boto3.client('s3')
s3.download_file('BUCKET_NAME', 'OBJECT_NAME', 'FILE_NAME', Config=config)

And finally, in case you want to perform a multipart transfer in a single thread, just set use_threads=False:

# Disable thread use/transfer concurrency
config = TransferConfig(use_threads=False)

s3 = boto3.client('s3')
s3.download_file('BUCKET_NAME', 'OBJECT_NAME', 'FILE_NAME', Config=config)

Complete source code with explanation: Python S3 Multipart File Upload with Metadata and Progress Indicator

ybonda
  • 1,546
  • 25
  • 38
6

I would advise you to use boto3.s3.transfer for this purpose. Here is an example:

import boto3


def upload_file(filename):
    session = boto3.Session()
    s3_client = session.client("s3")

    try:
        print("Uploading file: {}".format(filename))

        tc = boto3.s3.transfer.TransferConfig()
        t = boto3.s3.transfer.S3Transfer(client=s3_client, config=tc)

        t.upload_file(filename, "my-bucket-name", "name-in-s3.dat")

    except Exception as e:
        print("Error uploading: {}".format(e))

Martin Thoma
  • 124,992
  • 159
  • 614
  • 958
deadcode
  • 2,226
  • 1
  • 20
  • 29
  • 6
    That's interesting, but not in all cases... suppose, hypothetically, that you are uploading 487GB and want to stop (or it crashes after 95 minutes, etc.) and you want to resume from the point where it stopped, be it 295GB, 387GB, whatever. How can you start from the point where you stopped? How can you identify the parts that you have already uploaded (they are already on S3) from the parts that you still need to upload? And how can you resume from that specific part? A low-level multipart upload: https://gist.github.com/teasherm/bb73f21ed2f3b46bc1c2ca48ec2c1cf5 – Ualter Jr. Jun 13 '20 at 15:43
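
Regarding the resume question in the comment above: the parts that already reached S3 can be discovered with the low-level list_multipart_uploads and list_parts calls. A rough sketch, assuming a single unfinished upload for the key and fewer than 1000 parts (bucket, key and file names are illustrative):

import boto3

s3 = boto3.client('s3')
bucket, key = 'my-bucket', 'bigFile.gz'  # illustrative names

# Find the unfinished multipart upload for this key
uploads = s3.list_multipart_uploads(Bucket=bucket).get('Uploads', [])
upload_id = next(u['UploadId'] for u in uploads if u['Key'] == key)

# Parts already stored on S3 (paginate via PartNumberMarker if there can be more than 1000)
existing = s3.list_parts(Bucket=bucket, Key=key, UploadId=upload_id).get('Parts', [])
done = {p['PartNumber']: p['ETag'] for p in existing}
print(sorted(done))  # part numbers that do not need to be re-sent

The missing parts can then be re-sent with upload_part (using the same UploadId and the same part size as the original attempt) and the upload finished with complete_multipart_upload.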
0

In your code snippet, part clearly should be part1 in the dictionary. Typically you would have several parts (otherwise why use multipart upload at all), and the 'Parts' list would contain an element for each part.
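
For example, a minimal sketch of such a loop, sending the file in several parts and collecting one entry per part (part size and names are illustrative; every part except the last must be at least 5 MB):

import boto3

s3 = boto3.client('s3')
bucket, key = 'bucket', 'key'   # illustrative names
part_size = 100 * 1024 ** 2     # 100 MB per part

mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts = []
with open('/temp/bigFile.gz', 'rb') as f:
    part_number = 1
    while True:
        chunk = f.read(part_size)
        if not chunk:
            break
        resp = s3.upload_part(Bucket=bucket, Key=key, PartNumber=part_number,
                              UploadId=mpu['UploadId'], Body=chunk)
        parts.append({'PartNumber': part_number, 'ETag': resp['ETag']})
        part_number += 1

s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=mpu['UploadId'],
                             MultipartUpload={'Parts': parts})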

You may also be interested in the new pythonic interface to dealing with S3: http://s3fs.readthedocs.org/en/latest/
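
For completeness, a tiny sketch of the s3fs route (names are illustrative; s3fs splits large files into parts for you, and its exact API may vary between versions):

import s3fs  # pip install s3fs

fs = s3fs.S3FileSystem()                  # uses your normal AWS credentials
fs.put('/temp/bigFile.gz', 'bucket/key')  # large uploads are chunked automatically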

mdurant
  • 27,272
  • 5
  • 45
  • 74
0

Why not just use the copy option in boto3?

s3.copy(
    CopySource={'Bucket': sourceBucket, 'Key': sourceKey},
    Bucket=targetBucket,
    Key=targetKey,
    ExtraArgs={'ACL': 'bucket-owner-full-control'},
)

There are details on how to initialise the s3 object, and further options for the call, available in the boto3 docs.

gdlmx
  • 6,479
  • 1
  • 21
  • 39
  • 4
    This will fail for any source object larger than 5 GiB – Noah Yetter Nov 02 '18 at 17:11
  • 2
    @NoahYetter No, this is not the case. Because it performs a multipart copy, it allows for greater size than 5 GB. – Kristoffer Bakkejord Dec 17 '18 at 18:26
  • 2
    @KristofferBakkejord as of the time of my comment, that was not the case. I was doing exactly this, and getting failures due to file size. I had to implement multipart upload by hand. I do see that the s3 client's copy method's documentation now indicates multipart is automatic. – Noah Yetter Feb 08 '19 at 22:17
0

copy from boto3 is a managed transfer which will perform a multipart copy in multiple threads if necessary.

https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Object.copy

This works with objects greater than 5 GB, and I have already tested it.
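
A minimal sketch of that managed copy (bucket and key names are illustrative; the Config argument is optional and only needed if you want to tune when the copy switches to multipart):

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.resource('s3')
config = TransferConfig(multipart_threshold=5 * 1024 ** 3)  # e.g. only go multipart above 5 GB

copy_source = {'Bucket': 'source-bucket', 'Key': 'source-key'}
s3.Object('target-bucket', 'target-key').copy(copy_source, Config=config)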

Sunny Nazar
  • 146
  • 6
-1

Change part to part1:

import boto3

bucket = 'bucket'
path = "/temp/"
fileName = "bigFile.gz"
key = 'key'

s3 = boto3.client('s3')

# Initiate the multipart upload and send the part(s)
mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
with open(path + fileName, 'rb') as data:
    part1 = s3.upload_part(Bucket=bucket,
                           Key=key,
                           PartNumber=1,
                           UploadId=mpu['UploadId'],
                           Body=data)

# Next, we need to gather information about each part to complete
# the upload. Needed are the part number and ETag.
part_info = {
    'Parts': [
        {
            'PartNumber': 1,
            'ETag': part1['ETag']
        }
    ]
}

# Now the upload works!
s3.complete_multipart_upload(Bucket=bucket,
                             Key=key,
                             UploadId=mpu['UploadId'],
                             MultipartUpload=part_info)

sarath kumar
  • 1,265
  • 10
  • 6