
I have put together a script which uploads data to S3. If the file is less than 5MB, it uploads it as one chunk; if the file is larger, it does a multipart upload. I know the thresholds are currently small; I am only using them to test the script for now. If I run the script from Python by importing every function and calling them that way, everything works as intended. I am aware the code still needs cleaning up, as it is not complete yet. However, when I run the script from the command line I am greeted with this error:

Traceback (most recent call last):
  File "upload_files_to_s3.py", line 106, in <module>
    main()
  File "upload_files_to_s3.py", line 103, in main
    check_if_mp_needed(conn, input_file, mb, bucket_name, sub_directory)
  File "upload_files_to_s3.py", line 71, in check_if_mp_needed
    multipart_upload(conn, input_file, mb, bucket_name, sub_directory)
  File "upload_files_to_s3.py", line 65, in multipart_upload
    mp.complete_upload()
  File "/usr/local/lib/python2.7/site-packages/boto/s3/multipart.py", line 304, in complete_upload
    self.id, xml)
  File "/usr/local/lib/python2.7/site-packages/boto/s3/bucket.py", line 1571, in complete_multipart_upload
    response.status, response.reason, body)
boto.exception.S3ResponseError: S3ResponseError: 400 Bad Request

> The XML you provided was not well-formed or did not validate against our published schema

Here is the code:

import sys
import boto
from boto.s3.key import Key
import os
import math
from filechunkio import FileChunkIO


KEY = os.environ['AWS_ACCESS_KEY_ID']
SECRET = os.environ['AWS_SECRET_ACCESS_KEY']

def start_connection():
    key = KEY
    secret = SECRET
    return boto.connect_s3(key, secret)

def get_bucket_key(conn, bucket_name):
    bucket = conn.get_bucket(bucket_name)
    k = Key(bucket)
    return k

def get_key_name(sub_directory, input_file):
    full_key_name = os.path.join(sub_directory, os.path.basename(input_file))
    return full_key_name

def get_file_info(input_file):
    source_size = os.stat(input_file).st_size
    return source_size

def multipart_request(conn, input_file, bucket_name, sub_directory):
    bucket = conn.get_bucket(bucket_name)
    mp = bucket.initiate_multipart_upload(get_key_name(sub_directory, input_file))
    return mp

def get_chunk_size(mb):
    chunk_size = mb * 1048576
    return chunk_size

def get_chunk_count(input_file, mb):
    chunk_count = int(math.ceil(get_file_info(input_file)/float(get_chunk_size(mb))))
    return chunk_count

def regular_upload(conn, input_file, bucket_name, sub_directory):
    k = get_bucket_key(conn, bucket_name)
    k.key = get_key_name(sub_directory, input_file)
    k.set_contents_from_filename(input_file)


def multipart_upload(conn, input_file, mb, bucket_name, sub_directory):
    chunk_size = get_chunk_size(mb)
    chunks = get_chunk_count(input_file, mb)
    source_size = get_file_info(input_file)
    mp = multipart_request(conn, input_file, bucket_name, sub_directory)
    for i in range(chunks):
        offset = chunk_size * i
        b = min(chunk_size, source_size - offset)
        with FileChunkIO(input_file, 'r', offset = offset, bytes = b) as fp:
            mp.upload_part_from_file(fp, part_num = i + 1)
    mp.complete_upload()

def check_if_mp_needed(conn, input_file, mb, bucket_name, sub_directory):
    if get_file_info(input_file) <= 5242880:
        regular_upload(conn, input_file, bucket_name, sub_directory)
    else:
        multipart_upload(conn, input_file, mb, bucket_name, sub_directory)

def main():
    input_file = sys.argv[1]
    mb = sys.argv[2]
    bucket_name = sys.argv[3]
    sub_directory = sys.argv[4]
    conn = start_connection()
    check_if_mp_needed(conn, input_file, mb, bucket_name, sub_directory)

if __name__ == '__main__':
    main()
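
For reference, when I test by importing the functions (e.g. in IPython), I call them roughly like this; the file name, chunk size, bucket, and sub-directory below are just placeholders:

from upload_files_to_s3 import start_connection, check_if_mp_needed

conn = start_connection()
# placeholder file, 5 MB chunk size, placeholder bucket and sub-directory
check_if_mp_needed(conn, 'test_file.dat', 5, 'my-test-bucket', 'uploads')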

Thanks!


1 Answer


You have a version mismatch between your two cases. When you run with the older version of boto, it uses an outdated schema for the AWS request, and so you see the error.

In a bit more detail, when running in IPython (using the virtualenv) you have version 2.45.0 and when running from the command line you have version 2.8.0 of boto. Given that version 2.8.0 dates back to 2013, it's not surprising that you get a schema error.
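
A quick way to confirm which copy of boto each environment is using is to print the version and the module path in both IPython and a plain python shell:

import boto
print(boto.__version__)   # e.g. 2.45.0 in the virtualenv vs 2.8.0 system-wide
print(boto.__file__)      # the path shows whether the system or virtualenv copy is loaded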

The fix is either to upgrade the system version of boto (which is what your script is currently picking up) by running `pip install -U boto`, or to convert your script to use the virtual environment. For advice on the latter, see this other answer on SO: Running python script from inside virtualenv bin is not working
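
For example, invoking the script with the virtualenv's interpreter explicitly will pick up the newer boto (the path and arguments here are placeholders): `/path/to/your/venv/bin/python upload_files_to_s3.py big_file.dat 5 my-bucket uploads`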

Peter Brittain
  • I did `pip install -U boto` and it changed both versions to 2.46.1. I am still getting the same error message however. Keep in mind when I ran this, this last time I did not use a `virtualenv` – gold_cy Feb 27 '17 at 16:44
  • And you still see the different result in IPython from running in the command line? – Peter Brittain Feb 27 '17 at 16:56
  • Nope, they both print the same version, `2.46.1` – gold_cy Feb 27 '17 at 16:56
  • Sorry - I wasn't clear... I meant, do they both now fail with the same error, or does IPython still work for you (and the command line fails)? – Peter Brittain Feb 27 '17 at 17:59
  • IPython still works just fine. Only command line fails as before :-( – gold_cy Feb 27 '17 at 19:31
  • Beats me... I'm wondering if you could be hitting this issue (https://github.com/boto/boto/issues/3536) but don't see how you can be uploading no data. – Peter Brittain Feb 28 '17 at 00:42
  • I will try this tonight after work and report back. Thanks for your help so far. – gold_cy Feb 28 '17 at 18:59
  • That error you showed me is exactly like mine, however I am uploading data, which is weird. My code works perfectly fine if the file threshold makes it upload as a normal upload. I guess I will have to stick to the AWS CL tools for this, thanks so much for your help. What do the guidelines dictate about accepting? I want to accept, however if someone stumbles here later they won't necessarily have a solution. – gold_cy Mar 01 '17 at 14:51
  • I suggest you either mark the response as helpful or award the bounty, but don't mark the question as answered. See http://stackoverflow.com/help/bounty – Peter Brittain Mar 01 '17 at 17:03
  • I awarded the bounty and left the question unsolved. Thanks for your patience and help. – gold_cy Mar 01 '17 at 17:23
  • Thanks. Am happy to carry on if you get more information... We should probably move to a chat channel (http://chat.stackoverflow.com/rooms/new), though. – Peter Brittain Mar 01 '17 at 18:01