235

I'm aware that with Boto 2 it's possible to open an S3 object as a string with: get_contents_as_string()

Is there an equivalent function in boto3?

John Rotenstein
Gahl Levy

8 Answers

366

read will return bytes. At least for Python 3, if you want to return a string, you have to decode using the right encoding:

import boto3

s3 = boto3.resource('s3')

obj = s3.Object(bucket, key)  # bucket = bucket name, key = the object's key (file path within the bucket)
obj.get()['Body'].read().decode('utf-8')
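If you're not sure the object is valid UTF-8, a hedged variant (assuming replacement characters are acceptable for your use case) is to pass an errors argument to decode:

# decode with replacement to avoid UnicodeDecodeError on malformed bytes
text = obj.get()['Body'].read().decode('utf-8', errors='replace')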
Kamil Sindi
  • 1
to get this answer to work, I had to `import botocore` as `obj.get()['Body']` is of type `botocore.response.StreamingBody` – Tzunghsing David Wong Sep 29 '17 at 02:45
  • 2
    @TzunghsingDavidWong you shouldn't have to import a package to call methods on an existing object, right? Was that maybe only necessary while experimenting? – Ken Williams Oct 06 '17 at 21:49
  • 2
what is the value of key in `obj = s3.Object(bucket, key)`? Is bucket the bucket name and key the file name? Please correct me if I'm wrong... – Amaresh Jana Nov 21 '17 at 05:19
  • 2
    @Amaresh yes, bucket = bucket name and key = filename – Tipster Jan 26 '18 at 22:55
  • 1
if a key is in PDF format, will it work? Or please suggest another useful way. I tried `import textract; text = textract.process('path/to/a.pdf', method='pdfminer')` and it throws an import error – Arun Kumar Feb 27 '18 at 05:01
  • 1
@gatsby-lee's answer below is MUCH faster than this. I get 120 MB/s vs 24 MB/s – Jakobovski Nov 11 '20 at 14:43
  • This should be the accepted answer. Simplest way for simple cases. If you deal with big files, it is obvious you would want some stream support and multi part, but it is advanced usage. – zenbeni Mar 20 '23 at 16:25
172

I had a problem reading/parsing the object from S3 because of .get(), using Python 2.7 inside an AWS Lambda.

I added json to the example to show that it becomes parsable :)

import boto3
import json

s3 = boto3.client('s3')

obj = s3.get_object(Bucket=bucket, Key=key)
j = json.loads(obj['Body'].read())

NOTE (for Python 2.7): my object is all ASCII, so I don't need .decode('utf-8')

NOTE (for Python 3): we moved to Python 3 and discovered that read() now returns bytes, so if you want to get a string out of it, you must use:

j = json.loads(obj['Body'].read().decode('utf-8'))
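In the Lambda context, the bucket and key usually come from the trigger event rather than being hard-coded. A minimal sketch, assuming a standard S3 put trigger (the event shape below follows the S3 notification format; names are otherwise placeholders):

import boto3
import json

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # for an S3 trigger, the bucket and key arrive in the event record
    # (note: keys with special characters are URL-encoded in the event)
    record = event['Records'][0]['s3']
    obj = s3.get_object(Bucket=record['bucket']['name'], Key=record['object']['key'])
    return json.loads(obj['Body'].read().decode('utf-8'))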

EvgenyKolyakov
86

This isn't in the boto3 documentation. This worked for me:

object.get()["Body"].read()

object being an S3 object: http://boto3.readthedocs.org/en/latest/reference/services/s3.html#object
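For completeness, a sketch of how such an object is obtained from the resource API (the bucket and key names here are hypothetical placeholders):

import boto3

s3 = boto3.resource('s3')
object = s3.Object('my-bucket', 'my/key.txt')  # hypothetical bucket and key
data = object.get()["Body"].read()  # returns bytes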

Gahl Levy
  • 1
    assuming "Body" contains string data, ou can use object.get()["Body"].read() to convert to a Python string. – roehrijn Nov 24 '15 at 12:59
  • 35
boto3 has terrible docs, as of 2016. – Andrew_1510 Feb 25 '16 at 16:50
  • 5
    http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Object.get tells us the return value is a dict, with a key "Body" of type StreamingBody, searching for that in read the docs gets you to http://botocore.readthedocs.io/en/latest/reference/response.html which will tell you to use read(). – jeffrey Apr 04 '17 at 22:52
  • 8
    seems that now `get expected at least 1 arguments, got 0`. Remove the `get()` and access the "Body" object property directly – lurscher Dec 13 '18 at 16:33
51

Python 3 + boto3 client API approach.

By using the S3.Client.download_fileobj API and a Python file-like object, the S3 object's content can be retrieved into memory.

Since the retrieved content is bytes, in order to convert it to str, it needs to be decoded.

import io
import boto3

client = boto3.client('s3')
bytes_buffer = io.BytesIO()
client.download_fileobj(Bucket=bucket_name, Key=object_key, Fileobj=bytes_buffer)
byte_value = bytes_buffer.getvalue()
str_value = byte_value.decode()  # in Python 3 the default encoding is utf-8
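If the object is too large to hold in memory comfortably (see the memory-pressure comment below), the same client can write to a temporary file instead; a hedged sketch using download_file (the /tmp path is an arbitrary choice):

client.download_file(bucket_name, object_key, '/tmp/object.bin')  # streams to disk
with open('/tmp/object.bin', 'rb') as f:
    byte_value = f.read()  # or process the file incrementally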
Gatsby Lee
  • 2
This is MUCH faster than the `object.get()["Body"].read()` method. – Jakobovski Nov 11 '20 at 14:41
FYI, if the content size is big, you will have memory pressure. – Gatsby Lee Jun 23 '21 at 18:52
  • honestly, although I am the writer of this reply, I keep coming back to refer the code again. lol – Gatsby Lee Sep 27 '22 at 00:04
  • 2
This should be the answer. This is faster than any other method (@Jakobovski) because it uses multi-part download. Basically it spreads the download of different pieces of the file across multiple threads and then merges the results together – ciurlaro Nov 22 '22 at 15:11
  • @ciurlaro can you share where I can find out the logic you described? – Gatsby Lee Nov 23 '22 at 16:05
  • 1
    @GatsbyLee you can find it mentioned [here](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.download_fileobj): *"[...] will perform a multipart download in multiple threads if necessary"*. Notice that you can customise furtherly the parallelisation details (concurrency, batch_size, etc) specifying the `Config` which must be an instance of [the class `TransferConfig`](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/customizations/s3.html#boto3.s3.transfer.TransferConfig) – ciurlaro Nov 24 '22 at 12:59
  • 1
    @GatsbyLee I made a separate answer with an example, it was too much code to be written here – ciurlaro Nov 24 '22 at 14:21
4

Decoding the whole object body to one string:

obj = s3.Object(bucket, key).get()
big_str = obj['Body'].read().decode()

Decoding the object body to strings line-by-line:

import csv

obj = s3.Object(bucket, key).get()
reader = csv.reader(line.decode() for line in obj['Body'].iter_lines())

The default encoding in bytes.decode() has been 'utf-8' since Python 3.

When decoding as JSON, no need to convert to string, as json.loads accepts bytes too, since Python 3.6:

import json

obj = s3.Object(bucket, key).get()
json.loads(obj['Body'].read())
ericbn
4

Fastest approach

As stated in the documentation (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.download_fileobj), download_fileobj uses parallelisation:

This is a managed transfer which will perform a multipart download in multiple threads if necessary.

Quoting the AWS documentation:

You can retrieve a part of an object from S3 by specifying the part number in GetObjectRequest. TransferManager uses this logic to download all parts of an object asynchronously and writes them to individual, temporary files. The temporary files are then merged into the destination file provided by the user.


This can be exploited by keeping the data in memory instead of writing it into a file.

The approach that @Gatsby Lee has shown does this, and that's the reason why it is the fastest among those listed. Anyway, it can be improved even more by using the Config parameter:

import io
import boto3
from boto3.s3.transfer import TransferConfig

client = boto3.client('s3')
buffer = io.BytesIO()

# This is just an example, parameters should be fine tuned according to:
# 1. The size of the object that is being read (bigger the file, bigger the chunks)
# 2. The number of threads available on the machine that runs this code

config = TransferConfig(
    multipart_threshold=1024 * 1024 * 25,   # Concurrent read only if object size > 25 MB
    max_concurrency=10,                     # Up to 10 concurrent readers
    multipart_chunksize=1024 * 1024 * 25,   # 25 MB chunks per reader
    use_threads=True                        # Must be True to enable multiple readers
)

# This method writes the data into the buffer
client.download_fileobj(
    Bucket=bucket_name,
    Key=object_key,
    Fileobj=buffer,
    Config=config
)

str_value = buffer.getvalue().decode()

For objects bigger than 1 GB, it is already worth it in terms of speed.
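To check whether the tuning pays off on your own objects, a quick hedged timing harness (bucket_name and object_key are placeholders, and the throughput calculation assumes the download succeeded):

import io
import time

import boto3
from boto3.s3.transfer import TransferConfig

client = boto3.client('s3')
config = TransferConfig(max_concurrency=10, use_threads=True)

buffer = io.BytesIO()
start = time.perf_counter()
client.download_fileobj(Bucket=bucket_name, Key=object_key, Fileobj=buffer, Config=config)
elapsed = time.perf_counter() - start

# report effective throughput in MB/s
print(f"{buffer.getbuffer().nbytes / elapsed / 2**20:.1f} MB/s")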

ciurlaro
1

import boto3
s3 = boto3.client('s3')
s3.get_object(Bucket=bucket_name, Key=s3_key)["Body"]

is of type <class 'botocore.response.StreamingBody'>

When you call

s3.get_object(Bucket=bucket_name, Key=s3_key)["Body"], 

you are accessing the StreamingBody object that represents the content of the S3 object as a stream. This allows you to read the data in chunks and process it incrementally.

s3.get_object(Bucket=bucket_name, Key=s3_key)["Body"].read()

On the other hand, when you call s3.get_object(Bucket=bucket_name, Key=s3_key)["Body"].read(), you are reading the entire content of the object into memory and returning it as a bytes object. This is not efficient if the object is large, as it can quickly consume a lot of memory.
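A minimal sketch of the incremental approach described above, using the StreamingBody's iter_chunks helper (the chunk size and the process function are arbitrary placeholders):

body = s3.get_object(Bucket=bucket_name, Key=s3_key)["Body"]
for chunk in body.iter_chunks(chunk_size=1024 * 1024):  # read 1 MB at a time
    process(chunk)  # hypothetical handler for each bytes chunk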

Ajay
-7

If the body contains an io.StringIO, you have to do it like below:

object.get()['Body'].getvalue()
Pyglouthon