I'm aware that with Boto 2 it's possible to open an S3 object as a string with: get_contents_as_string()
Is there an equivalent function in boto3?
read() returns bytes. At least in Python 3, if you want a string, you have to decode it using the right encoding:
import boto3
s3 = boto3.resource('s3')
obj = s3.Object(bucket, key)
obj.get()['Body'].read().decode('utf-8')
I had a problem reading/parsing the object from S3 with .get()
using Python 2.7 inside an AWS Lambda.
I added json to the example to show it becomes parsable :)
import boto3
import json
s3 = boto3.client('s3')
obj = s3.get_object(Bucket=bucket, Key=key)
j = json.loads(obj['Body'].read())
NOTE (for Python 2.7): My object is all ASCII, so I don't need .decode('utf-8')
NOTE (for Python 3): We moved to Python 3 and discovered that read() now returns bytes, so if you want to get a string out of it, you must use:
j = json.loads(obj['Body'].read().decode('utf-8'))
This isn't in the boto3 documentation. This worked for me:
object.get()["Body"].read()
where object is an s3 Object: http://boto3.readthedocs.org/en/latest/reference/services/s3.html#object
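For context, a minimal self-contained sketch of that pattern (the bucket and key names here are placeholders, not from the original answer):
import boto3
s3 = boto3.resource('s3')
obj = s3.Object('my-bucket', 'path/to/key')  # placeholder bucket/key
data = obj.get()["Body"].read()  # bytes
text = data.decode('utf-8')  # decode if you need a str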
Python 3 + boto3 API approach.
By using the S3.Client.download_fileobj API and a Python file-like object, the S3 object content can be retrieved into memory.
Since the retrieved content is bytes, it needs to be decoded in order to convert it to str.
import io
import boto3
client = boto3.client('s3')
bytes_buffer = io.BytesIO()
client.download_fileobj(Bucket=bucket_name, Key=object_key, Fileobj=bytes_buffer)
byte_value = bytes_buffer.getvalue()
str_value = byte_value.decode() #python3, default decoding is utf-8
Decoding the whole object body to one string:
obj = s3.Object(bucket, key).get()
big_str = obj['Body'].read().decode()
Decoding the object body to strings line-by-line:
import csv
obj = s3.Object(bucket, key).get()
reader = csv.reader(line.decode() for line in obj['Body'].iter_lines())
The default encoding in bytes' decode() is already 'utf-8' since Python 3.
When decoding as JSON, no need to convert to string, as json.loads accepts bytes too, since Python 3.6:
obj = s3.Object(bucket, key).get()
json.loads(obj['Body'].read())
As stated in the documentation here, download_fileobj uses parallelisation:
This is a managed transfer which will perform a multipart download in multiple threads if necessary.
Quoting aws documentation:
You can retrieve a part of an object from S3 by specifying the part number in GetObjectRequest. TransferManager uses this logic to download all parts of an object asynchronously and writes them to individual, temporary files. The temporary files are then merged into the destination file provided by the user.
This can be exploited keeping the data in memory instead of writing it into a file.
The approach that @Gatsby Lee showed does exactly that, which is why it is the fastest among those listed. It can be improved even further using the Config parameter:
import io
import boto3
from boto3.s3.transfer import TransferConfig

client = boto3.client('s3')
buffer = io.BytesIO()

# This is just an example, parameters should be fine tuned according to:
# 1. The size of the object that is being read (the bigger the file, the bigger the chunks)
# 2. The number of threads available on the machine that runs this code
config = TransferConfig(
    multipart_threshold=1024 * 1024 * 25,  # Concurrent read only if object size > 25MB
    max_concurrency=10,                    # Up to 10 concurrent readers
    multipart_chunksize=1024 * 1024 * 25,  # 25MB chunks per reader
    use_threads=True                       # Must be True to enable multiple readers
)
# This method writes the data into the buffer
client.download_fileobj(
Bucket=bucket_name,
Key=object_key,
Fileobj=buffer,
Config=config
)
str_value = buffer.getvalue().decode()
For objects bigger than 1GB, it is already worth it in terms of speed.
import boto3
s3 = boto3.client('s3')
s3.get_object(Bucket=bucket_name, Key=s3_key)["Body"]
is of type <class 'botocore.response.StreamingBody'>
When you call
s3.get_object(Bucket=bucket_name, Key=s3_key)["Body"],
you are accessing the StreamingBody object that represents the content of the S3 object as a stream. This allows you to read the data in chunks and process it incrementally.
s3.get_object(Bucket=bucket_name, Key=s3_key)["Body"].read()
On the other hand, when you call s3.get_object(Bucket=bucket_name, Key=s3_key)["Body"].read(), you are reading the entire content of the object into memory and returning it as a bytes object. This is not efficient if the object is large, as it can quickly consume a lot of memory.
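As an illustration of the incremental approach, here is a hedged sketch that processes the stream in chunks instead of reading it all at once (bucket_name, s3_key and the handle_chunk function are assumed placeholders; iter_chunks is provided by botocore's StreamingBody):
import boto3
s3 = boto3.client('s3')
body = s3.get_object(Bucket=bucket_name, Key=s3_key)["Body"]
# Read and process the stream in 1MB chunks instead of loading it all into memory
for chunk in body.iter_chunks(chunk_size=1024 * 1024):
    handle_chunk(chunk)  # hypothetical handler that receives each bytes chunk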
If the body contains an io.StringIO, you have to do it like below:
object.get()['Body'].getvalue()
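A minimal sketch of that case, assuming you have wrapped the decoded body in an io.StringIO yourself (the bucket and key names are placeholders):
import io
import boto3
s3 = boto3.resource('s3')
obj = s3.Object('my-bucket', 'path/to/key')  # placeholder bucket/key
body = io.StringIO(obj.get()['Body'].read().decode())  # wrap the decoded content in a StringIO
text = body.getvalue()  # getvalue() returns the whole string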