
Given a large gzip object in S3, what is a memory efficient (e.g. streaming) method in python3/boto3 to decompress the data and store the results back into another S3 object?

A similar question has been asked before; however, the answers there all read the contents of the gzip file into memory first (e.g. into a BytesIO buffer). Those solutions are not viable for objects that are too big to fit in main memory.

For large S3 objects the contents need to be read and decompressed "on the fly", then written to a different S3 object in some chunked fashion.

Thank you in advance for your consideration and response.

Ramón J Romero y Vigil
  • In what way does the accepted answer of [this question](https://stackoverflow.com/questions/12571913/python-unzipping-stream-of-bytes) not solve your problem? If you are running Python 3 the for loop will behave as a generator, so the unzip will be streamed. – Tim Oct 23 '20 at 12:58
  • @Tim It needs to be tied into boto3 for S3 interface. – Ramón J Romero y Vigil Oct 23 '20 at 15:44
  • How much memory are you willing to use? For streaming transfer you will always need to use some memory, because the multipart upload API requires content checksums sent in the request *headers* for each part - you'll need a memory buffer at least in the ballpark of `part_size` * `n_concurrent_parts`. Is there any reason why you need to stream, rather than simply using a temporary file? – wim Oct 23 '20 at 18:55
  • Boto3 handles this for you with meta.client.download_file. See answer. – pygeek Oct 23 '20 at 18:56
  • @wim I'm not asking for 0 memory usage, but I am asking that the memory consumed not grow with the size of the file. I suppose a temporary file could work, though it's highly inefficient. – Ramón J Romero y Vigil Oct 23 '20 at 20:13
  • You can buffer to disk but you'll have to be careful that the reader doesn't "catch up" to the writer and think it's done. I actually have code that does exactly this: backs up a db to disk while simultaneously uploading to s3. I've been meaning to open source it for a year now... – Kurt Oct 25 '20 at 03:28
  • @kurt Looking for this capability. Was it open sourced? – Henry Thornton Aug 25 '22 at 11:46
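
A rough sketch of the approach discussed in the comments above (streaming the gzip through zlib while feeding S3's multipart upload API, as wim describes) might look like the following. The bucket and key names are placeholders, and memory use stays in the ballpark of the part size rather than the object size:

import zlib
import boto3

s3 = boto3.client("s3")

SRC_BUCKET, SRC_KEY = "mybucket", "big.csv.gz"  # placeholder source object
DST_BUCKET, DST_KEY = "mybucket", "big.csv"     # placeholder destination object
PART_SIZE = 64 * 1024 * 1024  # every part except the last must be at least 5 MB

body = s3.get_object(Bucket=SRC_BUCKET, Key=SRC_KEY)["Body"]
decompressor = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # +16 => expect a gzip header

mpu = s3.create_multipart_upload(Bucket=DST_BUCKET, Key=DST_KEY)
parts, part_number, buffer = [], 1, bytearray()
try:
    # read the compressed stream in small chunks and decompress incrementally
    for chunk in body.iter_chunks(chunk_size=1024 * 1024):
        buffer += decompressor.decompress(chunk)
        if len(buffer) >= PART_SIZE:
            resp = s3.upload_part(Bucket=DST_BUCKET, Key=DST_KEY, UploadId=mpu["UploadId"],
                                  PartNumber=part_number, Body=bytes(buffer))
            parts.append({"ETag": resp["ETag"], "PartNumber": part_number})
            part_number += 1
            buffer.clear()
    buffer += decompressor.flush()
    # upload whatever is left (S3 requires at least one part to complete the upload)
    if buffer or not parts:
        resp = s3.upload_part(Bucket=DST_BUCKET, Key=DST_KEY, UploadId=mpu["UploadId"],
                              PartNumber=part_number, Body=bytes(buffer))
        parts.append({"ETag": resp["ETag"], "PartNumber": part_number})
    s3.complete_multipart_upload(Bucket=DST_BUCKET, Key=DST_KEY, UploadId=mpu["UploadId"],
                                 MultipartUpload={"Parts": parts})
except Exception:
    s3.abort_multipart_upload(Bucket=DST_BUCKET, Key=DST_KEY, UploadId=mpu["UploadId"])
    raise

The 5 MB floor on part sizes is the memory limit wim mentions; larger parts mean fewer requests but more buffering.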

1 Answer


You can use streaming methods with boto3 / S3, but as far as I know you have to define your own file-like objects.
Luckily there's smart_open which handles that for you; it also supports GCS, Azure, HDFS, SFTP and others.
Here's an example using a large sample of sales data:

import boto3
from smart_open import open

session = boto3.Session()  # you need to set auth credentials here if you don't have them set in your environment
chunk_size = 1024 * 1024  # 1 MB
# smart_open decompresses the input transparently based on the .gz extension
# and streams the output to S3 using a multipart upload
f_in = open("s3://mybucket/2m_sales_records.csv.gz", transport_params=dict(session=session), encoding="utf-8")
f_out = open("s3://mybucket/2m_sales_records.csv", "w", transport_params=dict(session=session))
byte_count = 0
while True:
    data = f_in.read(chunk_size)  # read the decompressed stream one chunk at a time
    if not data:
        break
    f_out.write(data)
    byte_count += len(data)
    print(f"wrote {byte_count} bytes so far")
f_in.close()
f_out.close()

The sample file has 2 million lines and it's 75 MB compressed and 238 MB uncompressed.
I uploaded the compressed file to mybucket and ran the code, which downloaded the file, decompressed the contents in memory chunk by chunk, and uploaded the uncompressed data back to S3.
On my computer the process took around 78 seconds (highly dependent on Internet connection speed) and never used more than 95 MB of memory. I think you can lower the memory requirements, if need be, by overriding the part size for S3 multipart uploads in smart_open:

DEFAULT_MIN_PART_SIZE = 50 * 1024**2
"""Default minimum part size for S3 multipart uploads"""
MIN_MIN_PART_SIZE = 5 * 1024 ** 2
"""The absolute minimum permitted by Amazon."""
Ionut Ticus