You can use streaming methods with boto3/S3, but as far as I know you have to define your own file-like objects.
Luckily there's smart_open, which handles that for you; it also supports GCS, Azure, HDFS, SFTP and others.
Here's an example using a large sample of sales data:
import boto3
from smart_open import open

session = boto3.Session()  # you need to set auth credentials here if you don't have them set in your environment
chunk_size = 1024 * 1024  # 1 MB

# reading a .gz decompresses it transparently; writing to S3 streams a multipart upload
f_in = open("s3://mybucket/2m_sales_records.csv.gz", transport_params=dict(session=session), encoding="utf-8")
f_out = open("s3://mybucket/2m_sales_records.csv", "w", transport_params=dict(session=session))

byte_count = 0
while True:
    data = f_in.read(chunk_size)
    if not data:
        break
    f_out.write(data)
    byte_count += len(data)
    print(f"wrote {byte_count} bytes so far")

f_in.close()
f_out.close()
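If you prefer, the same chunked copy can be written with context managers and shutil.copyfileobj from the standard library; this is just a sketch of an equivalent version of the loop above, with the same assumptions about the bucket and credentials:

import boto3
import shutil
from smart_open import open

session = boto3.Session()
params = dict(session=session)

# copyfileobj does the same read-a-chunk / write-a-chunk loop internally,
# and the with-blocks guarantee both streams are closed (completing the
# multipart upload) even if an exception is raised.
with open("s3://mybucket/2m_sales_records.csv.gz", transport_params=params, encoding="utf-8") as f_in:
    with open("s3://mybucket/2m_sales_records.csv", "w", transport_params=params) as f_out:
        shutil.copyfileobj(f_in, f_out, length=1024 * 1024)  # 1 MB chunks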
The sample file has 2 million lines; it's 75 MB compressed and 238 MB uncompressed.
I uploaded the compressed file to mybucket and ran the code, which downloaded the file, decompressed the contents in memory (smart_open infers gzip compression from the .gz extension) and uploaded the uncompressed data back to S3.
On my computer the process took around 78 seconds (highly dependent on internet connection speed) and never used more than 95 MB of memory. If you need to lower the memory footprint, I think you can do so by overriding the part size smart_open uses for S3 multipart uploads; the relevant defaults in smart_open's S3 module are:
DEFAULT_MIN_PART_SIZE = 50 * 1024**2
"""Default minimum part size for S3 multipart uploads"""
MIN_MIN_PART_SIZE = 5 * 1024 ** 2
"""The absolute minimum permitted by Amazon."""