
Context

I have a somewhat large (about 30 GB) table that I would like to move from Postgres to S3. I am trying to wrap my head around how a file-like io.BytesIO() works, and how much memory I need to provision on the machine so that I can design my code better.

What I have tried

I am using aiobotocore to move the data to S3 and asyncpg to query Postgres. I have built a sample to demonstrate the question better.

import asyncio
import gzip
import io

import aiobotocore
import asyncpg


async def happy_demo():
    # Define the connection to Postgres
    con = await asyncpg.connect(**postgres_credentials)
    # Get a handle onto an aiobotocore session
    session = aiobotocore.get_session()
    # Create file-like object
    file = io.BytesIO()
    # Create the S3 client
    async with session.create_client(**aws_credentials) as client:
        # Compress the COPY output into the in-memory buffer
        with gzip.GzipFile(fileobj=file, mode='wb') as gz:
            await con.copy_from_query(query='select * from bar', output=gz, format='csv')
        # Write to S3 (the gzip stream is only finalized once its with-block exits)
        await client.put_object(Bucket="happy_bucket", Key="foo.csv", Body=file.getvalue())
    # Close the connection.
    await con.close()


def main():
    loop = asyncio.get_event_loop()
    loop.run_until_complete(happy_demo())


if __name__ == '__main__':
    main()

What I am trying to understand

In particular, I am trying to understand: if the table is around 30 GB, should I expect to need a machine with at least 30 GB of RAM to perform this operation?
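
As far as I can tell, a BytesIO buffer keeps everything that is written to it in memory, so its footprint grows with the data. A tiny check (the 10 MB figure is just illustrative):

import io

buf = io.BytesIO()
buf.write(b'x' * (10 * 1024 * 1024))  # write 10 MB into the buffer
print(buf.getbuffer().nbytes)         # 10485760 -- the whole payload stays in RAM

If that also holds for my demo above, the buffer would end up holding the entire (gzipped) COPY output at once.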

Update

I have updated my code to write a gzipped file and slightly reworded my question to cover gzip.

Comments

  • Your code extracts all the data from the table in one single expression. The BytesIO object will require 30GB of RAM. Why do you expect *async* operation to magically chunk anything? – Serge Ballesta Jan 30 '21 at 17:47
  • Hi @SergeBallesta : ). I did not assume the async operation is relevant (hence why I did not add an async tag), sorry if that was confusing in my question. I just wanted to confirm whether "the BytesIO object will require 30GB of RAM" or whether there is any streaming (the data gets uploaded gradually from BytesIO to S3). – alt-f4 Jan 30 '21 at 18:00
  • I think you're better off using an actual file (so you can use disk for storage) rather than a 30gb all-in-memory `BytesIO` object. In terms of RAM usage, actually sending the file to AWS shouldn't take that much, since only a few kilobytes of the file will be read for each packet sent, and the entire file won't need to be loaded at once. If you _have_ 30gb of RAM available then this would probably be a performance hit, but otherwise the amount of swap your computer would require to accommodate 30gb of data would probably outweigh that. – Green Cloak Guy Jan 30 '21 at 18:06
  • Hi @GreenCloakGuy. I am happy to accept that as an answer if you post it : ) I have also updated my question and code to use gzip. I am slightly confused whether using gzip will actually help me need less RAM (a rough sketch of your disk-based suggestion is included after these comments). – alt-f4 Jan 31 '21 at 10:26
  • I am moving the question regarding Gzip to a separate question as it was not part of the official requirement [here](https://stackoverflow.com/questions/65970965/are-bytesio-objects-loaded-fully-into-memory-or-streamed) – alt-f4 Jan 31 '21 at 18:19
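
Based on @GreenCloakGuy's suggestion, here is a rough, untested sketch of the disk-backed variant I have in mind (the temporary-file handling and the ".csv.gz" key are just placeholders on my part): the COPY output is gzipped onto disk and only then handed to put_object, so the table never has to sit in RAM.

import asyncio
import gzip
import tempfile

import aiobotocore
import asyncpg


async def disk_backed_demo():
    con = await asyncpg.connect(**postgres_credentials)
    session = aiobotocore.get_session()
    async with session.create_client(**aws_credentials) as client:
        # Spool the COPY output through gzip into a temporary file on disk,
        # so only gzip's small internal buffers live in RAM.
        with tempfile.TemporaryFile() as tmp:
            with gzip.GzipFile(fileobj=tmp, mode='wb') as gz:
                await con.copy_from_query(query='select * from bar', output=gz, format='csv')
            tmp.seek(0)
            # Hand the open file to put_object, which reads it in chunks while
            # uploading. Note that a single PUT is capped at 5 GB, so an object
            # this size would need a multipart upload instead.
            await client.put_object(Bucket="happy_bucket", Key="foo.csv.gz", Body=tmp)
    await con.close()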

0 Answers