I spent some time fiddling with the details of the AWS S3 extension for Postgres described here: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/postgresql-s3-export.html#postgresql-s3-export-access-bucket (extension configuration, roles, policies, function input details).
I want to easily export and then re-import huge tables for testing purposes (indexes, generated columns, partitions, etc.) in order to optimize database performance.
I am using this extension because I want to avoid using my laptop to store the file, with commands like the following, which involve a lot of network I/O and suffer from slow internet connections, broken pipes when the connection is killed by the operating system after a while, and other problems that come up with huge tables:
# store CSV from S3 to local
aws s3 cp s3://my_bucket/my_sub_path/my_file.csv /my_local_directory/my_file.csv
# import from local CSV to AWS RDS Aurora PostgreSQL
psql -h my_rds.amazonaws.com -U my_username -d my_dbname -c '\COPY table FROM ''my_file.csv'' CSV HEADER'
I managed to export a very big table (160GB) into CSV files to S3 with:
SELECT * FROM aws_s3.query_export_to_s3(
    'SELECT * FROM my_schema.my_large_table',
    aws_commons.create_s3_uri(
        'my_bucket/my_subpath',
        'my_file.csv',
        'eu-central-1'
    ),
    options := 'format csv'
);
However, this ends up as lots of "part files" in S3:
- the first one has the CSV filename I specified: my_file.csv
- all the others are named my_file.csv_part2, my_file.csv_part3, ... my_file.csv_part20, and so on
Now, I don't think this is a problem as long as I am able to import the CSV data back into another AWS RDS Aurora (PostgreSQL) instance. However, I am not sure what strategy to apply here: is it better to keep all these CSV files, or can the export be configured to produce a single huge CSV file (160GB)?
Now the import stuff (https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PostgreSQL.S3Import.html):
It turns out I have to import all these "part files" with PL/pgSQL, but then I get lost in the details of how to format the strings for the S3 paths, and in general I see all sorts of errors (on both export and import). Importing one file takes around 20 minutes, so it's quite frustrating to figure out what is going wrong.
- What's wrong with the source code / error below?
- Is there a better way to handle all this export/import at scale (160GB tables)?
DO $$
DECLARE
    my_csv_s3_sub_path text;
BEGIN
    FOR cnt IN 2..26 LOOP
        my_csv_s3_sub_path := 'my_subpath/my_file.csv_part' || cnt;
        RAISE NOTICE '% START loading CSV file % from S3', now(), cnt;
        SELECT aws_s3.table_import_from_s3(
            'my_schema.my_large_table_new',
            '',
            '(format csv)',
            aws_commons.create_s3_uri(
                'my_bucket',
                my_csv_s3_sub_path,
                'eu-central-1'
            )
        );
        RAISE NOTICE '% STOP loading CSV file % from S3', now(), cnt;
    END LOOP;
END; $$
The code above gives:
SQL Error [42601]: ERROR: query has no destination for result data
Hint: If you want to discard the results of a SELECT, use PERFORM instead.
Where: PL/pgSQL function inline_code_block line 8 at SQL statement
I think it's related to variables and string interpolation, because I need to dynamically generate the name of the CSV file in S3 that I pass to the Postgres AWS extension.
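Based on the hint in that error, my current guess is that the call inside the loop needs PERFORM instead of SELECT, since I don't use the result of table_import_from_s3. This is only a sketch of what I think the corrected loop would look like (same placeholder names as above, not yet verified end to end):
DO $$
DECLARE
    my_csv_s3_sub_path text;
BEGIN
    FOR cnt IN 2..26 LOOP
        my_csv_s3_sub_path := 'my_subpath/my_file.csv_part' || cnt;
        RAISE NOTICE '% START loading CSV file % from S3', now(), cnt;
        -- PERFORM runs the function and discards its result row,
        -- which is what the 42601 hint suggests for a SELECT whose output is not needed
        PERFORM aws_s3.table_import_from_s3(
            'my_schema.my_large_table_new',
            '',
            '(format csv)',
            aws_commons.create_s3_uri(
                'my_bucket',
                my_csv_s3_sub_path,
                'eu-central-1'
            )
        );
        RAISE NOTICE '% STOP loading CSV file % from S3', now(), cnt;
    END LOOP;
END; $$
Alternatively, SELECT ... INTO a text variable should also work if I want to log the status message that table_import_from_s3 returns.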
But I had all sorts of other errors before, e.g. some export/import inconsistency in the syntax around the S3 bucket sub-path that was leading the Postgres AWS S3 extension to throw an HTTP 400 error:
SQL Error [XX000]: ERROR: HTTP 400. Check your arguments and try again. Where: SQL function "table_import_from_s3" statement 1
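My guess about that one is that the bucket and the sub-path have to be split consistently between export and import: the AWS examples pass only the bucket name as the first argument of aws_commons.create_s3_uri and fold the sub-path into the file name, so an export consistent with the import loop above would presumably look like this (again a sketch with the same placeholder names, not verified):
SELECT * FROM aws_s3.query_export_to_s3(
    'SELECT * FROM my_schema.my_large_table',
    aws_commons.create_s3_uri(
        'my_bucket',                -- bucket name only
        'my_subpath/my_file.csv',   -- sub-path folded into the file name
        'eu-central-1'
    ),
    options := 'format csv'
);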
Is there a better alternative for exporting/importing huge tables from/to AWS RDS Aurora PostgreSQL?