
I am writing a cron job that takes a Postgres table and uploads it to S3.

I am planning to do the following (a rough sketch in code follows the list):

  1. Convert the table to CSV and store it on disk.

  2. Take the file from disk and upload it to S3 using a multipart upload.

  3. Delete the file from disk.
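
To make this concrete, here is a rough sketch of what I have in mind, assuming pgjdbc's CopyManager and the AWS SDK for Java v1 (TransferManager handles the multipart upload); the table, bucket, and connection details are placeholders:

import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
import org.postgresql.copy.CopyManager;
import org.postgresql.core.BaseConnection;

import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;

public class ExportJob {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "password");

        // 1. COPY the table out as CSV to a file on disk.
        Path csv = Files.createTempFile("mytable-", ".csv");
        try (OutputStream out = Files.newOutputStream(csv)) {
            new CopyManager(conn.unwrap(BaseConnection.class))
                    .copyOut("COPY mytable TO STDOUT WITH (FORMAT csv, HEADER)", out);
        }

        // 2. Upload to S3; TransferManager uses multipart uploads for large files.
        TransferManager tm = TransferManagerBuilder.standard().build();
        tm.upload("mybucket", "mytable.csv", csv.toFile()).waitForCompletion();
        tm.shutdownNow();

        // 3. Delete the file from disk.
        Files.deleteIfExists(csv);
    }
}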

Questions:

  1. Should I copy the file to disk? Is there any way I can temporarily copy to disk without explicitly deleting the file once I have uploaded it?

  2. Does doing this in memory make more sense? Postgres COPY only writes to an OutputStream, and AFAIK there is no way to upload an output stream to S3.

  • How big is the output file? Have you considered doing this as an AWS Lambda function instead? Which language are you planning to use? – John Rotenstein Mar 24 '20 at 07:03
  • Planning to use Java. The output file is not larger than 2GB. I considered using Lambda, but it adds to infrastructure complexity, and the team and I are not familiar with it. – Joe Mar 24 '20 at 16:35

1 Answer


All of this could be done in memory, assuming you have enough memory for the whole table; otherwise, read and upload the table in segments.

If you go with Python, the S3 MultipartUpload documentation says that each part can be supplied as bytes or a file; other languages' SDKs should be similar. That way you can read the whole table (or a segment of it), serialize it to CSV strings, encode those strings to bytes, and upload them as parts of a MultipartUpload. No files on disk are needed.
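
Sketching that in Java (since that is your language) with the low-level multipart API of the AWS SDK v1 could look roughly like this; fetchNextCsvChunk is a hypothetical stand-in for paging through the table, and note that every part except the last must be at least 5 MB:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.*;

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SegmentedUpload {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        String bucket = "mybucket", key = "mytable.csv"; // placeholders

        String uploadId = s3.initiateMultipartUpload(
                new InitiateMultipartUploadRequest(bucket, key)).getUploadId();
        List<PartETag> etags = new ArrayList<>();

        int partNumber = 1;
        for (String chunk; (chunk = fetchNextCsvChunk()) != null; partNumber++) {
            byte[] bytes = chunk.getBytes(StandardCharsets.UTF_8);
            UploadPartResult result = s3.uploadPart(new UploadPartRequest()
                    .withBucketName(bucket).withKey(key)
                    .withUploadId(uploadId).withPartNumber(partNumber)
                    .withInputStream(new ByteArrayInputStream(bytes))
                    .withPartSize(bytes.length));
            etags.add(result.getPartETag());
        }

        s3.completeMultipartUpload(
                new CompleteMultipartUploadRequest(bucket, key, uploadId, etags));
    }

    // Hypothetical: replace with real paging over the table (e.g. keyset pagination)
    // plus CSV serialization; returns null when the table is exhausted.
    private static String fetchNextCsvChunk() {
        return null;
    }
}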

If you go with Java, you can use an InputStream as the source for your upload. To convert your OutputStream into an InputStream, have a look at this answer.

AmazonS3 s3 = new AmazonS3Client(new BasicAWSCredentials("accessKey", "secretKey"));
ObjectMetadata metadata = new ObjectMetadata(); // required by the stream-based constructor
s3.putObject(new PutObjectRequest("mybucket", "myfile", inputStream, metadata));
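
If you want to avoid both the disk and holding the whole table at once, a rough sketch (not the linked answer verbatim) is to pipe CopyManager's output straight into the upload; this assumes pgjdbc and the SDK v1 TransferManager, with placeholder names:

import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
import com.amazonaws.services.s3.transfer.Upload;
import org.postgresql.copy.CopyManager;
import org.postgresql.core.BaseConnection;

import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.sql.Connection;
import java.sql.DriverManager;

public class StreamingExport {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "password"); // placeholders

        // Pipe: COPY writes CSV into one end, the upload reads from the other.
        PipedInputStream in = new PipedInputStream(1 << 20); // 1 MiB pipe buffer
        PipedOutputStream out = new PipedOutputStream(in);

        TransferManager tm = TransferManagerBuilder.standard().build();
        ObjectMetadata meta = new ObjectMetadata();
        // Caveat: without a content length the SDK warns it will buffer the stream
        // contents in memory; call meta.setContentLength(...) if you know the size.
        Upload upload = tm.upload("mybucket", "mytable.csv", in, meta);

        // TransferManager reads on its own worker threads, so we can write here.
        try {
            new CopyManager(conn.unwrap(BaseConnection.class))
                    .copyOut("COPY mytable TO STDOUT WITH (FORMAT csv, HEADER)", out);
        } finally {
            out.close(); // end-of-stream tells the upload the data is complete
        }
        upload.waitForCompletion();
        tm.shutdownNow();
    }
}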

If, however, you just need a backup of your table, then I suggest using pg_dump and uploading the resulting file to S3.
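
For example, something along these lines (a sketch; the database, table, and bucket names are placeholders, and "aws s3 cp -" reads from stdin):

pg_dump --table=mytable mydb | gzip | aws s3 cp - s3://mybucket/mytable.sql.gz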

michaelbahr
  • The Postgres COPY command reads from a table and writes to an OutputStream: https://jdbc.postgresql.org/documentation/publicapi/org/postgresql/copy/CopyManager.html. Is there a way to upload an output stream to S3 using multipart upload? – Joe Mar 24 '20 at 16:38
  • If I decide to write to disk, does the approach above work? My reasoning is that my EC2 box has much more disk space than RAM, so it might be better just to write to disk and delete the file afterwards. – Joe Mar 24 '20 at 17:34
  • That's totally fine :) – michaelbahr Mar 25 '20 at 08:56