If you run this on EC2, network performance varies by instance type and size: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-network-bandwidth.html
A bottleneck can happen at multiple places:
- Network (bandwidth and latency)
- CPU
- Memory
- Local Storage
One can check each of these. CloudWatch Metrics is our friend here.
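To make the CloudWatch check concrete, here is a minimal sketch using boto3's `get_metric_statistics` to pull recent CPU utilization for one instance. The instance ID is a placeholder; the query parameters are built in a plain function and the actual API call (which needs AWS credentials) is kept separate.

```python
from datetime import datetime, timedelta, timezone

def cpu_metric_query(instance_id, minutes=60, period=300):
    """Build the parameters for a CloudWatch CPUUtilization query."""
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": now - timedelta(minutes=minutes),
        "EndTime": now,
        "Period": period,
        "Statistics": ["Average", "Maximum"],
    }

if __name__ == "__main__":
    import boto3  # requires AWS credentials at runtime
    cw = boto3.client("cloudwatch")
    stats = cw.get_metric_statistics(**cpu_metric_query("i-0123456789abcdef0"))
    for point in sorted(stats["Datapoints"], key=lambda d: d["Timestamp"]):
        print(point["Timestamp"], point["Average"], point["Maximum"])
```

The same query shape works for `NetworkIn`/`NetworkOut` and the EBS metrics by changing `MetricName`.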
CPU is the easiest to see and to scale with a bigger instance size.
Memory is a bit harder to observe (EC2 does not publish memory usage to CloudWatch by default; the CloudWatch agent must be installed), but there should be enough of it to keep the document being processed in memory, so the OS does not start swapping.
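One way to act on this before downloading: compare the object's size (from `head_object`'s `ContentLength`, a real S3 API field) against available memory and fall back to streaming when it does not fit. The helper name and safety factor below are illustrative assumptions.

```python
def fits_in_memory(object_size_bytes, available_bytes, safety_factor=0.5):
    """Return True if the object can be loaded whole without risking swap.

    safety_factor leaves headroom for the parser's own allocations:
    a parsed in-memory representation is often larger than the raw
    bytes, so 0.5 is deliberately conservative.
    """
    return object_size_bytes <= available_bytes * safety_factor

# Example: a 3 GiB CSV with 4 GiB of free memory -> stream it instead.
# size = s3.head_object(Bucket="my-bucket", Key="data.csv")["ContentLength"]
three_gib, four_gib = 3 * 1024**3, 4 * 1024**3
print(fits_in_memory(three_gib, four_gib))  # -> False, so stream
```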
Local Storage - I/O can be observed. If the business logic is just to parse a CSV file and write the result to, let's say, another S3 bucket, the file may not need to touch the local disk at all. When fast local scratch storage is needed, the Storage Optimized instance types (https://aws.amazon.com/ec2/instance-types/) come with local instance-store volumes.
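The parse-and-forward case above can be sketched without any local disk: boto3's `get_object` returns a streaming body whose `iter_lines()` yields the file incrementally. The bucket names and the `amount` column are placeholders for illustration; the transform itself is a plain function so it can be exercised without AWS.

```python
import csv
import io

def summarize_csv(lines):
    """Count rows and sum a numeric 'amount' column from an iterable of
    CSV text lines (hypothetical schema, used only as an example)."""
    reader = csv.DictReader(lines)
    total = rows = 0
    for row in reader:
        rows += 1
        total += float(row["amount"])
    return {"rows": rows, "total": total}

if __name__ == "__main__":
    import boto3  # requires AWS credentials at runtime
    s3 = boto3.client("s3")
    body = s3.get_object(Bucket="input-bucket", Key="data.csv")["Body"]
    # iter_lines() streams the object; the whole file is never buffered
    result = summarize_csv(line.decode("utf-8") for line in body.iter_lines())
    s3.put_object(Bucket="output-bucket", Key="summary.json",
                  Body=io.BytesIO(repr(result).encode("utf-8")))
```

For a large *output* object, a multipart upload (`boto3`'s `upload_fileobj`) would replace the single `put_object`.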
Network - the EC2 instance size can be increased, or network-optimized instance types can be used.
Network - the way one connects to S3 matters. Usually, the best approach is to use an S3 VPC endpoint: https://docs.aws.amazon.com/vpc/latest/privatelink/vpc-endpoints-s3.html. The gateway option is free to use. By adopting it, one eliminates the NAT gateway/NAT instance limitations for S3 traffic, and it's even more secure, since the traffic stays inside the AWS network.
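Creating the gateway endpoint is a one-time call. A hedged sketch with boto3's `create_vpc_endpoint` (a real EC2 API); the VPC, route table, and region values are placeholders you would substitute.

```python
def s3_gateway_endpoint_params(vpc_id, route_table_ids, region="us-east-1"):
    """Parameters for a free S3 Gateway endpoint (IDs are placeholders)."""
    return {
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.s3",
        "VpcEndpointType": "Gateway",
        "RouteTableIds": list(route_table_ids),
    }

if __name__ == "__main__":
    import boto3  # requires AWS credentials at runtime
    ec2 = boto3.client("ec2")
    resp = ec2.create_vpc_endpoint(
        **s3_gateway_endpoint_params("vpc-0abc", ["rtb-0def"]))
    print(resp["VpcEndpoint"]["VpcEndpointId"])
```

After this, S3 traffic from the listed route tables goes through the endpoint instead of the NAT path.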
Network - sometimes the S3 bucket is in one region and the compute is in another. S3 supports replication (https://docs.aws.amazon.com/AmazonS3/latest/userguide/replication.html), so the data can be replicated to a bucket in the compute's region, avoiding cross-region latency.
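A minimal replication setup sketch, assuming an existing IAM replication role and versioning enabled on both buckets (both are S3 requirements for replication); the ARNs below are placeholders. It uses the real `put_bucket_replication` S3 API.

```python
def replication_config(role_arn, dest_bucket_arn, prefix=""):
    """Minimal replication configuration (the V2 schema requires Filter,
    Priority and DeleteMarkerReplication); all ARNs are placeholders."""
    return {
        "Role": role_arn,
        "Rules": [{
            "ID": "replicate-to-compute-region",
            "Status": "Enabled",
            "Priority": 1,
            "Filter": {"Prefix": prefix},
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {"Bucket": dest_bucket_arn},
        }],
    }

if __name__ == "__main__":
    import boto3  # requires AWS credentials at runtime
    s3 = boto3.client("s3")
    s3.put_bucket_replication(
        Bucket="source-bucket",
        ReplicationConfiguration=replication_config(
            "arn:aws:iam::123456789012:role/replication-role",
            "arn:aws:s3:::dest-bucket"))
```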
Finally, some form of APM monitoring and code instrumentation can show whether the code itself can also be optimized.
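Even without a full APM product, a tiny timing decorator gives a first signal about where the code spends its time. This is a minimal stand-in for real instrumentation, not a replacement for it; `parse_rows` is just an example workload.

```python
import functools
import time

def timed(fn):
    """Print how long a function takes - a minimal stand-in for APM tracing."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            print(f"{fn.__name__} took {elapsed:.3f}s")
    return wrapper

@timed
def parse_rows(rows):
    # Example workload: split CSV-ish lines into fields
    return [r.split(",") for r in rows]

parse_rows(["a,b", "c,d"])
```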
Thank you.