If you run this on EC2, network performance varies by instance type and size: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-network-bandwidth.html
A bottleneck can happen at multiple places:
- Network (bandwidth and latency)
- CPU
- Memory
- Local Storage
One can check each of these. CloudWatch Metrics is our friend here.
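To make the CloudWatch check concrete, here is a minimal sketch using boto3's `get_metric_statistics` to pull recent CPU utilization for one instance. The instance ID is a placeholder; the query parameters are built in a plain function and the actual API call (which needs AWS credentials) is kept separate.

```python
from datetime import datetime, timedelta, timezone

def cpu_metric_query(instance_id, minutes=60, period=300):
    """Build the parameters for a CloudWatch CPUUtilization query."""
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": now - timedelta(minutes=minutes),
        "EndTime": now,
        "Period": period,
        "Statistics": ["Average", "Maximum"],
    }

if __name__ == "__main__":
    import boto3  # requires AWS credentials at runtime
    cw = boto3.client("cloudwatch")
    stats = cw.get_metric_statistics(**cpu_metric_query("i-0123456789abcdef0"))
    for point in sorted(stats["Datapoints"], key=lambda d: d["Timestamp"]):
        print(point["Timestamp"], point["Average"], point["Maximum"])
```

The same query shape works for `NetworkIn`/`NetworkOut` and the EBS metrics by changing `MetricName`.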
CPU is the easiest to see and to scale with a bigger instance size.
Memory is a bit harder to observe (EC2 does not publish memory usage to CloudWatch by default; the CloudWatch agent must be installed), but there should be enough of it to keep the document being processed in memory, so the OS does not start swapping.
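One way to act on this before downloading: compare the object's size (from `head_object`'s `ContentLength`, a real S3 API field) against available memory and fall back to streaming when it does not fit. The helper name and safety factor below are illustrative assumptions.

```python
def fits_in_memory(object_size_bytes, available_bytes, safety_factor=0.5):
    """Return True if the object can be loaded whole without risking swap.

    safety_factor leaves headroom for the parser's own allocations:
    a parsed in-memory representation is often larger than the raw
    bytes, so 0.5 is deliberately conservative.
    """
    return object_size_bytes <= available_bytes * safety_factor

# Example: a 3 GiB CSV with 4 GiB of free memory -> stream it instead.
# size = s3.head_object(Bucket="my-bucket", Key="data.csv")["ContentLength"]
three_gib, four_gib = 3 * 1024**3, 4 * 1024**3
print(fits_in_memory(three_gib, four_gib))  # -> False, so stream
```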
Local Storage - I/O can be observed. If the business logic is just to parse a CSV file and write the result to, let's say, another S3 bucket, the file may not need to touch the local disk at all. When fast local scratch storage is needed, the Storage Optimized instance types (https://aws.amazon.com/ec2/instance-types/) come with local instance-store volumes.
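The parse-and-forward case above can be sketched without any local disk: boto3's `get_object` returns a streaming body whose `iter_lines()` yields the file incrementally. The bucket names and the `amount` column are placeholders for illustration; the transform itself is a plain function so it can be exercised without AWS.

```python
import csv
import io

def summarize_csv(lines):
    """Count rows and sum a numeric 'amount' column from an iterable of
    CSV text lines (hypothetical schema, used only as an example)."""
    reader = csv.DictReader(lines)
    total = rows = 0
    for row in reader:
        rows += 1
        total += float(row["amount"])
    return {"rows": rows, "total": total}

if __name__ == "__main__":
    import boto3  # requires AWS credentials at runtime
    s3 = boto3.client("s3")
    body = s3.get_object(Bucket="input-bucket", Key="data.csv")["Body"]
    # iter_lines() streams the object; the whole file is never buffered
    result = summarize_csv(line.decode("utf-8") for line in body.iter_lines())
    s3.put_object(Bucket="output-bucket", Key="summary.json",
                  Body=io.BytesIO(repr(result).encode("utf-8")))
```

For a large *output* object, a multipart upload (`boto3`'s `upload_fileobj`) would replace the single `put_object`.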
Network - the EC2 instance size can be increased, or network-optimized instance types can be used.
Network - the way one connects to S3 matters. Usually, the best approach is to use an S3 VPC endpoint: https://docs.aws.amazon.com/vpc/latest/privatelink/vpc-endpoints-s3.html. The gateway option is free to use. By adopting it, one eliminates the NAT gateway/NAT instance limitations for S3 traffic, and it's even more secure, since the traffic stays inside the AWS network.
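Creating the gateway endpoint is a one-time call. A hedged sketch with boto3's `create_vpc_endpoint` (a real EC2 API); the VPC, route table, and region values are placeholders you would substitute.

```python
def s3_gateway_endpoint_params(vpc_id, route_table_ids, region="us-east-1"):
    """Parameters for a free S3 Gateway endpoint (IDs are placeholders)."""
    return {
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.s3",
        "VpcEndpointType": "Gateway",
        "RouteTableIds": list(route_table_ids),
    }

if __name__ == "__main__":
    import boto3  # requires AWS credentials at runtime
    ec2 = boto3.client("ec2")
    resp = ec2.create_vpc_endpoint(
        **s3_gateway_endpoint_params("vpc-0abc", ["rtb-0def"]))
    print(resp["VpcEndpoint"]["VpcEndpointId"])
```

After this, S3 traffic from the listed route tables goes through the endpoint instead of the NAT path.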
Network - sometimes the S3 bucket is in one region and the compute is in another. S3 supports replication (https://docs.aws.amazon.com/AmazonS3/latest/userguide/replication.html), so the data can be replicated to a bucket in the compute's region, avoiding cross-region latency.
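A minimal replication setup sketch, assuming an existing IAM replication role and versioning enabled on both buckets (both are S3 requirements for replication); the ARNs below are placeholders. It uses the real `put_bucket_replication` S3 API.

```python
def replication_config(role_arn, dest_bucket_arn, prefix=""):
    """Minimal replication configuration (the V2 schema requires Filter,
    Priority and DeleteMarkerReplication); all ARNs are placeholders."""
    return {
        "Role": role_arn,
        "Rules": [{
            "ID": "replicate-to-compute-region",
            "Status": "Enabled",
            "Priority": 1,
            "Filter": {"Prefix": prefix},
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {"Bucket": dest_bucket_arn},
        }],
    }

if __name__ == "__main__":
    import boto3  # requires AWS credentials at runtime
    s3 = boto3.client("s3")
    s3.put_bucket_replication(
        Bucket="source-bucket",
        ReplicationConfiguration=replication_config(
            "arn:aws:iam::123456789012:role/replication-role",
            "arn:aws:s3:::dest-bucket"))
```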
Finally, some form of APM monitoring and code instrumentation can show whether the code itself can also be optimized.
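Even without a full APM product, a tiny timing decorator gives a first signal about where the code spends its time. This is a minimal stand-in for real instrumentation, not a replacement for it; `parse_rows` is just an example workload.

```python
import functools
import time

def timed(fn):
    """Print how long a function takes - a minimal stand-in for APM tracing."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            print(f"{fn.__name__} took {elapsed:.3f}s")
    return wrapper

@timed
def parse_rows(rows):
    # Example workload: split CSV-ish lines into fields
    return [r.split(",") for r in rows]

parse_rows(["a,b", "c,d"])
```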
Thank you.