
I did a load test of a NAT Gateway in AWS and reached a much lower request limit than described in the docs. According to the docs, the NAT Gateway is supposed to support ~900 requests per second, but with my configuration I saw that ~0.04% of the requests go unanswered when running only ~300 requests per second.

I run a Node.js app on an ECS cluster and have the ability to configure the number of requests per second. The NAT Gateway works fine for around 1 minute, and then my app starts to get timeouts on a few requests.
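Roughly, each app task runs a loop like the following. This is a simplified sketch: the target URL, batch size, and timeout are placeholders, and the real app reads its rate from configuration.

```js
const superagent = require('superagent');

// Placeholders -- the real values come from the task's configuration.
const TARGET_URL = 'https://example.com/health';
const REQUESTS_PER_INTERVAL = 10;

async function fireBatch() {
  const requests = [];
  for (let i = 0; i < REQUESTS_PER_INTERVAL; i++) {
    requests.push(
      superagent
        .get(TARGET_URL)
        .set('Connection', 'close')   // close the connection after each request
        .timeout({ response: 5000 }) // treat a slow response as a timeout
        .then(() => 'ok')
        .catch((err) => {
          console.error('request failed:', err.message);
          return 'failed';
        })
    );
  }
  const results = await Promise.all(requests);
  const failed = results.filter((r) => r === 'failed').length;
  if (failed > 0) {
    console.log(`${failed}/${REQUESTS_PER_INTERVAL} requests failed this interval`);
  }
}

// Fire a batch of requests every second, per the test configuration.
setInterval(fireBatch, 1000);
```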

AWS does not allow access to the underlying NAT Gateway machines, and the CloudWatch metrics look fine.

In general, I am looking for a static-IP egress solution that will withstand high load. Has anyone here experienced something similar?

Oron Bendavid
  • How exactly did you test this? If you can give more information, it'll be easier to understand why you may have seen what you've seen. – Chris Williams Jun 06 '20 at 14:57
  • I ran a Node.js app on an ECS cluster; I have the ability to configure the number of requests per second. The NAT Gateway works fine for around a minute, and then my app starts to get timeouts on a few requests. – Oron Bendavid Jun 06 '20 at 15:41
  • Right, so firstly bear in mind that the connections need to be terminated, otherwise you'll hit the 55,000 max connections. Secondly, there is a throughput limit per node in the cluster; if that caps out, the instance, not the NAT GW, will be the throughput problem. Can you elaborate on what your test setup looks like? Are you multi-node? Are you terminating connections immediately? – Chris Williams Jun 06 '20 at 15:44
  • My Node.js app uses superagent with 'Connection: close', and the ECS cluster runs a few EC2 instances. Each Node.js app issues several requests (simple GET requests) per second, with a 1-second delay between each interval. – Oron Bendavid Jun 06 '20 at 15:52
  • Check the [`ErrorPortAllocation` CloudWatch metric](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-nat-gateway-cloudwatch.html) for your NAT Gateway. The ~900/second value is for connections to a single destination IP address, but it is not a simple rate or capacity limit; how the connections are torn down by the devices on each side determines what you can accomplish over time/sustained. That error counter should be incrementing if the NAT Gateway has any role in the undesirable behavior you are observing. Please advise what you observe. – Michael - sqlbot Jun 06 '20 at 19:46
  • Hi, thanks for your reply, but as I mentioned in the body of the question, all the CloudWatch metrics look fine (including ErrorPortAllocation; one way to query it is sketched below). I've also added flow logs with a REJECT filter, but I can't see any relevant information. – Oron Bendavid Jun 06 '20 at 20:48
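For reference, a minimal sketch of one way to pull the `ErrorPortAllocation` metric with the Node.js AWS SDK. The region, time window, and NAT gateway ID here are placeholders.

```js
const AWS = require('aws-sdk');

// Placeholder region -- substitute your own.
const cloudwatch = new AWS.CloudWatch({ region: 'us-east-1' });

cloudwatch
  .getMetricStatistics({
    Namespace: 'AWS/NATGateway',
    MetricName: 'ErrorPortAllocation',
    // Placeholder NAT gateway ID.
    Dimensions: [{ Name: 'NatGatewayId', Value: 'nat-0123456789abcdef0' }],
    StartTime: new Date(Date.now() - 60 * 60 * 1000), // last hour
    EndTime: new Date(),
    Period: 60, // one datapoint per minute
    Statistics: ['Sum'],
  })
  .promise()
  .then((data) => {
    // A non-zero Sum on any datapoint means the NAT Gateway failed to
    // allocate a source port, i.e. it dropped new connections.
    const datapoints = data.Datapoints.sort((a, b) => a.Timestamp - b.Timestamp);
    for (const dp of datapoints) {
      console.log(dp.Timestamp.toISOString(), dp.Sum);
    }
  })
  .catch(console.error);
```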

0 Answers