I have a Lambda function running in a VPC. I use it to query Elasticsearch, update data there, and delete obsolete data. To make these calls, the Lambda has to assume a role, and it calls the STS AssumeRole API for that. Recently, I have been seeing intermittent timeouts whenever I try to fetch credentials. The code is:

final AWSSecurityTokenService stsClient = AWSSecurityTokenServiceClientBuilder.standard()
    .withCredentials(new EnvironmentVariableCredentialsProvider())
    .build();

final STSAssumeRoleSessionCredentialsProvider credentials =
    new STSAssumeRoleSessionCredentialsProvider.Builder(System.getenv(SIM_ROLE_KEY), SIM_SESSION_NAME)
        .withStsClient(stsClient)
        .build();

final String sessionToken = credentials.getCredentials().getSessionToken();

Exact error:

Unable to execute HTTP request: Connect to sts.amazonaws.com:443 [sts.amazonaws.com/209.54.180.124] failed: connect timed out: com.amazonaws.SdkClientException
com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to sts.amazonaws.com:443 [sts.amazonaws.com/209.54.180.124] failed: connect timed out

I want to know what could be causing these intermittent failures and how to fix them. I also want to know whether intermittent timeouts are a common issue for STS calls.

Things I tried:

1) Instead of the global endpoint sts.amazonaws.com, I configured the endpoint to be sts.us-east-1.amazonaws.com, because I am running the Lambda in the us-east-1 region. We still saw the same error.

2) The VPC did not have a VPC endpoint for STS, so I created one. Now it doesn't throw the timeout error. But I am not sure if that is the intended fix: if the missing endpoint were the cause, STS calls would have timed out every time. Without a VPC endpoint, how was it able to connect to sts.amazonaws.com most of the time?

I can provide more information if needed.

More info: the Lambda function has 3 subnets attached, 2 private and 1 public. Route tables for all the subnets:

VPCStack Private Route Table 1:
Destination       Target
10.0.0.0/16       local
0.0.0.0/0         nat-####1
pl-63a5400a       vpce-####3

VPCStack Private Route Table 2:
Destination       Target
10.0.0.0/16       local
0.0.0.0/0         nat-####2
pl-63a5400a       vpce-####4

VPCStack Public Route Table:
Destination       Target
10.0.0.0/16       local
0.0.0.0/0         igw-####5
pl-63a5400a       vpce-####

Thanks.

  • Is the Lambda function configured to attach to multiple subnets of your VPC? Do all of those subnets have identical routing? – jarmod Jan 31 '22 at 14:44
  • @jarmod, I am fairly new to VPC. Can you please tell me which subnets I should look at, private or public? I also checked, and there are multiple subnets attached to the VPC but only one of them is "main". Do you want me to check the routes in the route tables for those subnets? Thanks in advance. – Mitul Khatri Jan 31 '22 at 16:06
  • Typically you would deploy Elasticsearch into 2+ private subnets and you would [configure](https://docs.aws.amazon.com/lambda/latest/dg/configuration-vpc.html) your Lambda functions to access resources in those same private subnets. For a Lambda function in a VPC to access an AWS service such as STS, it needs a network route to the service (either being routed via NAT Gateway or using a configured VPC endpoint to the service). Your original symptoms suggest that your Lambda was in multiple subnets but one or more of those subnets did not have a route to NAT. Is that possible? – jarmod Jan 31 '22 at 16:36
  • Also, FYI that your Lambda function is automatically launched with the IAM role that you configured it with. You may not need to explicitly assume a second IAM role just for Elasticsearch access if you can leverage the configured IAM role that the Lambda launched with. – jarmod Jan 31 '22 at 16:37
  • Translation of @jarmod's info: Only attach the Lambda function to **private subnets** within the VPC. – John Rotenstein Feb 01 '22 at 02:04
  • @jarmod, yes, my Lambda has multiple (3) subnets attached: 2 private and 1 public. The public subnet does not have a NAT target but has an IGW target. Can you also explain how that would cause the intermittent failures for STS calls? – Mitul Khatri Feb 01 '22 at 02:57
  • @jarmod, I updated the question description as well, for better understanding. – Mitul Khatri Feb 01 '22 at 03:10
  • Was there a reason to configure the Lambda in a public subnet? It [cannot access](https://stackoverflow.com/questions/52992085/why-cant-an-aws-lambda-function-inside-a-public-subnet-in-a-vpc-connect-to-the) the public internet or AWS services from there. Configure it in private subnets only. And then to provide connectivity to STS, either a) use a VPC Endpoint to STS or b) configure a NAT and IGW in the VPC and route STS traffic from your Lambda's subnets to the NAT. Hint: VPC Endpoints are preferred here. – jarmod Feb 01 '22 at 14:38
  • @JohnRotenstein thank you for applying, and reminding me of the value of, Occam's razor. – jarmod Feb 01 '22 at 15:05

1 Answer

When you configure a Lambda function for VPC access, configure it to connect to private subnets only.

The original cause of your intermittent connectivity issues to STS is that you configured the Lambda function to connect to both private and public subnets:

  1. Lambda functions cannot reach the internet if they are connected to a public subnet. Their network interfaces only get private IP addresses, and an internet gateway only carries traffic for interfaces that have a public IP address.
  2. Lambda functions cannot reach AWS services if they are connected to a public subnet, unless you have configured a VPC Endpoint for that AWS service.
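
To see which of these cases a given invocation is hitting, a plain TCP connect probe can help. The following is a minimal sketch using only the JDK; the class name `StsConnectProbe`, the 3-second timeout, and the default host are illustrative assumptions, not part of the original code. It returns the local address the connection was made from, which should fall inside the CIDR range of one of the subnets the function is attached to.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical diagnostic helper, not part of the AWS SDK.
public class StsConnectProbe {

    /**
     * Tries to open a TCP connection to host:port within timeoutMillis.
     * Returns the local IP address used for the connection on success,
     * or null if the connection fails (timeout, refusal, DNS failure).
     */
    static String probe(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return socket.getLocalAddress().getHostAddress();
        } catch (IOException e) {
            // SocketTimeoutException and UnknownHostException are both IOExceptions.
            return null;
        }
    }

    public static void main(String[] args) {
        // Assumed default: the regional STS endpoint mentioned in the question.
        String host = args.length > 0 ? args[0] : "sts.us-east-1.amazonaws.com";
        String local = probe(host, 443, 3000);
        System.out.println(local == null
            ? "connect to " + host + " failed"
            : "connected to " + host + " from " + local);
    }
}
```

Calling `probe` at the start of the handler and logging the result should show which local addresses, and therefore which subnets, fail to connect.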

When you introduced the VPC Endpoint, it worked correctly because all traffic destined for STS routed via the VPC Endpoint and no longer had to rely on a route via your NAT. Before that, routing via your NAT worked whenever the function ran in one of your private subnets, but not when it ran in the public subnet. That is why the failures were intermittent rather than constant.
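
The reasoning in this answer can be condensed into a toy model. This sketch is a deliberate simplification, not how AWS evaluates routes: all names are hypothetical, it only looks at each subnet's default route, and it assumes an STS VPC endpoint (with private DNS enabled and a security group allowing port 443) is reachable from every subnet in the VPC.

```java
import java.util.List;

// Toy model of the question's setup; every name here is hypothetical.
public class SubnetReachability {

    enum DefaultRoute { NAT_GATEWAY, INTERNET_GATEWAY }

    static final class Subnet {
        final String name;
        final DefaultRoute defaultRoute;

        Subnet(String name, DefaultRoute defaultRoute) {
            this.name = name;
            this.defaultRoute = defaultRoute;
        }
    }

    /**
     * A Lambda network interface has no public IP address, so a default route
     * to an internet gateway is unusable for it. It can reach STS only through
     * a NAT gateway or through a VPC endpoint for STS.
     */
    static boolean canReachSts(Subnet subnet, boolean stsVpcEndpoint) {
        if (stsVpcEndpoint) {
            return true; // assumption: the endpoint is reachable from all subnets
        }
        return subnet.defaultRoute == DefaultRoute.NAT_GATEWAY;
    }

    static long reachableCount(List<Subnet> subnets, boolean stsVpcEndpoint) {
        return subnets.stream().filter(s -> canReachSts(s, stsVpcEndpoint)).count();
    }

    public static void main(String[] args) {
        // The three subnets described in the question.
        List<Subnet> subnets = List.of(
            new Subnet("private-1", DefaultRoute.NAT_GATEWAY),
            new Subnet("private-2", DefaultRoute.NAT_GATEWAY),
            new Subnet("public", DefaultRoute.INTERNET_GATEWAY));

        for (boolean endpoint : new boolean[] {false, true}) {
            System.out.println("STS VPC endpoint=" + endpoint + ": "
                + reachableCount(subnets, endpoint) + " of " + subnets.size()
                + " subnets can reach STS");
        }
    }
}
```

For the question's three subnets the model reports 2 of 3 without the endpoint and 3 of 3 with it, which matches the observed behaviour: occasional timeouts before the endpoint was added and none afterwards.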

jarmod