I have a Python Lambda function that needs to call the AWS API. When it is not associated with a VPC subnet, it works. But when it is associated with a VPC subnet, it fails with botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "https://ec2.us-east-1.amazonaws.com/". I've seen this kind of problem described here and here, usually caused by a missing NAT gateway route. However, I have all the correct "pieces" in place and it still doesn't work.
What I have is:
- A Lambda function whose VPC configuration creates two logical ENIs:
  - two private subnets, one in each AZ
  - one security group that is fully permissive (all inbound/outbound traffic allowed)
- The private subnets have custom route tables with a route sending 0.0.0.0/0 to a NAT Gateway.
- Each NAT Gateway is in a public subnet (one per AZ) whose custom route table sends 0.0.0.0/0 to the VPC's IGW.
- All the subnets are associated with a NACL that allows all inbound and outbound traffic.
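To rule out a mistyped route, the routing pieces above can be sanity-checked programmatically with a small helper like this (a sketch; the helper name and sample IDs are made up, but the dict shape follows one entry of a `describe_route_tables()` response):

```python
def default_route_target(route_table):
    """Return the target of the 0.0.0.0/0 route in one 'RouteTables' entry
    of a describe_route_tables() response, or None if there is none."""
    for route in route_table.get("Routes", []):
        if route.get("DestinationCidrBlock") == "0.0.0.0/0":
            # A private-subnet table should name a NAT gateway here;
            # a public-subnet table should name the IGW.
            return (route.get("NatGatewayId")
                    or route.get("GatewayId")
                    or route.get("TransitGatewayId"))
    return None

# Shape of one 'RouteTables' entry (IDs here are invented):
sample = {
    "Routes": [
        {"DestinationCidrBlock": "10.136.0.0/16", "GatewayId": "local"},
        {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0abc123"},
    ]
}
# default_route_target(sample) returns "nat-0abc123"
```

In the real Lambda this would be fed from `ec2_client.describe_route_tables(...)['RouteTables']` — though of course that is exactly the call that's failing here.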
When I inspect the VPC Flow Logs, I see that the Lambda ENIs are successfully originating DNS requests (port 53), like this one:
2 405857719141 eni-03bb24a034d226e5c 10.136.95.104 10.136.7.233 38109 53 17 1 73 1571250675 1571250733 ACCEPT OK
There are no other VPC flow log records besides this; nothing indicating "REJECT". My actual Python code, which works when the Lambda is not associated with a VPC, looks something like this:
    import os
    import logging

    import boto3
    from botocore.client import Config
    from botocore.session import Session

    logger = logging.getLogger()
    logger.setLevel(logging.INFO)

    def lambda_handler(event, context):
        logger.info('Create Session')
        s = Session()
        logger.info('Session Created')
        logger.info('fetching client')
        ec2_res = boto3.resource('ec2')
        logger.info('got vpc resource')
        # I've tried different approaches to creating a client
        ec2_client = s.create_client('ec2', config=Config(connect_timeout=45, read_timeout=45, retries={'max_attempts': 0}))
        #ec2_client = boto3.client('ec2', config=config)
        #ec2_client = boto3.client('ec2', endpoint_url="https://aws.amazon.com/ec2", config=config)
        #ec2_client = boto3.client('ec2', endpoint_url=endpoint)
        logger.info('fetched_client')
        route_table_id = os.environ['fromTGWRouteTableId']
        logger.info('got route table id from environment')
        try:
            logger.info(f'route table(s): {route_table_id}')
            # this request will throw an exception in 40 seconds
            route_table = ec2_client.describe_route_tables(RouteTableIds=[route_table_id])
            logger.info('got client response for route_tables')
            rt = route_table['RouteTables'][0]
            logger.info(f"The RT ID is: {rt['RouteTableId']}")
        except Exception as e:
            logger.info(f'{type(e)}')
            logger.info(f'{e}')
        return
I had to tune the Lambda timeout and the boto3 client timeouts just right to actually capture the error; anything else would simply result in a Lambda timeout. Here are the CloudWatch log entries for the Lambda:
START RequestId: a3ce5c07-f5ec-4b91-b6d8-c94e05fbecc9 Version: $LATEST
[INFO] 2019-10-16T20:22:38.914Z a3ce5c07-f5ec-4b91-b6d8-c94e05fbecc9 Create Session
[INFO] 2019-10-16T20:22:39.36Z a3ce5c07-f5ec-4b91-b6d8-c94e05fbecc9 Session Created
[INFO] 2019-10-16T20:22:39.92Z a3ce5c07-f5ec-4b91-b6d8-c94e05fbecc9 fetching client
[INFO] 2019-10-16T20:22:39.94Z a3ce5c07-f5ec-4b91-b6d8-c94e05fbecc9 Found credentials in environment variables.
[INFO] 2019-10-16T20:22:40.33Z a3ce5c07-f5ec-4b91-b6d8-c94e05fbecc9 got vpc resource
[INFO] 2019-10-16T20:22:40.33Z a3ce5c07-f5ec-4b91-b6d8-c94e05fbecc9 fetched_client
[INFO] 2019-10-16T20:22:40.33Z a3ce5c07-f5ec-4b91-b6d8-c94e05fbecc9 got route table id from environment
[INFO] 2019-10-16T20:22:40.33Z a3ce5c07-f5ec-4b91-b6d8-c94e05fbecc9 route table(s):rtb-0d92f4db98072d6fc
[INFO] 2019-10-16T20:22:49.493Z bd2bf6b7-2fa6-46ea-8115-cb830cb07f32 <class 'botocore.exceptions.EndpointConnectionError'>
[INFO] 2019-10-16T20:22:49.493Z bd2bf6b7-2fa6-46ea-8115-cb830cb07f32 Could not connect to the endpoint URL: "https://ec2.us-east-1.amazonaws.com/"
END RequestId: bd2bf6b7-2fa6-46ea-8115-cb830cb07f32
REPORT RequestId: bd2bf6b7-2fa6-46ea-8115-cb830cb07f32 Duration: 40960.97 ms Billed Duration: 41000 ms Memory Size: 128 MB Max Memory Used: 83 MB
2 unknown eni-07003b087845964ff - - - - - - - 1571257388 1571257400 - NODATA
Any ideas of what I'm overlooking?
Update
In my Python code, I've added the following test:

    import urllib.request

    contents = urllib.request.urlopen("https://google.com").readline()
    logger.info(f'http response: {contents}')

The above throws a URLError with the message urlopen error [Errno -3] Temporary failure in name resolution.
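To separate name resolution from HTTP connectivity, a narrower probe can be used (a sketch; `can_resolve` is just an illustrative helper name):

```python
import socket

def can_resolve(hostname, port=443):
    """Return True if the runtime's resolver can resolve hostname."""
    try:
        socket.getaddrinfo(hostname, port)
        return True
    except socket.gaierror:
        return False
```

In the failing environment this should come back False for ec2.us-east-1.amazonaws.com, which would match the "Temporary failure in name resolution" above: the TCP layer never even gets an address to connect to.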
I then created an Ubuntu EC2 instance in a public subnet of my VPC. A ping test to google.com failed with "unknown host". If I explicitly provide a public internet IP address to ping, then it works.
Likewise, host and dig failed, as shown:
ubuntu@ip-10-136-80-220:/etc$ host google.com
;; connection timed out; no servers could be reached
ubuntu@ip-10-136-80-220:/etc$ dig google.com
; <<>> DiG 9.10.3-P4-Ubuntu <<>> google.com
;; global options: +cmd
;; connection timed out; no servers could be reached
I can make dig succeed if I explicitly point it at a public DNS server; dig @8.8.8.8 google.com worked.
Below are the contents of my resolv.conf, with the real company name masked with "mycompany.com":
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
# DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 10.136.7.233
nameserver 10.136.7.249
search preprod.awse1.mycompany.com
The above corresponds to the following DHCP option set.
domain-name = preprod.awse1.mycompany.com; domain-name-servers = 10.136.7.233, 10.136.7.249;
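To read that summary string programmatically rather than by eye, a small parser works (a hypothetical helper, keyed to the exact "key = value; key = value" format shown above):

```python
def parse_dhcp_options(text):
    """Parse the 'key = value; key = value' DHCP-option summary into a
    dict; comma-separated values become lists."""
    options = {}
    for part in text.split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition("=")
        values = [v.strip() for v in value.split(",")]
        options[key.strip()] = values if len(values) > 1 else values[0]
    return options
```

For the line above, this yields {'domain-name': 'preprod.awse1.mycompany.com', 'domain-name-servers': ['10.136.7.233', '10.136.7.249']}.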
I think both of the above DNS servers are provided from a different AWS account. Still, a ping test fails on both of those DNS server addresses. I'm not sure whether that means these servers don't exist, or whether they simply do not respond to ICMP.
Just now I created my own DHCP option set, identical to the above except that I changed the DNS servers to 8.8.8.8 and 8.8.4.4, and associated it with the VPC. I then revised my Lambda to output the contents of /etc/resolv.conf to verify that it "took" the new 8.8.8.8/8.8.4.4 DNS servers - and the Lambda still got the same DNS errors! It is very strange that an explicit dig @8.8.8.8 google.com from the EC2 instance works, but a Lambda associated with the same subnet gets a DNS error. I'm wondering whether the ephemeral ENIs associated with the Lambda have their own DNS server records that are not updating quickly enough to reflect the changes I've made.
Incidentally, the VPC has both "DNS resolution" and "DNS hostnames" enabled.
Why would DNS not be working? As shown, it doesn't matter whether I'm using my own DNS servers or Google's.
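One more diagnostic I'm considering: reproducing dig @8.8.8.8 from inside the Lambda itself by hand-rolling a minimal DNS query over UDP (a sketch; build_dns_query assembles a standard A-record question per the DNS wire format, and probe_dns_server only reports whether any reply arrives at all):

```python
import socket
import struct

def build_dns_query(hostname, txid=0x1234):
    """Build a minimal DNS query packet for an A record."""
    # Header: txid, flags=0x0100 (recursion desired), 1 question, 0 answers/ns/ar
    header = struct.pack(">HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is length-prefixed, terminated by a zero byte
    qname = b"".join(bytes([len(p)]) + p.encode() for p in hostname.split("."))
    question = qname + b"\x00" + struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question

def probe_dns_server(server, hostname="google.com", timeout=2.0):
    """Send the query over UDP and report whether any reply arrives."""
    query = build_dns_query(hostname)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        try:
            s.sendto(query, (server, 53))
            data, _ = s.recvfrom(512)
            return len(data) >= 12  # got at least a DNS header back
        except socket.timeout:
            return False
```

If probe_dns_server('8.8.8.8') times out from the Lambda but the equivalent dig succeeds from the EC2 instance, that would point at something between the Lambda ENI and the NAT gateway rather than at the DNS server configuration itself.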