I am using Python 3.9 and boto3 to issue the following statements from a Lambda function. The function works 90% of the time, but every now and then it stops working for a few hours and the Lambda function times out (see the CloudWatch log below). The Lambda timeout is set to 20 seconds; when it works, the S3 file reads in about three hundredths of a second. The get_object call doesn't look like it times out; it just hangs forever. All the exception handlers in the try block are just me trying to troubleshoot. The function never prints any of the except messages to the log.

Any ideas why S3 just drops for hours at a time, and why the boto3 get_object call never times out?

Thanks for any help!


import boto3
import botocore.exceptions

# 5-second connect/read timeouts and no retries, so a failure should surface quickly
config = boto3.session.Config(connect_timeout=5, read_timeout=5, retries={'max_attempts': 0})
s3 = boto3.client('s3',config=config,aws_access_key_id='XXXX',aws_secret_access_key='XXXX')


def GetS3file(filename):
    global responseout
    responsefile=""
    try:
        print("trying s3 get_object:samplebucket:"+filename)
        response=s3.get_object(Bucket='samplebucket',Key=filename)
        print("After S3 get_object")
        responsefile=response['Body'].read().decode('utf-8')
        responsefile=str(responsefile)

    except botocore.exceptions.ClientError as error:
        print("Boto Exception:"+str(error))
        responsefile="Page Not Found: "+filename+"  b-error:"+str(error)
    except botocore.exceptions.ReadTimeoutError as error:
        print("Boto read timeout Exception:"+str(error))
        responsefile="Page Not Found: "+filename+"  b-error:"+str(error)
    except botocore.exceptions.ConnectTimeoutError as error:
        print("Boto connect timeout Exception:"+str(error))
        responsefile="Page Not Found: "+filename+"  b-error:"+str(error)
    except Exception as e:
        print("Exception:"+str(e))
        responsefile="Page Not Found: "+filename+"  error:"+str(e)
    return responsefile


# call the helper only after it has been defined
file = GetS3file("index.html")

CloudWatch log:

2023-01-29T13:20:19.390-08:00   INIT_START Runtime Version: python:3.9.v16 Runtime Version ARN: arn:aws:lambda:us-west-2::runtime:xxxx

2023-01-29T13:20:20.059-08:00   START RequestId: XXX Version: $LATEST

2023-01-29T13:20:20.059-08:00   Exists:False

2023-01-29T13:20:20.059-08:00   Getting S3 File:index.html

2023-01-29T13:20:20.059-08:00   trying s3 get_object:samplebucket:index.html

2023-01-29T13:20:40.082-08:00   2023-01-29T21:20:40.082Z XXX Task timed out after 20.02 seconds

2023-01-29T13:20:40.082-08:00   END RequestId: XXX

2023-01-29T13:20:40.082-08:00   REPORT RequestId: XXX Duration: 20022.77 ms Billed Duration: 20000 ms Memory Size: 128 MB Max Memory Used: 80 MB Init Duration: 667.52 ms

A successful log looks like this, and this happens 90% of the time:

2023-01-29T13:20:27.542-08:00   INIT_START Runtime Version: python:3.9.v16 Runtime Version ARN: arn:aws:lambda:us-west-2::runtime:XXX

2023-01-29T13:20:28.208-08:00   START RequestId: XXX Version: $LATEST

2023-01-29T13:20:28.208-08:00   Exists:False

2023-01-29T13:20:28.208-08:00   Getting S3 File:index.html

2023-01-29T13:20:28.208-08:00   trying s3 get_object:samplebucket:index.html

2023-01-29T13:20:28.512-08:00   After S3 get_object

2023-01-29T13:20:28.555-08:00   Exists:False

2023-01-29T13:20:28.555-08:00   Getting S3 File:mainhead.html

2023-01-29T13:20:28.555-08:00   trying s3 get_object:samplebucket:mainhead.html

2023-01-29T13:20:28.593-08:00   After S3 get_object

2023-01-29T13:20:28.594-08:00   Exists:False

2023-01-29T13:20:28.594-08:00   Getting S3 File:Header.html

2023-01-29T13:20:28.594-08:00   trying s3 get_object:samplebucket:Header.html

2023-01-29T13:20:28.644-08:00   After S3 get_object

2023-01-29T13:20:28.653-08:00   Exists:False

2023-01-29T13:20:28.653-08:00   Getting S3 File:loginform.html

2023-01-29T13:20:28.653-08:00   trying s3 get_object:samplebucket:loginform.html

2023-01-29T13:20:28.728-08:00   After S3 get_object

2023-01-29T13:20:28.729-08:00   index.html

2023-01-29T13:20:28.733-08:00   END RequestId: XXX

2023-01-29T13:20:28.733-08:00   REPORT RequestId: XXX Duration: 525.06 ms Billed Duration: 526 ms Memory Size: 128 MB Max Memory Used: 81 MB Init Duration: 665.54 ms

I tried to access a file in a bucket using Python and boto3's get_object. I expected it to read the file, and it does 90% of the time. But sometimes it stops working for as long as a few hours, and then it starts working again for no discernible reason. I tried logging exceptions and lowering the timeouts on the get_object call so that it would time out and fail with an error message, but it still doesn't time out in the indicated time.

John Rotenstein
Ryan Jones
  • Is your AWS Lambda function configured to connect to a VPC? If so, is there a reason for doing this? Your symptom sounds like it might be configured to connect to multiple subnets in a VPC, but not all subnets are necessarily Private Subnets. This can cause random timeouts if a Public Subnet is selected. The preferable situation is to _not_ use a VPC, which then provides direct access to the Internet. – John Rotenstein Jan 29 '23 at 22:38
  • To follow up on John's comment, see Intermittent Connectivity [here](https://stackoverflow.com/a/52994841/271415). – jarmod Jan 29 '23 at 23:41
  • YES, it is connected to a VPC, and yes it has to hit a backend private mysql server. Checking those subnets now...this makes sense... – Ryan Jones Jan 30 '23 at 00:12
  • Nice Catch! Yes it has a public subnet. I removed that and everything is still connecting...will let it run for a few days and see if that solves this...but that makes a lot of sense. Thank you so much!!!! – Ryan Jones Jan 30 '23 at 00:16
  • One thought though...why wouldn't the S3 get_object call actually time out at 5 seconds? I would think that when the network couldn't connect, it should fail with a timeout. – Ryan Jones Jan 30 '23 at 00:20
  • This seems to have fixed the problem. No downtime since I removed the public subnet. How do I change this comment to an answer? – Ryan Jones Jan 31 '23 at 16:55
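
On the follow-up question about the 5-second timeouts: the thread does not establish why `connect_timeout` and `read_timeout` never fired here. Independently of the cause, one way to get an error message into the log before Lambda kills the invocation is to put a wall-clock deadline around the call. The following is a minimal sketch, not something from the original post: `call_with_deadline` and `DeadlineExceeded` are hypothetical names, and it assumes it is acceptable to leave a stuck worker thread behind.

```python
import threading


class DeadlineExceeded(Exception):
    """Raised when the wrapped call does not finish within the deadline."""


def call_with_deadline(func, deadline_seconds, *args, **kwargs):
    """Run func(*args, **kwargs) in a daemon thread, waiting at most deadline_seconds.

    The worker thread is NOT cancelled. If the call is stuck, it keeps running
    in the background, but the caller gets control back and can log the failure.
    """
    outcome = {}

    def worker():
        try:
            outcome["value"] = func(*args, **kwargs)
        except Exception as exc:  # hand the exception back to the caller
            outcome["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(deadline_seconds)  # returns after the deadline even if still running

    if thread.is_alive():
        raise DeadlineExceeded(f"call did not finish within {deadline_seconds} s")
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]
```

With the question's client, usage could look like `response = call_with_deadline(s3.get_object, 10, Bucket='samplebucket', Key=filename)`, followed by an `except DeadlineExceeded` branch that prints to the log. The 10-second figure is an arbitrary value chosen to sit below the 20-second Lambda timeout.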

2 Answers

The comment from John Rotenstein is the answer.

He comments: Is your AWS Lambda function configured to connect to a VPC? If so, is there a reason for doing this? Your symptom sounds like it might be configured to connect to multiple subnets in a VPC, but not all subnets are necessarily Private Subnets. This can cause random timeouts if a Public Subnet is selected. The preferable situation is to not use a VPC, which then provides direct access to the Internet.

That was the case: the function was attached to a public subnet. After I removed the public subnet, there was no more downtime.
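
For anyone wanting to check their own function for the same situation, here is a rough sketch of how the attached subnets could be inspected with boto3. It treats a subnet as public when its route table has a route to an internet gateway (`igw-...`). A Lambda network interface in a VPC has no public IP address, so traffic sent through such a route cannot reach S3. The function names and the `us-west-2` default (taken from the log above) are assumptions for illustration, not part of the original setup.

```python
def has_internet_gateway_route(route_table):
    """True if any route in a describe_route_tables entry targets an internet gateway."""
    return any(
        (route.get("GatewayId") or "").startswith("igw-")
        for route in route_table.get("Routes", [])
    )


def find_public_subnets(function_name, region_name="us-west-2"):
    """Return the subnet IDs attached to the Lambda function that route to an IGW."""
    import boto3  # imported here so the pure helper above works without boto3

    lambda_client = boto3.client("lambda", region_name=region_name)
    ec2 = boto3.client("ec2", region_name=region_name)

    function_config = lambda_client.get_function_configuration(FunctionName=function_name)
    vpc_config = function_config.get("VpcConfig") or {}
    vpc_id = vpc_config.get("VpcId")

    public_subnets = []
    for subnet_id in vpc_config.get("SubnetIds", []):
        tables = ec2.describe_route_tables(
            Filters=[{"Name": "association.subnet-id", "Values": [subnet_id]}]
        )["RouteTables"]
        if not tables:
            # No explicit association: the subnet uses the VPC's main route table.
            tables = ec2.describe_route_tables(
                Filters=[
                    {"Name": "vpc-id", "Values": [vpc_id]},
                    {"Name": "association.main", "Values": ["true"]},
                ]
            )["RouteTables"]
        if any(has_internet_gateway_route(table) for table in tables):
            public_subnets.append(subnet_id)
    return public_subnets
```

Any subnet ID returned by `find_public_subnets` is a candidate for removal from the function's VPC configuration.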

Thanks John!

Ryan Jones

In most cases this should not occur if you haven't configured a VPC for the Lambda function.

If you have configured a VPC, make sure the subnets attached to the function have a route to a NAT gateway, or use an S3 VPC endpoint.

Once the subnets are configured correctly, the issue should go away.
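
As an illustration of the VPC endpoint option, an S3 gateway endpoint could be created with boto3 roughly as follows. This is a sketch under assumptions: the helper names are made up for this example, the VPC and route table IDs are placeholders you would supply, and `us-west-2` is simply the region that appears in the question's log.

```python
def s3_gateway_endpoint_request(vpc_id, route_table_ids, region_name="us-west-2"):
    """Build the keyword arguments for ec2.create_vpc_endpoint (S3 gateway endpoint)."""
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region_name}.s3",
        # Route tables of the private subnets the Lambda function is attached to.
        "RouteTableIds": list(route_table_ids),
    }


def create_s3_gateway_endpoint(vpc_id, route_table_ids, region_name="us-west-2"):
    """Create the endpoint and return its ID. Requires AWS credentials and boto3."""
    import boto3  # imported here so the request builder works without boto3

    ec2 = boto3.client("ec2", region_name=region_name)
    response = ec2.create_vpc_endpoint(
        **s3_gateway_endpoint_request(vpc_id, route_table_ids, region_name)
    )
    return response["VpcEndpoint"]["VpcEndpointId"]
```

A gateway endpoint adds an S3 route to the listed route tables, so S3 requests from those subnets no longer depend on a NAT gateway.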

DilLip_Chowdary