Python "[WARNING] Retrying after connection broken by SSLError"

Question

I am running a RESTful API using Python and SQLAlchemy. It runs serverless behind an AWS Lambda proxy. The problem is that calls to any API endpoint sometimes return a 504 HTTP status code. It is not one particular endpoint timing out; it seems random and happens roughly every 20th API call (very peculiar). The API Gateway has a timeout of 30 seconds.

I am using Python with SQLAlchemy and a PostgreSQL database. Digging into the logs, I found this error:

[WARNING] Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) 
after connection broken by 'SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol
(_ssl.c:1091)'))': /api/6261774/envelope/

which seems to be consistently happening when the 504 error occurs.

Any support much appreciated.

Initial thoughts:

Given that it's occasional, could it be a network connection failure?

Provisioned concurrency does run the initialization code which I believe is where the db connection is made. Would it be worth giving it a try with a single instance?
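
One way to test that idea is to stop relying on a connection opened during initialization and instead check the connection's age inside the handler. The following is only a sketch under assumptions: `get_connection`, `MAX_IDLE_SECONDS`, and the `connect` factory are hypothetical names, not part of the application described above, and the 60-second threshold is arbitrary.

```python
import time

# Assumed threshold: how long a cached connection may sit idle before it is
# replaced. The value is arbitrary and only for illustration.
MAX_IDLE_SECONDS = 60

# Module-level state survives between invocations of a warm Lambda environment.
_connection = None
_last_used = 0.0


def get_connection(connect, now=time.monotonic):
    """Return the cached connection, reconnecting if it has been idle too long.

    `connect` is a hypothetical zero-argument factory that opens a new
    database connection; `now` is injectable so the logic can be tested.
    """
    global _connection, _last_used
    current = now()
    if _connection is None or current - _last_used > MAX_IDLE_SECONDS:
        # Either nothing is cached yet or the cached connection may be stale.
        _connection = connect()
    _last_used = current
    return _connection
```

With SQLAlchemy itself, the `pool_pre_ping` and `pool_recycle` arguments of `create_engine` serve a similar purpose; whether stale connections are actually behind the 504s here is unconfirmed.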

Where is the database hosted? There is a possibility that firewall or rate limiting may be causing issues. — pygeek, Jun 16 '22 at 05:54
I'd concentrate on "SSLError: EOF occurred in violation of protocol"; there are several questions about it here on Stack Overflow. It might be network-related if a proxy is involved, but it's a server-side fault, so it's not about the direct network connection. The question is whether a particular request triggers the fault, so I'd compare backtraces from working requests with those from failing ones. — David, Jun 17 '22 at 00:40
Here is one related random example question: https://stackoverflow.com/questions/33410577/python-requests-exceptions-sslerror-eof-occurred-in-violation-of-protocol — David, Jun 17 '22 at 00:41
Is your endpoint timing out while connecting to PostgreSQL? Are your clients seeing errors? If you can push all of your logs into CloudWatch (i.e. database, application, API) and turn on debugging, you can more easily visualize the timing of the error and the events leading up to it. — Ray Garcia, Jun 17 '22 at 09:39
Update: We haven't resolved the error yet and are still looking for a concrete answer. We are using Aurora, and we have tracked the problem down to it (before that we used RDS, which didn't have this problem). — Sorin Burghiu, Jun 17 '22 at 13:06

score 0 · Answer 1 · answered Jun 17 '22 at 09:48

I recommend turning on debugging in all layers of your application and pushing your logs into CloudWatch. Next, you can use CloudWatch Logs Insights (it's kind of like Splunk) to look at the timelines of the errors and the events leading up to them.

Here is an article that covers CloudWatch Logs Insights, in case you are new to the service:
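
As a starting point, a query along these lines could list the most recent log events that mention the SSL error. This is a minimal sketch: `build_insights_query` is a hypothetical helper, and the choice of fields and the default limit are assumptions rather than anything taken from your setup.

```python
def build_insights_query(pattern: str, limit: int = 50) -> str:
    """Build a CloudWatch Logs Insights query that lists the most recent
    log events whose message matches `pattern`."""
    # The pattern is embedded in a /regex/ literal, so escape forward slashes.
    safe = pattern.replace("/", r"\/")
    return (
        "fields @timestamp, @requestId, @message\n"
        f"| filter @message like /{safe}/\n"
        "| sort @timestamp desc\n"
        f"| limit {limit}"
    )


# Print a query that looks for the error class from the warning in the question.
print(build_insights_query("SSLEOFError"))
```

The resulting text can be pasted into the Logs Insights console, or passed as `queryString` to the `start_query` call of the boto3 `logs` client if you prefer to run it from a script.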

https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AnalyzingLogData.html

From the article:

CloudWatch Logs Insights enables you to interactively search and analyze your log data in Amazon CloudWatch Logs. You can perform queries to help you more efficiently and effectively respond to operational issues. If an issue occurs, you can use CloudWatch Logs Insights to identify potential causes and validate deployed fixes.

CloudWatch Logs Insights includes a purpose-built query language with a few simple but powerful commands. CloudWatch Logs Insights provides sample queries, command descriptions, query autocompletion, and log field discovery to help you get started. Sample queries are included for several types of AWS service logs.

CloudWatch Logs Insights automatically discovers fields in logs from AWS services such as Amazon Route 53, AWS Lambda, AWS CloudTrail, and Amazon VPC, and any application or custom log that emits log events as JSON.

Here is an article about sending PostgreSQL logs to CloudWatch: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_LogAccess.Concepts.PostgreSQL.html

General steps are:

choose the types of database logs you want (query failures, deadlocks, fatal errors, etc.)
set log retention
set log rotation
set log destination
send to CloudWatch
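
Once database, application, and API logs are in one place, the "events leading up to the error" idea could be sketched as below. Everything here is illustrative: `events_before_errors` is a hypothetical helper, the five-second window is an assumption, and the sample events are made up rather than taken from real logs.

```python
from datetime import datetime, timedelta


def events_before_errors(events, marker="SSLError", window_seconds=5):
    """For each event whose message contains `marker`, collect the events
    (from any source) logged within `window_seconds` before it.

    `events` is an iterable of (timestamp, source, message) tuples.
    Returns a list of (error_event, preceding_events) pairs.
    """
    ordered = sorted(events, key=lambda e: e[0])
    window = timedelta(seconds=window_seconds)
    result = []
    for ts, source, message in ordered:
        if marker not in message:
            continue
        # Keep events strictly before the error but inside the window.
        preceding = [e for e in ordered if ts - window <= e[0] < ts]
        result.append(((ts, source, message), preceding))
    return result


# Made-up sample events, for illustration only.
sample = [
    (datetime(2022, 6, 17, 9, 0, 0), "api", "request started"),
    (datetime(2022, 6, 17, 9, 0, 2), "database", "connection closed"),
    (datetime(2022, 6, 17, 9, 0, 4), "app",
     "Retrying after connection broken by 'SSLError(...)'"),
    (datetime(2022, 6, 17, 9, 5, 0), "api", "request started"),
]

for error, preceding in events_before_errors(sample):
    print(error[0].isoformat(), error[2])
    for ts, source, message in preceding:
        print("   ", ts.isoformat(), source, message)
```

Grouping by a time window is a deliberately simple choice; if your logs carry a request ID, correlating on that would likely be more precise.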

Thanks, we have logs enabled, but sadly nothing too concrete is coming through. I thought the error stated in the question might be more common, but apparently it's very broad! — Sorin Burghiu, Jun 17 '22 at 13:07
It sounds like packet capture might be your next data point. Check out https://aws.amazon.com/blogs/networking-and-content-delivery/using-vpc-traffic-mirroring-to-monitor-and-secure-your-aws-infrastructure/. This is one process you could follow, but you will likely need a network engineer to interpret the results. Just an idea. — Ray Garcia, Jun 17 '22 at 13:45
