Connectivity issue to database after failover

Question

We have a cluster of Tomcat servers in AWS Elastic Beanstalk connected to an AWS RDS (MySQL) instance with Multi-AZ enabled.

A few days ago, an OS patch was applied to the RDS instance, which triggered a Multi-AZ failover to another RDS instance.

The result was a production outage that lasted for hours (it happened at night) until we restarted Tomcat on each instance. We saw thousands of "Connection refused" errors when connecting to the database.

According to AWS support, when a failover instance is launched, the endpoint name stays the same but its IP address changes, and my Tomcat instances had the old IP cached. Restarting Tomcat cleared the cache, the new IP was used, and the connectivity issue was resolved. They referred me to this SO question.

That makes a lot of sense; however, I couldn't reproduce the issue in a controlled test with the same application running in production.

I changed the IP of a domain in /etc/hosts, and my current Beanstalk production Tomcat detected the IP change 30 seconds later, so it should have detected the RDS endpoint IP change too.

The Java TTL property in my Beanstalk environment is set as follows (note that the line is commented out):

#networkaddress.cache.ttl=-1

Since the line is commented out, the JVM falls back to its default cache TTL of 30 seconds, which matches my experiment.
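To confirm what the running JVM actually uses, rather than what the java.security file suggests, a small check like the one below could be run with the same Java installation as Tomcat. It is only a sketch: the class name DnsCacheTtlCheck and the describe helper are made up for illustration, and the "30 seconds" wording reflects the OpenJDK/Oracle default without a security manager, which is implementation-specific. It also prints the legacy sun.net.inetaddr.ttl system property, which those JVMs consult when the security property is not set.

```java
import java.security.Security;

public class DnsCacheTtlCheck {

    /** Describes how the JVM interprets a networkaddress.cache.ttl value. */
    public static String describe(String ttl) {
        if (ttl == null || ttl.trim().isEmpty()) {
            // Assumption: OpenJDK/Oracle default when no security manager is installed.
            return "not set: implementation default (30 seconds in OpenJDK)";
        }
        int seconds = Integer.parseInt(ttl.trim());
        if (seconds < 0) {
            return "cache forever";
        }
        if (seconds == 0) {
            return "no caching";
        }
        return "cache for " + seconds + " seconds";
    }

    public static void main(String[] args) {
        // Security property, normally read from the java.security file.
        String ttl = Security.getProperty("networkaddress.cache.ttl");
        System.out.println("networkaddress.cache.ttl = " + ttl);
        System.out.println("effective policy: " + describe(ttl));

        // Legacy system property used as a fallback when the security property is unset.
        System.out.println("sun.net.inetaddr.ttl = " + System.getProperty("sun.net.inetaddr.ttl"));
    }
}
```

If this prints "cache forever" or a large value inside the Tomcat JVM, something other than the commented-out line is setting the TTL.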

[EDIT] As suggested in the comments, I tried to simulate a failover through DNS: I changed a CNAME record so that one domain pointed to another domain. I ran the same test, and Tomcat again detected the change 30 seconds later.
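For reference, the kind of test described above can be approximated outside Tomcat with a small poller that resolves a hostname through the JVM resolver (and therefore its cache) once per second and logs every change. This is a minimal sketch, not the exact test I ran: the class name ResolveWatcher, the default host "localhost", and the default of 3 polls are placeholders for illustration.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.time.Instant;
import java.util.Set;
import java.util.TreeSet;

public class ResolveWatcher {

    /** Resolves the host via the JVM resolver and returns its addresses, sorted. */
    public static Set<String> resolve(String host) throws UnknownHostException {
        Set<String> addresses = new TreeSet<>();
        for (InetAddress address : InetAddress.getAllByName(host)) {
            addresses.add(address.getHostAddress());
        }
        return addresses;
    }

    public static void main(String[] args) throws InterruptedException {
        // Placeholder defaults; pass the real endpoint and a larger poll count.
        String host = args.length > 0 ? args[0] : "localhost";
        int polls = args.length > 1 ? Integer.parseInt(args[1]) : 3;

        Set<String> last = null;
        for (int i = 0; i < polls; i++) {
            Set<String> now;
            try {
                now = resolve(host);
            } catch (UnknownHostException e) {
                now = new TreeSet<>(); // treat a failed lookup as "no addresses"
            }
            // Log only when the resolved set differs from the previous poll.
            if (!now.equals(last)) {
                System.out.println(Instant.now() + " " + host + " -> " + now);
                last = now;
            }
            if (i < polls - 1) {
                Thread.sleep(1000);
            }
        }
    }
}
```

Running it as `java ResolveWatcher <endpoint> 300` while changing the record should show how long the JVM keeps returning the old address.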

Do you have any idea why the RDS endpoint IP change was not detected by Tomcat/JVM in this case?

Not a java person, but changing the hosts file does not seem fully analogous to changing DNS. — Michael - sqlbot, Oct 03 '18 at 12:00
It's a simplification. From my understanding, it could be used to check the networkaddress cache ttl hypothesis. — IsidroGH, Oct 03 '18 at 12:33
I don't think that's necessarily true, though. Name resolution is often implemented in a more convoluted way than we are inclined to suspect, and what holds for one mechanism may not hold for the other. I'd suggest that testing with actual DNS records is needed, for a conclusive answer. Java is notorious for holding on to DNS records forever. — Michael - sqlbot, Oct 03 '18 at 12:47
