We have a cluster of Tomcat servers in AWS Elastic Beanstalk connected to an AWS RDS (MySQL) instance with Multi-AZ enabled.
A few days ago, an OS patch was applied to the RDS instance, which triggered a Multi-AZ failover to the standby instance.
The result was a Production outage that lasted for hours (it happened at night) until we restarted Tomcat on each instance. We saw thousands of "Connection refused" errors when connecting to the database.
According to AWS support, when a failover happens the endpoint name stays the same but its IP address changes, and my Tomcats had the old IP cached. Restarting Tomcat cleared the cache, so the new IP was used and the connectivity issue was resolved. They referred me to this SO question.
That makes a lot of sense; however, I couldn't reproduce the issue in a controlled test with the same application that runs in Production.
I changed the IP of a domain in /etc/hosts, and my current Elastic Beanstalk Production Tomcat detected the IP change 30 seconds later, so it should have detected the RDS endpoint IP change too.
The Java DNS cache TTL property in my Elastic Beanstalk environment is set as follows:
#networkaddress.cache.ttl=-1
The line is commented out, so the JVM default applies: successful lookups are cached for 30 seconds (when no security manager is installed), which matches my experiment.
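To double-check which TTL the running JVM actually sees, a small check like the one below can help. It is only a sketch: the class name `DnsCacheTtl` and the helper `describe` are made up for illustration, and the 30-second default it reports is the documented behaviour when the property is unset and no security manager is installed. The `sun.net.inetaddr.ttl` system property is an implementation-specific fallback on Oracle/OpenJDK and may be absent on other JVMs.

```java
import java.security.Security;

public class DnsCacheTtl {
    // Hypothetical helper: explains how a networkaddress.cache.ttl value is interpreted.
    static String describe(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            // Unset (for example, commented out in java.security).
            return "unset: implementation default (30 seconds without a security manager)";
        }
        int ttl = Integer.parseInt(raw.trim());
        if (ttl < 0) {
            return "cache forever";
        }
        if (ttl == 0) {
            return "no caching";
        }
        return "cache for " + ttl + " seconds";
    }

    public static void main(String[] args) {
        // Security property read from the JRE's java.security file (or set programmatically).
        String raw = Security.getProperty("networkaddress.cache.ttl");
        System.out.println("networkaddress.cache.ttl = " + raw + " -> " + describe(raw));
        // Implementation-specific fallback; prints null when not set.
        System.out.println("sun.net.inetaddr.ttl = " + System.getProperty("sun.net.inetaddr.ttl"));
    }
}
```

For the result to be meaningful it should run on the same JRE and with the same options as Tomcat, since a different JVM may read a different `java.security` file.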
[EDIT] As suggested in the comments, I also tried to simulate a failover through DNS: I changed a CNAME record so that it points from one domain to another. I ran the same test, and Tomcat again detected the change 30 seconds later.
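For anyone who wants to repeat this kind of test outside Tomcat, a rough sketch of a polling check follows. The class name `ResolveWatcher`, the default host `localhost`, and the default of three rounds one second apart are placeholders, not what I ran in Production. It only shows when the JVM's own resolver starts returning a different address; it says nothing about connections that are already open.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.TreeSet;

public class ResolveWatcher {
    // Resolve a host through the JVM (and therefore through its DNS cache)
    // and return the addresses as a sorted, comma-separated string.
    static String resolve(String host) {
        try {
            TreeSet<String> ips = new TreeSet<>();
            for (InetAddress address : InetAddress.getAllByName(host)) {
                ips.add(address.getHostAddress());
            }
            return String.join(",", ips);
        } catch (UnknownHostException e) {
            return "unresolved";
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Placeholder defaults; pass host, rounds and interval (ms) as arguments.
        String host = args.length > 0 ? args[0] : "localhost";
        int rounds = args.length > 1 ? Integer.parseInt(args[1]) : 3;
        long intervalMillis = args.length > 2 ? Long.parseLong(args[2]) : 1000L;

        String last = null;
        long start = System.currentTimeMillis();
        for (int i = 0; i < rounds; i++) {
            String current = resolve(host);
            if (!current.equals(last)) {
                long elapsedSeconds = (System.currentTimeMillis() - start) / 1000;
                System.out.println("t+" + elapsedSeconds + "s " + host + " -> " + current);
                last = current;
            }
            Thread.sleep(intervalMillis);
        }
    }
}
```

Running it with something like `java ResolveWatcher my.test.domain 60 1000` while changing the record should print a new line once the cached entry expires.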
Do you have any idea why in this case the RDS endpoint IP change was not detected by Tomcat/JVM?