
I've recently moved my memcache server behind an Elastic Load Balancer in AWS. I'm also using Flask-Cache with this memcache. If I'm not mistaken (and it's totally possible I am), Flask-Cache opens a connection to memcache and holds it open. It also appears that the ELB terminates these long-standing connections after some period of time (I think it's about 60 minutes). This will result in errors like:

SomeErrors: error 19 from flush_all: (0x4ff96f0) CONNECTION FAILURE, ::rec() returned zero, server has disconnected

If there was some way I could catch these errors and reconnect (or some magic setting to "try to reconnect on connection failure"), that would solve this problem.

FWIW, I'm using pylibmc, but don't see anything obvious (to me) that I could pass.

Any help would be greatly appreciated!

Hoopes

1 Answer


Being disconnected from ELB is very common and also very difficult to debug. Here are a few things that might help:

Debugging Ideas

  • Attempt to debug the problem in a staging environment with only one instance connected to ELB.
  • Make sure you have application logging with timestamps. If you catch all exceptions in Python (which is generally not a great idea), make sure you log each one. Otherwise a subtle bug can hide behind the catch-all and show up looking like something else.

  • Simulate the failure (e.g. manually remove one instance from the ELB), then check that the failure shows up in your logs. If you can reproduce the same behavior, then you can figure out how to fix it.

  • Look into a web service automated testing tool like https://loader.io/. This can be very helpful to simulate the conditions when the disconnects appear to happen.

  • Try the same application with a different load balancer, e.g. HAProxy (I would probably try this last).
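To tie the logging and failure-simulation ideas together, here is a minimal sketch of a wrapper that logs a connection error with a timestamp, rebuilds the client, and retries once. Everything in it is an assumption for illustration: `ReconnectingClient`, `FlakyClient`, and the `make_client` factory are hypothetical names, and the built-in `ConnectionError` stands in for whatever your client actually raises (with pylibmc you would likely pass its own exception class instead, so check which one you see in your logs).

```python
import logging
import time

# Timestamped logging, so disconnects can be lined up with ELB events.
logging.basicConfig(
    format="%(asctime)s %(levelname)s %(message)s", level=logging.INFO
)
log = logging.getLogger("cache")


class ReconnectingClient:
    """Hypothetical wrapper: on a connection error, log it, rebuild the client, retry."""

    def __init__(self, make_client, errors=(ConnectionError,), retries=1, delay=0.0):
        self._make_client = make_client  # factory returning a fresh client
        self._errors = errors            # exception types treated as "disconnected"
        self._retries = retries          # how many reconnect attempts per call
        self._delay = delay              # seconds to wait before reconnecting
        self._client = make_client()

    def call(self, method, *args, **kwargs):
        attempt = 0
        while True:
            try:
                return getattr(self._client, method)(*args, **kwargs)
            except self._errors:
                # Log the full traceback instead of swallowing the error.
                log.exception("cache call %r failed (attempt %d)", method, attempt + 1)
                if attempt >= self._retries:
                    raise
                attempt += 1
                time.sleep(self._delay)
                self._client = self._make_client()  # open a new connection


class FlakyClient:
    """Stand-in client: the first instance always fails, simulating a dropped connection."""

    instances = 0

    def __init__(self):
        FlakyClient.instances += 1
        self._broken = FlakyClient.instances == 1

    def get(self, key):
        if self._broken:
            raise ConnectionError("server has disconnected")
        return "value-for-" + key


cache = ReconnectingClient(FlakyClient)
result = cache.call("get", "user:1")
print(result, FlakyClient.instances)
```

In a real application the factory would build your actual memcache client, and `errors` would list the exception types that client raises on a dropped connection. The fake client is only there so the reconnect path can be exercised without a load balancer.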

Noah Gift
  • Hi Noah, thanks very much for the answer. I'm pretty certain I'm getting disconnected at the ELB. I was hoping for some incantation or setting in `pylibmc` (or even redis) to auto-reconnect on error when a long-running connection is held open. I'll keep your advice in mind, though, if I need to do a more in-depth investigation. Thanks again! – Hoopes Oct 17 '16 at 17:25
  • Another thing to try is to strace the process: http://stackoverflow.com/questions/4053142/how-to-track-child-process-using-strace – Noah Gift Oct 18 '16 at 03:55