
I have an Amazon ELB in front of Postgres. This is for Kubernetes-related reasons, see this question. I'm trying to work around the maximum AWS ELB Idle Timeout limit of 1 hour so I can have clients that can execute long-running transactions without being disconnected by the ELB. I have no control over the client configuration in my case, so any workaround needs to happen on the server side.

I've come across the tcp_keepalives_idle setting in Postgres, which in theory should get around this by sending periodic keepalive packets to the client, thus creating activity so the ELB doesn't think the client is idle.
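As a sketch of how that server-side setting can be applied without a restart (the setting names are real PostgreSQL parameters; the host and user values here are placeholders, and superuser access is assumed):

```shell
# Sketch: configure server-side TCP keepalives, then reload the config.
# tcp_keepalives_idle: seconds of idleness before the first probe;
# tcp_keepalives_interval: seconds between unanswered probes.
psql -h localhost -U postgres <<'SQL'
ALTER SYSTEM SET tcp_keepalives_idle = 30;
ALTER SYSTEM SET tcp_keepalives_interval = 10;
SQL
psql -h localhost -U postgres -c "SELECT pg_reload_conf();"
```

Note these only take effect for connections established after the reload.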

I tried testing this by setting the idle timeout on the ELB to 2 minutes and tcp_keepalives_idle to 30 seconds, which should make the server send the first keepalive probe after 30 seconds of connection idleness (with further probes following at tcp_keepalives_interval). I then execute the following query through the load balancer: psql -h elb_dns_name.com -U my_user -c "select pg_sleep(140)". After 2 minutes, the ELB disconnects the client. Why aren't the keepalives getting through to the client? Could pg_sleep somehow be blocking them? If so, is there a better way to simulate a long-running query/transaction?

I fear this might turn into a deep dive and I may need to bring out tcpdump or similar tools. Unfortunately the traffic is a bit harder to parse with all of the k8s chatter going on as well, so before going down that route I wanted to check whether I was missing something obvious. If not, any tips on how best to verify that a keepalive is actually being sent by the server, passing through the ELB, and arriving at the client would be much appreciated.
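For that kind of verification, one sketch of a tcpdump invocation: keepalive probes are ACK-only segments with zero TCP payload, so they can be isolated with a length filter (port 5432 and root access on the server host are assumptions here):

```shell
# Sketch: capture ACK-only packets with no TCP payload on the Postgres
# port -- keepalive probes match this pattern. The length expression is
# IP total length minus IP header length minus TCP header length.
sudo tcpdump -i any -nn \
  'tcp port 5432 and tcp[tcpflags] == tcp-ack and (ip[2:2] - ((ip[0]&0xf)<<2) - ((tcp[12]&0xf0)>>2)) == 0'
```

Running this on both the server and the client host shows whether probes leave the server and whether they survive the ELB.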

Update: I reached out to Amazon about this. Apparently "idle" is defined as not transferring data over the wire, and "data" is defined as any network packet that carries a payload. Since TCP keepalives have no payload, connections exchanging only keepalives are still considered idle. So unless there's a way to make the server send actual data, either inside the keepalive packets or in some other form, this may be impossible.

Lee Hampton
  • Try to move to a more scalable approach that does not require a single long-running transaction, but many smaller transactions. – Daniel W. Feb 26 '19 at 13:30
  • Unfortunately I don't have full control over that. There are many valid analytical use cases that involve long running queries and aren't inherently unscalable, I can't discount those cases. – Lee Hampton Feb 26 '19 at 15:36

1 Answer


Keepalives are sent on the TCP level, well below PostgreSQL, so it doesn't make a difference if the server is running a pg_sleep or something else.

Since a hosted database is somewhat of a black box, you could try to control the behavior on the client side. The fortunate thing is that PostgreSQL also offers keepalive parameters on the client side.

Experiment with

psql 'host=elb_dns_name.com user=my_user keepalives_idle=1800' -c 'select pg_sleep(140)'
Laurenz Albe
  • Handling it on the client side is definitely the ideal approach. In this case unfortunately I have no control over the client. I'll update the question to reflect this. – Lee Hampton Feb 26 '19 at 13:21
  • If you have neither control over the server nor over the client, it is hard to configure anything. What exactly is the client? Does it use `libpq`? – Laurenz Albe Feb 26 '19 at 13:29
  • I have full control over the Postgres server. The only issue is that it sits behind an ELB that has an idle timeout of 60 minutes. However, the ELB has no custom logic/termination -- it should simply pass the TCP traffic through to the postgres instance, which is why I'm confused as to why the server-side keepalives aren't working. The clients are very heterogeneous. Some are just psql connections, some use libpq, some use various golang drivers, etc. – Lee Hampton Feb 26 '19 at 14:26
  • You could try with `psql` as I suggested and see if that helps. If not, that ELB has a funny definition of "idle". – Laurenz Albe Feb 26 '19 at 14:38
  • Very interesting -- the client side approach with psql also results in idle timeouts... it seems that keep alives on either the client or server side are not sufficient to reset the idle timeout setting on the ELB. I will update the question with that info. – Lee Hampton Feb 26 '19 at 17:03
  • Why not simply ask Amazon? – Laurenz Albe Feb 26 '19 at 17:32
  • So I reached out to Amazon regarding this. Apparently "idle" is defined as not transferring "data" over the wire. "Data" is defined as any network packet that has a payload. Since TCP keep-alives do not have payloads, the client and server keep-alives are considered idle. – Lee Hampton Feb 27 '19 at 18:28
  • That solves the mystery. Pretty annoying. You'll have to send some `SELECT 42` regularly. – Laurenz Albe Feb 27 '19 at 18:44
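As a sketch of that `SELECT 42` workaround (hostname, user, and interval are placeholders): because the ELB tracks idleness per TCP connection, the trivial query has to travel over the same connection it is meant to keep alive, e.g. between statements of a long transaction rather than from a separate session:

```shell
# Sketch: keep a single psql connection non-idle by feeding it a trivial
# query every 5 minutes. psql reads statements from stdin over one
# connection, so each SELECT sends real payload through the ELB.
while sleep 300; do
  printf 'SELECT 42;\n'
done | psql -h elb_dns_name.com -U my_user
```

This does not help a single statement like pg_sleep that blocks for longer than the idle timeout; for that case the timeout itself has to fit the workload.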