
We are struggling with a very strange issue on Google's VMs (it doesn't happen locally with a very similar deployment). We have a docker-compose deployment with a couple of microservices. They communicate with each other via RabbitMQ. Most of them also talk to a Postgres DB (on one VM we run the DB in a container, on another we use the Google Cloud SQL service). When we trigger one of our functionalities, which involves 3-4 services and DB queries, the whole flow takes less than 1 second. But after some period of inactivity (we haven't identified exactly how long, but it seems to be more than 30 minutes) the same flow takes 60 seconds. What we observed are lags of exactly 20 seconds when services communicate with each other or when there is a query to the DB, and it happens only on the first request after the inactivity period.

When we observe the TCP connections in one of our services, everything seems fine:

$ sudo nsenter -t 9389 -n netstat -tnope
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       User       Inode      PID/Program name     Timer
tcp        0      0 192.168.192.10:59372    192.168.192.11:5672     ESTABLISHED 0          37791595   10732/python         keepalive (38.80/0/0)
tcp        0      0 192.168.192.10:59374    192.168.192.11:5672     ESTABLISHED 0          37791599   10732/python         keepalive (38.57/0/0)
tcp        0      0 192.168.192.10:59376    192.168.192.11:5672     ESTABLISHED 0          37791604   10732/python         keepalive (37.94/0/0)
tcp        0      0 192.168.192.10:59402    192.168.192.11:5672     ESTABLISHED 0          37792668   10732/python         keepalive (38.07/0/0)
tcp        0      0 192.168.192.10:50634    10.89.128.3:5432        ESTABLISHED 0          37793882   10732/python         keepalive (38.13/0/0)
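
The keepalive timers in the last column show that per-socket keepalive is enabled. For reference, enabling it explicitly on a socket in Python looks roughly like this (a minimal sketch; the helper name and values are illustrative, not our exact code):

import socket

def enable_keepalive(sock, idle=60, interval=10, count=5):
    # Turn on TCP keepalive for this socket and override the
    # system-wide defaults (the TCP_KEEP* constants are Linux-specific).
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)      # seconds idle before the first probe
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval) # seconds between probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)      # failed probes before the kernel drops the connection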

The connections stay established the whole time, even when the issue happens. We tried changing the system keepalive settings (to 60 seconds instead of the default 7200), but it didn't help. It definitely looks like the open TCP connections (RabbitMQ channels or DB-pool connections) are timing out somewhere, but we cannot find where: as mentioned above, netstat doesn't show any break in the connections. We also traced with tcpdump, and no errors showed up there. We aren't getting errors at the application layer either. Nothing is wrong except those 20-second lags. The only evidence we could find that supports our guess of a TCP timeout is this command:

$ sudo nsenter -t 9389 -n netstat -s
Ip:
    Forwarding: 1
    1182 total packets received
    0 forwarded
    0 incoming packets discarded
    1182 incoming packets delivered
    1221 requests sent out
Icmp:
    0 ICMP messages received
    0 input ICMP message failed
    ICMP input histogram:
    0 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
Tcp:
    30 active connection openings
    0 passive connection openings
    12 failed connection attempts
    0 connection resets received
    5 connections established
    1068 segments received
    1099 segments sent out
    8 segments retransmitted
    0 bad segments received
    1 resets sent
Udp:
    114 packets received
    0 packets to unknown port received
    0 packet receive errors
    114 packets sent
    0 receive buffer errors
    0 send buffer errors
UdpLite:
TcpExt:
    5 TCP sockets finished time wait in fast timer
    33 delayed acks sent
    Quick ack mode was activated 2 times
    243 packet headers predicted
    169 acknowledgments not containing data payload received
    267 predicted acknowledgments
    TCPLostRetransmit: 7
    TCPTimeouts: 10
    TCPDSACKOldSent: 2
    1 connections reset due to unexpected data
    1 connections aborted due to timeout
    TCPRcvCoalesce: 27
    TCPSpuriousRtxHostQueues: 1
    TCPOrigDataSent: 391
    TCPKeepAlive: 118
    TCPDelivered: 407
IpExt:
    InOctets: 107606
    OutOctets: 107275
    InNoECTPkts: 1192

In the stats above we can observe that whenever the problem happens, the 'TCPTimeouts' counter grows. We don't know Google's internal architecture, but it looks like the TCP connections might be getting closed at some intermediate layer. So from the application layer (connected inside the docker-compose network) everything seems fine, but somewhere the TCP connections are definitely timing out and new ones are being created silently.
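
To correlate the lags with that counter, a small poller like the one below can watch it tick (a rough sketch; it has to run inside the container's network namespace, e.g. via nsenter, so it sees the same counters as the netstat output above):

import time

def read_tcpext_counter(name="TCPTimeouts", path="/proc/net/netstat"):
    # /proc/net/netstat comes in pairs of lines: a "TcpExt: <keys...>"
    # header followed by a "TcpExt: <values...>" line; pair them up
    # and return the requested counter.
    with open(path) as f:
        lines = f.readlines()
    for header, values in zip(lines[::2], lines[1::2]):
        if header.startswith("TcpExt:"):
            keys = header.split()[1:]
            vals = [int(v) for v in values.split()[1:]]
            return dict(zip(keys, vals)).get(name)
    return None

last = read_tcpext_counter()
while True:
    time.sleep(1)
    cur = read_tcpext_counter()
    if cur != last:
        print(f"TCPTimeouts: {last} -> {cur}")  # fires when a retransmission timeout occurs
        last = cur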

The technology stack we are using is: Python, nameko, PostgreSQL, RabbitMQ. Regular stuff, so I guess it is not very relevant to the problem.

Has anyone experienced similar issues on GCP?

Krzysiek
  • Can you confirm how you are using Docker: is it running standalone on a VM instance, or is it being orchestrated by Kubernetes? If so, can you provide kubelet logs? – Anurag Sharma Oct 24 '20 at 19:48
  • @AnuragSharma It is orchestrated by docker-compose with the default network. – Krzysiek Oct 26 '20 at 10:23

1 Answer


I believe the behavior is related to the way the applications configure their connections to the database servers. I have found this link, which indicates: "Limiting a connection's lifetime can help prevent abandoned connections from accumulating. You can use the connection pool to limit your connection lifetimes." It looks like the default value for "pool_recycle" is 30 minutes, which roughly matches the estimate you have provided. Properly adjusting these settings might help you fix the issue or find its cause. This doc also explains how this works when creating a connection string for an MS SQL Server; note that this might not be your case, but it helps in understanding how it works with any database server in general. Additionally, the 60-second delay you mentioned, after which normal connectivity resumed, is likely due to how read packets are handled inside an SQL server, as described in this link. The link is specific to MS SQL, but the behavior should hold for any flavor of SQL server (see the second note in blue). You may also check out this other thread, which discusses a similar deployment.
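
For example, with SQLAlchemy (commonly used under nameko for Postgres access) you could cap the connection lifetime well below the idle window and validate connections on checkout. A sketch only; the DSN and values are placeholders:

from sqlalchemy import create_engine

# Placeholder DSN; the point here is the pool settings below.
engine = create_engine(
    "postgresql://user:password@10.89.128.3:5432/appdb",
    pool_recycle=600,    # drop pooled connections older than 10 minutes
    pool_pre_ping=True,  # test each connection (SELECT 1) before handing it out
)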

  • Hmm, but it doesn't look like the sources you mentioned could cause such a problem - for instance, after the pool_recycle time has passed, the connection should simply be re-established (not hang for 20 seconds), and after a connection timeout an error should be thrown, but in my case everything happens silently: no errors, it just hangs and then works fine after a while. Still, I experimented a bit and set "pool_recycle": 3600*5, "pool_pre_ping": True, but it didn't help. I also don't think it comes down to the DB alone - RabbitMQ messages likewise arrive at their destination after a 20-second delay. – Krzysiek Nov 06 '20 at 14:20