We are struggling with a very weird issue on Google's VMs (it doesn't happen locally with a very similar deployment). We have a docker-compose deployment with a couple of microservices that communicate with each other via RabbitMQ. Most of them also talk to a Postgres DB (on one VM we run the DB in a container, on another we use the Google Cloud SQL service). When we trigger one piece of functionality, which involves 3-4 services and several DB queries, the whole flow takes less than 1 second. But after some period of inactivity (we haven't identified exactly how long, but it seems to be more than 30 minutes) the same flow takes 60 seconds. What we observe are lags of exactly 20 seconds whenever services communicate with each other or query the DB, and it happens only on the first request after the inactivity period.
When we inspect the TCP connections of one of our services, everything looks fine:
$ sudo nsenter -t 9389 -n netstat -tnope
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State User Inode PID/Program name Timer
tcp 0 0 192.168.192.10:59372 192.168.192.11:5672 ESTABLISHED 0 37791595 10732/python keepalive (38.80/0/0)
tcp 0 0 192.168.192.10:59374 192.168.192.11:5672 ESTABLISHED 0 37791599 10732/python keepalive (38.57/0/0)
tcp 0 0 192.168.192.10:59376 192.168.192.11:5672 ESTABLISHED 0 37791604 10732/python keepalive (37.94/0/0)
tcp 0 0 192.168.192.10:59402 192.168.192.11:5672 ESTABLISHED 0 37792668 10732/python keepalive (38.07/0/0)
tcp 0 0 192.168.192.10:50634 10.89.128.3:5432 ESTABLISHED 0 37793882 10732/python keepalive (38.13/0/0)
The connections stay established the whole time, even while the issue is happening. We tried changing the kernel keepalive settings (down to 60 seconds instead of the default 7200), but it didn't help (a per-connection variant of the same idea is sketched further below). It definitely looks like the open TCP connections (the RabbitMQ channels or the DB connection pool) are timing out somewhere, but we cannot find where: as mentioned above, netstat doesn't show any break in the connections. We also traced with tcpdump and no errors showed up there, and we get no errors at the application layer either. Nothing looks wrong except those 20-second lags. The only evidence we could find that supports our guess of a TCP timeout is this command:
$ sudo nsenter -t 9389 -n netstat -s
Ip:
Forwarding: 1
1182 total packets received
0 forwarded
0 incoming packets discarded
1182 incoming packets delivered
1221 requests sent out
Icmp:
0 ICMP messages received
0 input ICMP message failed
ICMP input histogram:
0 ICMP messages sent
0 ICMP messages failed
ICMP output histogram:
Tcp:
30 active connection openings
0 passive connection openings
12 failed connection attempts
0 connection resets received
5 connections established
1068 segments received
1099 segments sent out
8 segments retransmitted
0 bad segments received
1 resets sent
Udp:
114 packets received
0 packets to unknown port received
0 packet receive errors
114 packets sent
0 receive buffer errors
0 send buffer errors
UdpLite:
TcpExt:
5 TCP sockets finished time wait in fast timer
33 delayed acks sent
Quick ack mode was activated 2 times
243 packet headers predicted
169 acknowledgments not containing data payload received
267 predicted acknowledgments
TCPLostRetransmit: 7
TCPTimeouts: 10
TCPDSACKOldSent: 2
1 connections reset due to unexpected data
1 connections aborted due to timeout
TCPRcvCoalesce: 27
TCPSpuriousRtxHostQueues: 1
TCPOrigDataSent: 391
TCPKeepAlive: 118
TCPDelivered: 407
IpExt:
InOctets: 107606
OutOctets: 107275
InNoECTPkts: 1192
In the stats above we can see that the 'TCPTimeouts' counter grows every time the problem occurs. We don't know Google's internal network architecture, but it looks like the TCP connections might be dropped by some intermediate layer. So from the application layer (inside the docker-compose network) everything seems fine, yet somewhere the TCP connection times out and a new one is established silently.
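To make that correlation easier to reproduce, here is a minimal sketch of how the counter can be watched: it just diffs the TcpExt TCPTimeouts field from /proc/net/netstat once per second, and it has to run inside the container's network namespace (via nsenter, like the commands above).

#!/usr/bin/env python3
"""Minimal sketch: print whenever the TcpExt TCPTimeouts counter increases.

Run inside the container's network namespace, e.g.:
    sudo nsenter -t <pid> -n python3 watch_timeouts.py
"""
import time

def read_tcp_timeouts(path="/proc/net/netstat"):
    with open(path) as f:
        lines = f.readlines()
    # /proc/net/netstat stores each section as a header line followed by a
    # value line; find the TcpExt pair and pick the TCPTimeouts column.
    for header, values in zip(lines, lines[1:]):
        if header.startswith("TcpExt:") and values.startswith("TcpExt:"):
            keys = header.split()[1:]
            vals = values.split()[1:]
            return int(dict(zip(keys, vals))["TCPTimeouts"])
    raise RuntimeError("TcpExt section not found")

if __name__ == "__main__":
    last = read_tcp_timeouts()
    while True:
        time.sleep(1)
        current = read_tcp_timeouts()
        if current != last:
            print(time.strftime("%H:%M:%S"), "TCPTimeouts", last, "->", current)
            last = current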
The technology stack we are using is Python, nameko, Postgres and RabbitMQ. Pretty standard stuff, and probably not very relevant to the problem.
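One workaround we are considering (not verified yet) is to force keepalives per connection from the Python side instead of relying only on the kernel settings. A rough sketch, assuming psycopg2 as the Postgres driver (the keepalives_* parameters are standard libpq options; connection details are placeholders):

import socket
import psycopg2

# Postgres: libpq supports per-connection keepalive settings, so the client
# can probe the connection well before any intermediate idle timeout kicks in.
conn = psycopg2.connect(
    host="10.89.128.3",        # DB address from the netstat output above
    dbname="app", user="app", password="placeholder",
    keepalives=1,              # enable TCP keepalive on this connection
    keepalives_idle=30,        # start probing after 30s of idle time
    keepalives_interval=10,    # probe every 10s
    keepalives_count=3,        # drop the connection after 3 failed probes
)

# For plain sockets (e.g. the AMQP connections), the same idea at socket level:
def enable_keepalive(sock, idle=30, interval=10, count=3):
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)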
Has anyone experienced similar issues on GCP?