We are moving our infrastructure to Jenkins with the Kubernetes plugin, and have provisioned a 4-node hybrid cluster as follows:
- 2 nodes on-premises: A and B
- 2 nodes on AWS: C and D
Within a datacenter (A-B or C-D), latency is very low (< 1 ms).
Across the internet (A-C, A-D, B-C, B-D), latency is high (75 ms).
We have noticed that when latency between the Jenkins master and a Jenkins agent is high, the job waits about 45 seconds between the moment its work completes and the moment it is reported as "success". This delay only happens for the first job. When latency is low, the same delay is under 2 seconds. The timings are fully repeatable, and bandwidth is plentiful (> 50 Mb/s).
What surprised us is that the extra time is spent only after the job's work has completed, regardless of how long the job itself takes. In addition, a packet capture shows 25 MB of data sent from the master to the agent during the first job execution.
We were able to replicate this behavior by running both master and agent on-premises and injecting the latency via tc.
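For anyone who wants to reproduce this, the latency injection can be sketched roughly as below. This is only an illustration: the interface name eth0 is an assumption, the helper names are made up for this post, and it applies the 75 ms as a one-way egress delay on a single host, which may not match how the delay is split in a real setup. The commands are printed rather than executed, since running them needs root.

```python
import shlex
from typing import List


def netem_delay_cmd(interface: str, delay_ms: int) -> List[str]:
    """Build a tc command that adds a fixed egress delay with netem.

    `interface` is whatever NIC carries master<->agent traffic
    (hypothetical here; check with `ip link`).
    """
    return ["tc", "qdisc", "add", "dev", interface,
            "root", "netem", "delay", f"{delay_ms}ms"]


def netem_clear_cmd(interface: str) -> List[str]:
    """Build the tc command that removes the netem qdisc again."""
    return ["tc", "qdisc", "del", "dev", interface, "root", "netem"]


# Print the commands instead of running them (they require root).
print(shlex.join(netem_delay_cmd("eth0", 75)))
print(shlex.join(netem_clear_cmd("eth0")))
```

Running the first printed command on the agent host before starting a job, and the second one afterwards, should be enough to compare the high-latency and low-latency cases on the same hardware.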
Has anyone encountered a similar issue, and what kind of solution was applied?
It looks as if Jenkins sends some data over at the end of the job, and that an RPC framework is making sequential calls that are latency-bound. But what is that data, and is there some way to avoid sending it? We've stripped away all non-Kubernetes plugins to test the theory that it might be post-build plugin code, but that had no effect.
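To back up the latency-bound theory, here is a rough back-of-envelope calculation using only the numbers above. It assumes the 75 ms figure is one full round trip (if it is one-way, halve the round-trip count) and that all 25 MB move during the 45-second window, which the capture does not strictly prove.

```python
# Figures reported in this post; everything else is derived from them.
DELAY_S = 45.0          # extra wait after the job's work completes
RTT_S = 0.075           # cross-site latency, assumed to be one round trip
TRANSFER_BYTES = 25e6   # data seen in the packet capture (25 MB, decimal)
BANDWIDTH_BPS = 50e6    # lower bound on available bandwidth (50 Mb/s)


def bulk_transfer_seconds(nbytes: float, bits_per_second: float) -> float:
    """Time to move nbytes if the link were bandwidth-bound only."""
    return nbytes * 8 / bits_per_second


def sequential_round_trips(delay_s: float, rtt_s: float) -> float:
    """Number of strictly sequential request/response pairs that fit in delay_s."""
    return delay_s / rtt_s


bulk_s = bulk_transfer_seconds(TRANSFER_BYTES, BANDWIDTH_BPS)  # 4 seconds
trips = sequential_round_trips(DELAY_S, RTT_S)                 # about 600
bytes_per_trip = TRANSFER_BYTES / trips                        # about 41.7 kB

print(f"bandwidth-bound transfer time: {bulk_s:.1f} s")
print(f"round trips that fit in the delay: {trips:.0f}")
print(f"average payload per round trip: {bytes_per_trip / 1000:.1f} kB")
```

So bandwidth alone would account for only about 4 of the 45 seconds, while roughly 600 sequential round trips of about 40 kB each would account for all of it. That pattern is what we would expect from chunked, one-request-at-a-time transfers rather than a single streamed copy.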