12

I'm working on an operator that creates watches for different k8s resources. Every now and then I see the exception below in the logs and the application just stops. What is causing this issue, and how can I fix it?

io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 29309228 (33284573)
    at kubernetes.client@4.6.4/io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:263)
    at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323)
    at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
    at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
    at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
    at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
    at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
    at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
BobCoder

2 Answers

8

I'm from the Fabric8 Kubernetes Client team. It is standard Kubernetes behavior to return 410 after some time during a watch, and it is usually the client's responsibility to handle it. In the context of a watch, the API server returns HTTP_GONE when you ask for changes since a resourceVersion that is too old, i.e. when it can no longer tell you what has changed since that version because too many things have changed. In that case you need to start again without specifying a resourceVersion; the watch will then send you the current state of the thing you are watching, followed by updates from that point.
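
As a rough illustration of the relist-on-410 pattern described above, here is a minimal, self-contained sketch. It deliberately does not use the Fabric8 API: `FakeApiServer`, `ResyncingWatcher`, `GoneException` and the `historySize` parameter are hypothetical names invented for this example, and the server is just an in-memory stand-in that retains a limited number of events. The error message only imitates the shape of the one in the log.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory stand-in for the API server; not a Fabric8 or Kubernetes class. */
class FakeApiServer {
    /** One change event, tagged with the resourceVersion at which it happened. */
    static final class Event {
        final long resourceVersion;
        final String name;
        final String value;

        Event(long resourceVersion, String name, String value) {
            this.resourceVersion = resourceVersion;
            this.name = name;
            this.value = value;
        }
    }

    /** Simulates the HTTP 410 (Gone) answer for a resourceVersion that is too old. */
    static final class GoneException extends RuntimeException {
        GoneException(long requested) {
            super("too old resource version: " + requested);
        }
    }

    private final int historySize; // how many events the server retains (assumption)
    private final Deque<Event> history = new ArrayDeque<>();
    private final Map<String, String> state = new TreeMap<>();
    private long currentVersion = 0;

    FakeApiServer(int historySize) {
        this.historySize = historySize;
    }

    void put(String name, String value) {
        currentVersion++;
        state.put(name, value);
        history.addLast(new Event(currentVersion, name, value));
        if (history.size() > historySize) {
            history.removeFirst(); // old events are dropped for good
        }
    }

    long currentVersion() { return currentVersion; }

    Map<String, String> list() { return new TreeMap<>(state); }

    /** Returns every event after resourceVersion, or throws if some were already dropped. */
    List<Event> watchSince(long resourceVersion) {
        long oldestRetained = history.isEmpty()
                ? currentVersion + 1
                : history.peekFirst().resourceVersion;
        if (resourceVersion + 1 < oldestRetained) {
            throw new GoneException(resourceVersion);
        }
        List<Event> result = new ArrayList<>();
        for (Event e : history) {
            if (e.resourceVersion > resourceVersion) result.add(e);
        }
        return result;
    }
}

/** Client side: follows events and falls back to a fresh list when the server says Gone. */
class ResyncingWatcher {
    final Map<String, String> cache = new TreeMap<>();
    long lastSeenVersion = 0;
    int relists = 0;

    void sync(FakeApiServer server) {
        try {
            for (FakeApiServer.Event e : server.watchSince(lastSeenVersion)) {
                cache.put(e.name, e.value);
                lastSeenVersion = e.resourceVersion;
            }
        } catch (FakeApiServer.GoneException gone) {
            // Our version is too old: drop it, reload the current state, resume from there.
            relists++;
            cache.clear();
            cache.putAll(server.list());
            lastSeenVersion = server.currentVersion();
        }
    }
}

class WatchResyncDemo {
    public static void main(String[] args) {
        FakeApiServer server = new FakeApiServer(2); // retains only the 2 newest events
        ResyncingWatcher watcher = new ResyncingWatcher();
        server.put("a", "1");
        watcher.sync(server); // normal catch-up via events
        for (int i = 2; i <= 6; i++) server.put("a", String.valueOf(i));
        watcher.sync(server); // version 1 is too old by now, so this relists
        System.out.println("relists=" + watcher.relists + " cache=" + watcher.cache);
    }
}
```

In the demo, the second `sync` call cannot be served from the retained events, so the watcher discards its stale version and rebuilds its cache from a list. The key design point is that the error is treated as a signal to resynchronize rather than as a fatal failure. In a real client the list and the new watch would need to be tied to the same resourceVersion, which this single-threaded sketch glosses over.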

Fabric8 does not handle this for a plain watch, but it does handle it in the SharedInformer API (see ReflectorWatcher). I would recommend using the informer API when writing operators, since it is better than a plain list and watch. Here is a simple example of using the SharedInformer API:

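import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.api.model.PodList;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.informers.ResourceEventHandler;
import io.fabric8.kubernetes.client.informers.SharedIndexInformer;
import io.fabric8.kubernetes.client.informers.SharedInformerFactory;

try (KubernetesClient client = new DefaultKubernetesClient()) {
  SharedInformerFactory sharedInformerFactory = client.informers();
  // The last argument is the resync period in milliseconds (30 seconds here).
  SharedIndexInformer<Pod> podInformer = sharedInformerFactory.sharedIndexInformerFor(Pod.class, PodList.class, 30 * 1000L);
  podInformer.addEventHandler(new ResourceEventHandler<Pod>() {
    @Override
    public void onAdd(Pod pod) {
      // Handle creation
    }

    @Override
    public void onUpdate(Pod oldPod, Pod newPod) {
      // Handle update
    }

    @Override
    public void onDelete(Pod pod, boolean deletedFinalStateUnknown) {
      // Handle deletion
    }
  });
  sharedInformerFactory.startAllRegisteredInformers();
  // Note: leaving this try block closes the client, so a real operator
  // has to keep the block alive for as long as the informers should run.
}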

You can find a full demo of a simple operator using Fabric8 SharedInformer API here: PodSet Operator In Java

Rohan Kumar
  • Nice example. What about watching just ConfigMaps/Secrets? In Spring Boot this is handled by org.springframework.cloud.kubernetes.config.reload.EventBasedConfigurationChangeDetector. The problem is that it doesn't handle the situation where the resource version is too old and the status is GONE; it just closes the watcher, and the config map is no longer updated in the application – hudi Sep 02 '20 at 23:12
  • I think this should work if you replace Pod with ConfigMap – Rohan Kumar Sep 03 '20 at 11:20
  • Nope, it is not working in a Spring Boot application. I tried: https://stackoverflow.com/questions/63719631/how-to-watch-configmap-with-sharedinformer . There is no mechanism to propagate the new object in a Spring Boot application – hudi Sep 03 '20 at 11:37
  • For me nothing has changed. Sometimes I get this: too old resource version: 21382351 (21382351) – beatrice Apr 08 '22 at 07:13
0

This workaround worked for me; I hope it helps others. Every time my pod got this “too old resource” error, it halted and restarted itself. I found that if I create the resources manually (in my case a CRD, even a dummy one), there are almost no “too old resource” exceptions, so the operator stays up and running and listening. So, what I have done:

  1. Detect the moment this specific error happens:
     a. System error (which will restart the pod)
     b. Exception with the text "too old resource version"
  2. Create a new dummy CRD object on the platform (before the pod restarts):
     a. Programmatically (fabric8), check whether the dummy CRD exists. If so, delete it.
     b. Programmatically (fabric8), create the dummy CRD again with
  3. The pod then restarts itself (this restart also happened before my code changes, so it is not caused by my code).
  4. When the pod starts up, it creates a secret from the dummy CRD.

From that point on there were almost no restarts, and the operator was up and running and listening. Just don’t forget to give the operator’s service account permission to create and delete those resources.