I am getting an error when running a program in my Docker container that makes requests to third-party APIs. The error is 'getaddrinfo EAI_AGAIN'. From my research of existing questions, it appears that this is probably caused by some sort of DNS resolution failure.

What's the cause of the error 'getaddrinfo EAI_AGAIN'?

For more context, I am running a Docker container on a Google Compute Engine VM (Container-Optimized OS). The entrypoint of the Docker image starts a cron job that runs the program once a day, at noon. By the time noon rolls around the next day, the logs show nearly all of the requests timing out with the EAI_AGAIN error, and I am not able to SSH into my Compute Engine VM at all (it hangs).

My current theory is that some sort of network change takes place between when the container starts and when the cron job runs, causing DNS resolution failures. I'm not sure this is correct, though: my knowledge of Docker networking is limited, and I don't know whether the problem is related to crontab or even to Google Compute Engine. If anyone has more info, please share. I am trying to figure out how to either prevent these network issues or automatically run some code to recover from them when they happen.

Davis Owen
  • The item that you cannot ssh into the instance indicates either high CPU usage or thrashing because of no free disk space. The system's log files should tell you what is going wrong. – John Hanley May 18 '23 at 22:28
  • @JohnHanley Certainly possible, because I am using an e2-micro instance, though with 50GB of disk space. However, I checked the logging section of the Google Cloud console and there are no logs showing any errors or warnings, or anything at all, from the period when I was unable to SSH in. Based on the logs from my script, it is trying to do DNS lookups for a lot of consecutive (synchronous) API calls and each of them is timing out, so I'm not sure whether that would use up all of the e2-micro's resources – Davis Owen May 18 '23 at 22:52
  • You need to review the system's logs. That is where the problems will be recorded if the instance is stable enough to do so. – John Hanley May 19 '23 at 01:31
  • I was checking what appears to be the audit logs to no avail https://cloud.google.com/compute/docs/logging/audit-logging – Davis Owen May 19 '23 at 16:23
  • You need to review the logs within the VM instance. If your VM is thrashing or hanging, the system might not be able to send logs to Cloud Logging. – John Hanley May 19 '23 at 16:25
  • I'm looking at journalctl now, on the advice of this post: https://stackoverflow.com/questions/56910528/container-optimized-os-syslog-location I see a lot of sshd connection attempts from unauthorized users (probably expected), then at a certain point `eth0: could not set dhcpv4 route: connection timed out` and `beginning MaxStartups throttling`, followed by a lot of 'network unreachable' errors from device_policy_manager, an OSConfigAgent network error, and the dockerd rpc error `transport is closing`. So lots of network errors start appearing for some reason – Davis Owen May 19 '23 at 16:45
  • The VM's network is failing. That is often caused by either 100% CPU or no free memory. What size instance are you running? Most likely you must choose a larger instance size. – John Hanley May 19 '23 at 16:57
  • e2-micro. I can try upgrading to e2-small or medium. Thanks – Davis Owen May 19 '23 at 17:19
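
To check the resource-exhaustion theory from the comments above, something like the following could be run inside the VM (or logged by the cron job before it starts making requests). This is only a rough sketch: the function names and the thresholds (10% available memory, a load of 2.0 per CPU) are assumptions chosen for illustration, not values taken from this thread or from any Google Cloud guidance.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def memory_available_fraction(meminfo):
    """Fraction of total memory that the kernel reports as available."""
    return meminfo["MemAvailable"] / meminfo["MemTotal"]


def is_under_pressure(meminfo, load_1min, cpu_count,
                      min_free=0.10, max_load_per_cpu=2.0):
    """Return True if memory is low or the 1-minute load is high.

    Both thresholds are illustrative assumptions; tune them for the instance.
    """
    low_memory = memory_available_fraction(meminfo) < min_free
    high_load = load_1min / cpu_count > max_load_per_cpu
    return low_memory or high_load


def check_local_machine():
    """Read the live values on a Linux host and report whether it looks starved."""
    with open("/proc/meminfo") as f:
        meminfo = parse_meminfo(f.read())
    load_1min = os.getloadavg()[0]
    cpu_count = os.cpu_count() or 1
    return is_under_pressure(meminfo, load_1min, cpu_count)
```

Calling `check_local_machine()` at the start of each run and writing the result to the script's own log would at least show whether the VM was already short on memory or CPU when the DNS failures began.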

1 Answer

This was simply a byproduct of my Google Compute Engine VM's network failing, which was due to inadequate resources (CPU or memory). Increasing the instance size from e2-micro to e2-small did the trick.
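
For anyone who sees EAI_AGAIN only occasionally: it is the `getaddrinfo` code for a temporary name-resolution failure, so retrying with a backoff is a common mitigation. Below is a minimal sketch in Python (my actual program is not shown here, so this is not its code). `resolve_with_retry` and its parameters are hypothetical names for illustration. Note that a retry would not have helped in my case, since the whole VM's network was down.

```python
import socket
import time


def resolve_with_retry(host, port=443, attempts=4, base_delay=1.0,
                       resolver=socket.getaddrinfo, sleep=time.sleep):
    """Resolve host, retrying only on the temporary EAI_AGAIN failure.

    Waits base_delay, then 2x, 4x, ... between attempts. Any other
    resolution error, or EAI_AGAIN on the last attempt, is re-raised.
    """
    for attempt in range(attempts):
        try:
            return resolver(host, port)
        except socket.gaierror as exc:
            is_last = attempt == attempts - 1
            if exc.errno != socket.EAI_AGAIN or is_last:
                raise
            sleep(base_delay * 2 ** attempt)
```

The `resolver` and `sleep` parameters exist only so the function can be exercised without real DNS lookups or real waiting. In normal use, `resolve_with_retry("example.com")` is enough.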

Davis Owen