
I would like to monitor VMs (logs and performance metrics) in Azure and other clouds using Google Cloud Logging and Monitoring.

As a proof of concept,

When I check the status of the Ops Agent, I see the following (mildly redacted):

● google-cloud-ops-agent-opentelemetry-collector.service - Google Cloud Ops Agent - Metrics Agent
     Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service; static; vendor preset: enabled)
     Active: failed (Result: exit-code) since Wed 2022-02-16 22:39:22 UTC; 1min 5s ago
    Process: 2730195 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=otel -in /etc/google-cloud-ops-agent/config.yaml -logs ${LOGS_DIRECTORY} (code=exited, status=0/SUCCESS)
    Process: 2730208 ExecStart=/opt/google-cloud-ops-agent/subagents/opentelemetry-collector/otelopscol --config=${RUNTIME_DIRECTORY}/otel.yaml (code=exited, status=1/FAILURE)
   Main PID: 2730208 (code=exited, status=1/FAILURE)

Feb 16 22:39:22 HOSTNAME systemd[1]: google-cloud-ops-agent-opentelemetry-collector.service: Scheduled restart job, restart counter is at 5.
Feb 16 22:39:22 HOSTNAME systemd[1]: Stopped Google Cloud Ops Agent - Metrics Agent.
Feb 16 22:39:22 HOSTNAME systemd[1]: google-cloud-ops-agent-opentelemetry-collector.service: Start request repeated too quickly.
Feb 16 22:39:22 HOSTNAME systemd[1]: google-cloud-ops-agent-opentelemetry-collector.service: Failed with result 'exit-code'.
Feb 16 22:39:22 HOSTNAME systemd[1]: Failed to start Google Cloud Ops Agent - Metrics Agent.

● google-cloud-ops-agent-fluent-bit.service - Google Cloud Ops Agent - Logging Agent
     Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service; static; vendor preset: enabled)
     Active: failed (Result: exit-code) since Wed 2022-02-16 22:39:22 UTC; 1min 5s ago
    Process: 2730194 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=fluentbit -in /etc/google-cloud-ops-agent/config.yaml -logs ${LOGS_DIRECTORY} -state ${STATE_DIRECTORY} (code=exited, status=0/SUCCESS)
    Process: 2730207 ExecStart=/opt/google-cloud-ops-agent/subagents/fluent-bit/bin/fluent-bit --config ${RUNTIME_DIRECTORY}/fluent_bit_main.conf --parser ${RUNTIME_DIRECTORY}/fluent_bit_parser.conf --log_file ${LOGS_DIRECTORY}/logging-module.log --storage_path ${STATE_DIRECTORY}/buffers (co>
   Main PID: 2730207 (code=exited, status=255/EXCEPTION)

Feb 16 22:39:22 HOSTNAME systemd[1]: google-cloud-ops-agent-fluent-bit.service: Scheduled restart job, restart counter is at 5.
Feb 16 22:39:22 HOSTNAME systemd[1]: Stopped Google Cloud Ops Agent - Logging Agent.
Feb 16 22:39:22 HOSTNAME systemd[1]: google-cloud-ops-agent-fluent-bit.service: Start request repeated too quickly.
Feb 16 22:39:22 HOSTNAME systemd[1]: google-cloud-ops-agent-fluent-bit.service: Failed with result 'exit-code'.
Feb 16 22:39:22 HOSTNAME systemd[1]: Failed to start Google Cloud Ops Agent - Logging Agent.

● google-cloud-ops-agent.service - Google Cloud Ops Agent
     Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent.service; enabled; vendor preset: enabled)
     Active: active (exited) since Wed 2022-02-16 22:39:21 UTC; 1min 7s ago
    Process: 2730090 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -in /etc/google-cloud-ops-agent/config.yaml (code=exited, status=0/SUCCESS)
    Process: 2730102 ExecStart=/bin/true (code=exited, status=0/SUCCESS)
   Main PID: 2730102 (code=exited, status=0/SUCCESS)

Feb 16 22:39:21 HOSTNAME systemd[1]: Starting Google Cloud Ops Agent...
Feb 16 22:39:21 HOSTNAME systemd[1]: Finished Google Cloud Ops Agent.

The Ops Agent logs show:

[2022/02/16 22:39:22] [ info] [engine] started (pid=2730207)
[2022/02/16 22:39:22] [ info] [storage] version=1.1.5, initializing...
[2022/02/16 22:39:22] [ info] [storage] root path '/var/lib/google-cloud-ops-agent/fluent-bit/buffers'
[2022/02/16 22:39:22] [ info] [storage] normal synchronization mode, checksum enabled, max_chunks_up=128
[2022/02/16 22:39:22] [ info] [storage] backlog input plugin: storage_backlog.2
[2022/02/16 22:39:22] [ info] [cmetrics] version=0.2.2
[2022/02/16 22:39:22] [ info] [input:storage_backlog:storage_backlog.2] queue memory limit: 47.7M
[2022/02/16 22:39:22] [ info] [output:stackdriver:stackdriver.0] metadata_server set to http://metadata.google.internal
[2022/02/16 22:39:22] [ warn] [output:stackdriver:stackdriver.0] client_email is not defined, using a default one
[2022/02/16 22:39:22] [ warn] [output:stackdriver:stackdriver.0] private_key is not defined, fetching it from metadata server
[2022/02/16 22:39:22] [ warn] [net] getaddrinfo(host='metadata.google.internal', err=-2): Name or service not known
[2022/02/16 22:39:22] [error] [output:stackdriver:stackdriver.0] failed to create metadata connection
[2022/02/16 22:39:22] [error] [output:stackdriver:stackdriver.0] can't fetch token from the metadata server
[2022/02/16 22:39:22] [ warn] [output:stackdriver:stackdriver.0] token retrieval failed
[2022/02/16 22:39:22] [ warn] [net] getaddrinfo(host='metadata.google.internal', err=-2): Name or service not known
[2022/02/16 22:39:22] [error] [output:stackdriver:stackdriver.0] failed to create metadata connection
[2022/02/16 22:39:22] [error] [output:stackdriver:stackdriver.0] can't fetch project id from the metadata server
[2022/02/16 22:39:22] [error] [output] failed to initialize 'stackdriver' plugin
[2022/02/16 22:39:22] [ info] [input] pausing fluentbit_metrics.0
[2022/02/16 22:39:22] [ info] [input] pausing tail.1
[2022/02/16 22:39:22] [ info] [input] pausing storage_backlog.2

I notice the warning "private_key is not defined, fetching it from metadata server", which suggests that the key file is not being picked up.

The documentation says: "The Ops Agent is the primary agent for collecting telemetry from your Compute Engine instances."

Can the Ops Agent only be run on Compute Engine instances, or is it reasonable to expect that it could run anywhere if properly configured?

commander.trout

2 Answers


When google-cloud-ops-agent.service is started, it starts google-cloud-ops-agent-fluent-bit.service and google-cloud-ops-agent-opentelemetry-collector.service and then exits. Environment variables added as overrides to google-cloud-ops-agent.service are not passed on to those two units.

I found that I had to add GOOGLE_APPLICATION_CREDENTIALS to google-cloud-ops-agent-opentelemetry-collector.service and GOOGLE_SERVICE_CREDENTIALS to google-cloud-ops-agent-fluent-bit.service. You can override the systemd units non-interactively:

SYSTEMD_EDITOR=tee systemctl edit google-cloud-ops-agent-fluent-bit.service <<'EOF'
[Service]
Environment='GOOGLE_SERVICE_CREDENTIALS=/etc/google/auth/application_default_credentials.json'
EOF

SYSTEMD_EDITOR=tee systemctl edit google-cloud-ops-agent-opentelemetry-collector.service <<'EOF'
[Service]
Environment='GOOGLE_APPLICATION_CREDENTIALS=/etc/google/auth/application_default_credentials.json'
EOF
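
As a rough alternative to the `systemctl edit` calls above, the same drop-in files can be written directly. This is only a sketch: `write_env_dropin` is a hypothetical helper (not part of the Ops Agent), its optional fourth argument is an assumed scratch-root prefix added so the function can be tried outside `/etc`, and the drop-in path follows the `/etc/systemd/system/<unit>.d/override.conf` layout that `systemctl edit` produces.

```shell
#!/bin/sh
# Hypothetical helper: write a systemd drop-in that sets one environment
# variable for one unit.
# Usage: write_env_dropin UNIT VAR VALUE [ROOT]
# ROOT is an assumption for dry runs; leave it empty to write under /etc.
write_env_dropin() {
    unit=$1
    var=$2
    value=$3
    root=${4:-}
    dir="$root/etc/systemd/system/$unit.d"
    mkdir -p "$dir" || return 1
    # Same two lines the heredocs above feed to `systemctl edit`.
    printf "[Service]\nEnvironment='%s=%s'\n" "$var" "$value" > "$dir/override.conf"
}

# Example, run as root. Files written by hand are not picked up until
# systemd reloads its configuration, so reload before restarting:
#   KEY=/etc/google/auth/application_default_credentials.json
#   write_env_dropin google-cloud-ops-agent-fluent-bit.service GOOGLE_SERVICE_CREDENTIALS "$KEY"
#   write_env_dropin google-cloud-ops-agent-opentelemetry-collector.service GOOGLE_APPLICATION_CREDENTIALS "$KEY"
#   systemctl daemon-reload
#   systemctl restart google-cloud-ops-agent
#   systemctl show -p Environment google-cloud-ops-agent-fluent-bit.service
```

The final `systemctl show` line is one way to confirm that the variable is actually attached to the sub-unit rather than only to the parent service.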
Alan Ivey

The Ops Agent is looking for credentials and not finding them.

This means that either you did not copy the service account key file to the correct location with the correct file permissions, or you did not set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point at a key file with the correct file permissions.

The agent then checks the metadata service, which on Azure does not provide Google OAuth access tokens (Azure provides MSI credentials instead, if set up).
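
A quick sanity check of the key file might look like the sketch below. `check_key_file` is a hypothetical helper written for illustration; it only tests that the path exists, is readable by the user running the check, and contains the `"type": "service_account"` field found in service account JSON keys. It does not validate the key itself, and it says nothing about whether the agent's systemd units can see the variable that points at the file.

```shell
#!/bin/sh
# Hypothetical helper: basic checks on a service account key file.
# Returns 0 when the file looks usable, 1 if it is missing, 2 if it is
# unreadable by the current user, 3 if it is not a service account key.
check_key_file() {
    key=$1
    [ -f "$key" ] || { echo "missing: $key" >&2; return 1; }
    [ -r "$key" ] || { echo "not readable by $(id -un): $key" >&2; return 2; }
    grep -Eq '"type"[[:space:]]*:[[:space:]]*"service_account"' "$key" \
        || { echo "not a service account key: $key" >&2; return 3; }
    echo "ok: $key"
}

# Example, run as the user the agent runs as (root in the question):
#   sudo sh -c '. ./check_key_file.sh; check_key_file /etc/google/auth/application_default_credentials.json'
```

The script name `check_key_file.sh` in the example is likewise an assumption; any file that defines the function would do.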

John Hanley
  • I followed the instructions [here](https://cloud.google.com/monitoring/agent/ops-agent/authorization) and the key file is where it is supposed to be with required ownership and permissions. ```$ ls -la /etc/google/auth/application_default_credentials.json -r-------- 1 root root 2357 Feb 16 03:18 /etc/google/auth/application_default_credentials.json``` I've restarted the service and even restarted the instance with the same result. – commander.trout Feb 17 '22 at 01:26
  • It's not clear to me how to set env var `GOOGLE_APPLICATION_CREDENTIALS` such that a service (root process) has access to its value. – commander.trout Feb 17 '22 at 01:30
  • @commander.trout Double-check that the file is actually a service account JSON key file. Environment variables are set in the service's configuration file. However, based on the error messages, I do not think the service is finding the file. I do not recommend referring us to a link. Show the exact steps you followed in your question. Links break, are modified, etc rendering your question less useful in the future. – John Hanley Feb 17 '22 at 01:49
  • @commander.trout - Login as a normal user and execute these commands: **sudo cat $GOOGLE_APPLICATION_CREDENTIALS** and **sudo cat /etc/google/auth/application_default_credentials.json** Both commands should display the contents of the JSON key file. – John Hanley Feb 17 '22 at 01:50
  • Point taken on posting links. `sudo cat $GOOGLE_APPLICATION_CREDENTIALS` does nothing because the environment variable is not set. `sudo cat /etc/google/auth/application_default_credentials.json` does indeed print the contents of a JSON key file (as expected). – commander.trout Feb 17 '22 at 02:08
  • I created a service configuration file by running `sudo systemctl edit google-cloud-ops-agent` and added ```[Section]Environment=GOOGLE_APPLICATION_CREDENTIALS=/etc/google/auth/application_default_credentials.json```. This resulted in a file at `/etc/systemd/system/google-cloud-ops-agent.service.d/override.conf`. I then ran `systemctl daemon-reload`, `systemctl restart google-cloud-ops-agent`, but again got the same result. – commander.trout Feb 17 '22 at 02:15