I'd like to monitor the number of currently running Cadence workflows in Prometheus.

I checked the metrics exported by the various Cadence services (cadence_history, cadence_worker, cadence_frontend and so on), and the only workflow-related metrics I could find were:

  • activity_end_to_end_latency histogram (workflowType is one of the labels)
  • workflow_success counter / workflow_terminate counter / workflow_failed counter

So it seems there are metrics for analyzing completed workflows, but nothing about running ones. Am I right, or did I miss something?

If so, I have to export the metrics myself, and I see two possible solutions:

  • create a gauge and increment/decrement it at the start and end of my workflow, for example:
func MyWorkflow(ctx workflow.Context) error {
    // mymetrics.gauge is an application-defined gauge.
    mymetrics.gauge.Inc()
    // Decrement on every return path, including early error returns.
    defer mymetrics.gauge.Dec()

    if err := workflow.ExecuteActivity(ctx, someActivity).Get(ctx, nil); err != nil {
        return err
    }

    // ...

    return nil
}

The disadvantage of this approach is that workflows terminated manually by the user will not be measured correctly.

  • create a Prometheus exporter that uses the cadence.client.ListOpenWorkflow function to count running workflows. However, the Cadence docs say that "heavy usage of this API may cause huge persistence pressure", so I suppose calling it from a Prometheus exporter is a very bad idea.
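One way to limit that pressure might be to decouple scrapes from Cadence calls: refresh the count on a fixed interval and serve the cached value to Prometheus. The following is only a sketch using the Go standard library. The `openWorkflowCounter` type, the `cachedGauge` type and the metric name `cadence_open_workflows` are all hypothetical; the counter function stands in for whatever call actually returns the number of open workflows.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"sync"
	"time"
)

// openWorkflowCounter is a hypothetical hook: it returns the current number
// of open workflows, e.g. by wrapping a Cadence client call.
type openWorkflowCounter func() (int64, error)

// cachedGauge keeps the most recent count so that Prometheus scrapes
// never trigger a Cadence call themselves.
type cachedGauge struct {
	mu    sync.RWMutex
	value int64
	ok    bool // false until the first successful refresh
}

// refresh fetches a new count; on error the previous value is kept.
func (g *cachedGauge) refresh(count openWorkflowCounter) error {
	n, err := count()
	if err != nil {
		return err
	}
	g.mu.Lock()
	g.value, g.ok = n, true
	g.mu.Unlock()
	return nil
}

// run refreshes immediately and then once per interval until ctx is cancelled.
func (g *cachedGauge) run(ctx context.Context, count openWorkflowCounter, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		_ = g.refresh(count) // errors are ignored in this sketch
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

// render returns the gauge in the Prometheus text exposition format,
// or an empty string if no count has been fetched yet.
func (g *cachedGauge) render() string {
	g.mu.RLock()
	defer g.mu.RUnlock()
	if !g.ok {
		return ""
	}
	return fmt.Sprintf("# TYPE cadence_open_workflows gauge\ncadence_open_workflows %d\n", g.value)
}

// ServeHTTP lets the gauge be mounted as a /metrics handler.
func (g *cachedGauge) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	fmt.Fprint(w, g.render())
}
```

To use it, one would start `go g.run(ctx, counter, time.Minute)` and register the gauge with `http.Handle("/metrics", g)`. The refresh interval then bounds how often Cadence is queried, regardless of how many Prometheus instances scrape the endpoint.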

Do you see any other possible solutions?

trivelt

1 Answer

So it seems there are metrics for analyzing completed workflows, but nothing about running ones. Am I right, or did I miss something?

That's right. No such metric is emitted.

However, the Cadence docs say that "heavy usage of this API may cause huge persistence pressure"

That's not true if you use Advanced Visibility with Elasticsearch.

But with Advanced Visibility you should use the "CountWorkflowExecutions" API instead, which counts open workflows more efficiently.

With basic visibility, it can put significant load on persistence if the number of open workflows is very large, because you have to iterate over pages to get the count.
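To illustrate why paging gets expensive, here is a rough sketch of counting by iterating over pages. It does not use the real Cadence client types: `page` and `listPage` are hypothetical stand-ins that assume each list response carries a batch of executions plus a next-page token, and `maxPages` is an assumed safety cap. The number of persistence reads grows with the number of open workflows, which is the cost described above.

```go
package main

import "errors"

// page is a hypothetical, simplified view of one list-open-workflows response.
type page struct {
	executions    int    // number of executions returned on this page
	nextPageToken []byte // empty when this is the last page
}

// listPage is a hypothetical hook that fetches one page for the given token,
// e.g. by wrapping a Cadence client list call.
type listPage func(token []byte) (page, error)

// errTooManyPages is returned when the cap is hit before the listing ends.
var errTooManyPages = errors.New("page limit reached before the listing finished")

// countOpenByPaging sums the executions over all pages, issuing one list
// call per page and stopping after maxPages calls.
func countOpenByPaging(list listPage, maxPages int) (int, error) {
	total := 0
	var token []byte
	for i := 0; i < maxPages; i++ {
		p, err := list(token)
		if err != nil {
			return 0, err
		}
		total += p.executions
		if len(p.nextPageToken) == 0 {
			return total, nil // last page reached
		}
		token = p.nextPageToken
	}
	// The cap was hit; total only covers the pages read so far.
	return total, errTooManyPages
}
```

With a count-style API, the whole loop collapses into a single request, which is why it is the better fit when Advanced Visibility is available.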

Long Quanzheng