
Reading the docs I saw this statement:

CloudWatch does not aggregate across dimensions for your custom metrics

That seems like a HUGE limitation, right? It would make custom metrics all but useless in my estimation, so I want to confirm I'm understanding this.

For example, say I had a custom metric I shipped from multiple servers. I want to see it per server, but I also want to see them all together. Would I have no way of aggregating that across all the servers? Or would I be forced to create two custom metrics, one per server and one for all servers, and double-post from the servers to both the per-server metric AND the one aggregating all of them?

red888
  • I was trying to group metrics by dimensions for quite some time and already started to think that either I'm stupid or... You've confirmed my last suspicion. AWS continues to disappoint me almost everywhere; looks like it's time to consider some other providers. The lack of such a basic statistics feature is just enough for me. – Slava Fomin II Jun 25 '21 at 08:59

1 Answer


The docs are correct: CloudWatch won't aggregate across dimensions for your custom metrics (it will do so for some metrics published by other services, like EC2).

This feature may seem obviously useful for your use case, but it's not clear how such aggregation should behave in the general case. CloudWatch allows up to 10 dimensions, so aggregating over all combinations of them could produce a lot of useless metrics, all of which you would be billed for. People may use dimensions to split their metrics between Test and Prod stacks, for example; those are completely separate, and aggregating across them would not make sense.

CloudWatch treats a metric name plus a full set of dimensions as a unique metric identifier. In your case, this means you need to publish your observations separately to each metric you want them to contribute to.

Let's say you have a metric named Latency and you're putting a hostname in a dimension called Server. If you have three servers, this will create three metrics:

  • Latency, Server=server1
  • Latency, Server=server2
  • Latency, Server=server3

So the approach you mentioned in your question will work. If you also want a metric showing the data across all servers, each server needs to publish to an additional metric; the cleanest way is to use a common value for the Server dimension, something like AllServers (see the sketch after the list). This will result in you having 4 metrics, like this:

  • Latency, Server=server1 <- only server1 data
  • Latency, Server=server2 <- only server2 data
  • Latency, Server=server3 <- only server3 data
  • Latency, Server=AllServers <- data from all 3 servers
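A minimal sketch of that dual publishing, assuming boto3 and a placeholder namespace SomeNamespace (the server name and the 237 ms value are also just illustrative):

import boto3

# Each server reports one latency observation twice: once under its own
# Server dimension value and once under the shared AllServers value.
cloudwatch = boto3.client("cloudwatch")

def publish_latency(server, latency_ms):
    cloudwatch.put_metric_data(
        Namespace="SomeNamespace",
        MetricData=[
            {
                "MetricName": "Latency",
                "Dimensions": [{"Name": "Server", "Value": server}],
                "Value": latency_ms,
                "Unit": "Milliseconds",
            },
            {
                "MetricName": "Latency",
                "Dimensions": [{"Name": "Server", "Value": "AllServers"}],
                "Value": latency_ms,
                "Unit": "Milliseconds",
            },
        ],
    )

publish_latency("server1", 237)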

Update 2019-12-17

Using the metric math SEARCH function: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html

This will give you per-server latency and latency across all servers without publishing a separate AllServers metric, and if a new server shows up it will be picked up by the expression automatically:

Graph source:

{
    "metrics": [
        [ { "expression": "SEARCH('{SomeNamespace,Server} MetricName=\"Latency\"', 'Average', 60)", "id": "e1", "region": "eu-west-1" } ],
        [ { "expression": "AVG(e1)", "id": "e2", "region": "eu-west-1", "label": "All servers", "yAxis": "right" } ]
    ],
    "view": "timeSeries",
    "stacked": false,
    "region": "eu-west-1"

}

The result will be a graph like this:

[screenshot: graph produced by the SEARCH expression, with the per-server latency series on the left axis and the "All servers" average on the right]

Downsides of this approach:

  • Expressions are limited to 100 metrics.
  • Overall aggregation is limited to available metric math functions, which means percentiles are not available as of 2019-12-17.
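The same expressions can also be used programmatically rather than on a dashboard. A rough sketch querying them through the GetMetricData API with boto3 (same assumed SomeNamespace, arbitrary one-hour window):

import boto3
from datetime import datetime, timedelta, timezone

# Runs the same SEARCH + AVG expressions as the graph source above.
cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")
now = datetime.now(timezone.utc)

response = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {
            "Id": "e1",
            "Expression": "SEARCH('{SomeNamespace,Server} MetricName=\"Latency\"', 'Average', 60)",
        },
        {
            "Id": "e2",
            "Expression": "AVG(e1)",
            "Label": "All servers",
        },
    ],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
)

for result in response["MetricDataResults"]:
    print(result["Label"], result["Values"][:5])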

Using Contributor Insights (open preview as of 2019-12-17): https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContributorInsights.html

If you publish your logs to CloudWatch Logs in JSON or Common Log Format (CLF), you can create rules that keep track of top contributors. For example, a rule that keeps track of servers with latencies over 400 ms would look something like this:

{
    "Schema": {
        "Name": "CloudWatchLogRule",
        "Version": 1
    },
    "AggregateOn": "Count",
    "Contribution": {
        "Filters": [
            {
                "Match": "$.Latency",
                "GreaterThan": 400
            }
        ],
        "Keys": [
            "$.Server"
        ],
        "ValueOf": "$.Latency"
    },
    "LogFormat": "JSON",
    "LogGroupNames": [
        "/aws/lambda/emf-test"
    ]
}
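For this rule to match, each log event in the /aws/lambda/emf-test log group has to be a JSON object carrying the Server and Latency fields; a hypothetical matching event, emitted here from Python just for illustration, could be as small as:

import json

# A log event the rule above would count: Latency is over the 400 ms
# threshold and Server identifies the contributor. Values are made up.
print(json.dumps({"Server": "server2", "Latency": 512}))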

The result is a list of the servers with the most data points over 400 ms.


Bringing it all together with the CloudWatch Embedded Metric Format: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format.html

If you publish your data in the CloudWatch Embedded Metric Format, you can:

  • Easily configure dimensions, so you can have per-server metrics and an overall metric if you want (see the sketch after this list).
  • Use CloudWatch Logs Insights to query and visualise your logs.
  • Use Contributor Insights to get top contributors.
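A minimal sketch of one such EMF log line, again assuming the placeholder SomeNamespace and illustrative values; the empty dimension set is what produces the overall, non-dimensioned metric alongside the per-server one:

import json
import time

# One Embedded Metric Format log event. "Dimensions" lists two dimension sets:
# ["Server"] gives a per-server Latency metric, and the empty set [] gives an
# overall Latency metric with no dimensions.
def emf_latency_line(server, latency_ms):
    return json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "SomeNamespace",
                "Dimensions": [["Server"], []],
                "Metrics": [{"Name": "Latency", "Unit": "Milliseconds"}],
            }],
        },
        "Server": server,
        "Latency": latency_ms,
    })

print(emf_latency_line("server1", 237))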
Dejan Peretin
  • "may result in a lot of useless metrics, for all of which you would be billed". If CloudWatch supported server-side on-the-fly mathematics for aggregation, there would be no need for extra metrics and more costs. What you describe is an ok workaround. But consider a common case with say 3 dimensions with each 10 values = 30 metrics. To have ad-hoc flexibility I would need to store 3*10*10 = 300 metrics (all combination from 2 dim. with 1 "All" for the third). This is factor 10 and hence my costs are increased by factor 10. – Max May 24 '18 at 07:53
  • 7
    CloudWatch recently launched Metric Math feature: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html This can be used to aggregate metrics on the server-side on GET. Wildcards are not supported, so you still have to list all the metrics you want aggregated, but you don't have to dual publish anymore. – Dejan Peretin May 24 '18 at 22:16
  • ^ perhaps this should be added a second/updated answer? – red888 Dec 04 '19 at 21:30
  • CloudWatch also launched SEARCH with metric math, so wildcards can be used now. I'll update the answer to reflect the current state of the matter. – Dejan Peretin Dec 04 '19 at 21:47
  • 8
    It's not hard to do - and if customers want it, CW should support it. Treating {metric + dimensions} as a unique identifier and not aggregating across the dimensions makes dimensionality nearly pointless. Why bother? Here's a great use case: I want to emit a failure metric for an algorithm with a dimension X. I want to alarm on any failures. However, I want operators to be able to deep dive into which algorithm flavor X triggered the failure. I do NOT want to have to keep my CFN stack updated every time someone adds a new algorithm (or worse, say the dimension value is dynamic). – Jmoney38 Dec 16 '19 at 22:38
  • 2
    Updated the answer with the current state of things. – Dejan Peretin Dec 17 '19 at 01:49
  • 1
    You can't use `SEARCH()` to create Cloudwatch alarms :-( – LessQuesar Aug 25 '22 at 13:21
  • @LessQuesar True, CloudWatch docs (https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html) say: "You can't create an alarm based on a SEARCH expression. This is because search expressions return multiple time series, and an alarm based on a math expression can watch only one time series." However they also say: "The SUM of an array of time series returns a single time series." So I'm hopeful you could use `SEARCH()` + `SUM()` to create Cloudwatch alarms. – Kalinda Pride Oct 05 '22 at 00:01
  • 2
    I'd suggest putting the update first then the old answer – Tomas G. Nov 19 '22 at 07:59
  • You can do this with Metric Insights which now also supports alarms. https://aws.amazon.com/about-aws/whats-new/2022/12/amazon-cloudwatch-metrics-insights-alarms/ – jaredcnance Dec 23 '22 at 16:26