
We have an etsy/statsd node application running that flushes stats to carbon/whisper every 10 seconds. If you send 100 increments (counts) in the first 10 seconds, Graphite displays them properly, like:

localhost:3000/render?from=-20min&target=stats_counts.test.count&format=json

[{"target": "stats_counts.test.count", "datapoints": [
 [0.0, 1372951380], [0.0, 1372951440], ... 
 [0.0, 1372952460], [100.0, 1372952520]]}]

However, 10 seconds later this number falls to 0, null, or 33.3. Eventually it settles at a value 1/6th of the initial number of increments, in this case 16.6.

/opt/graphite/conf/storage-schemas.conf is:

[sixty_secs_for_1_days_then_15m_for_a_month]
pattern = .*
retentions = 10s:10m,1m:1d,15m:30d

I would like to get accurate counts. Is Graphite perhaps averaging the data over the 60-second windows rather than summing it? Using the integral function after some time has passed obviously gives:

localhost:3000/render?from=-20min&target=integral(stats_counts.test.count)&format=json

[{"target": "stats_counts.test.count", "datapoints": [
 [0.0, 1372951380], [16.6, 1372951440], ... 
 [16.6, 1372952460], [16.6, 1372952520]]}]
AJP

1 Answer


Graphite data storage

Graphite manages the retention of data using a combination of the settings stored in storage-schemas.conf and storage-aggregation.conf. Your retention policy (the snippet from your storage-schemas.conf) tells Graphite to store only one data point per interval at its highest resolution (10s:10m) and to aggregate those data points as the data ages and moves into the older, lower-resolution intervals (e.g. 1m:1d). In your case, the data crosses into the next retention interval at 10 minutes, and after 10 minutes the data is rolled up according to the settings in storage-aggregation.conf.
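
For reference, here is that same retention line annotated with how many datapoints each archive holds (this is just arithmetic on the intervals you already have, not extra configuration):

retentions = 10s:10m,1m:1d,15m:30d
# 10s:10m -> 60 points at 10-second resolution (the newest 10 minutes)
# 1m:1d   -> 1440 points at 1-minute resolution (rolled up from the 10s archive)
# 15m:30d -> 2880 points at 15-minute resolution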

Aggregation / Downsampling

Aggregation/downsampling happens when data ages and falls into a time interval that has a lower-resolution retention specified. In your case, you have been storing one data point for each 10-second interval, but once that data is over 10 minutes old, Graphite will store it as one data point per 1-minute interval. This means you must tell Graphite how it should take the 10-second data points (of which you have 6 per minute) and aggregate them into one data point for the entire minute. Should it average? Should it sum? Depending on the type of data (e.g. timing, counter) this can make a big difference, as you hinted at in your post.
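
This is exactly where your 16.6 comes from. A minimal sketch of the rollup arithmetic (plain Python, not Graphite's own code), assuming all 100 increments landed in a single 10-second flush within that minute:

# six 10-second datapoints covering one minute; only the first slot received the counts
points = [100.0, 0.0, 0.0, 0.0, 0.0, 0.0]

print(sum(points) / len(points))  # average -> 16.666..., the value you are seeing
print(sum(points))                # sum     -> 100.0, what a counter should report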

By default, Graphite averages data as it aggregates it into lower-resolution intervals. Using average to perform the aggregation makes sense for timer (and even gauge) data. That said, you are dealing with counters, so you'll want to sum.

For example, in storage-aggregation.conf:

[count]
pattern = \.count$
xFilesFactor = 0
aggregationMethod = sum
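
One general Carbon behavior worth noting (not specific to your snippet): sections in storage-aggregation.conf are matched top to bottom and the first matching pattern wins, so the counter rule needs to sit above any catch-all default. For example, with a default section along the lines of the stock example config (treat the exact values as illustrative):

[count]
pattern = \.count$
xFilesFactor = 0
aggregationMethod = sum

# catch-all must come last or it will shadow the rule above
[default_average]
pattern = .*
xFilesFactor = 0.5
aggregationMethod = average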

UI (and raw data) aggregation / downsampling

It is also important to understand how the aggregated/downsampled data is represented when viewing a graph or looking at raw (json) data for different time periods, as the data retention schema thresholds directly impact the graphs. In your case you are querying render?from=-20min which crosses your 10s:10m boundary.

Graphite will display (and perform realtime downsampling of) data according to the lowest-resolution (coarsest) retention interval that the requested time range spans. Stated another way, if you graph data that spans one or more retention intervals you will get rollups accordingly. An example will help (assuming the retention of retentions = 10s:10m,1m:1d,15m:30d).

Any graph requesting data no older than the last 10 minutes will display 10-second aggregations. Once your query crosses the 10-minute threshold, you will begin seeing the count data rolled up into 1-minute points according to the policy set in storage-aggregation.conf.
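
To see that boundary in action, compare the timestamp spacing returned by a query that stays inside the 10-minute window with one that crosses it (same URL form as in your question; the exact spacing assumes the retention above):

localhost:3000/render?from=-9min&target=stats_counts.test.count&format=json
    -> datapoints 10 seconds apart (served from the 10s:10m archive)

localhost:3000/render?from=-20min&target=stats_counts.test.count&format=json
    -> datapoints 60 seconds apart (the whole range is read from the 1m:1d archive)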

Summary / tldr;

Because you are graphing/querying 20 minutes' worth of data (e.g. render?from=-20min), you are definitely falling into a lower-precision storage setting (i.e. 10s:10m,1m:1d,15m:30d), which means that aggregation is occurring according to your aggregation policy. You should confirm that you are using sum for the correct pattern in the storage-aggregation.conf file. Additionally, you can shorten the graph/query time range to less than 10 minutes, which would avoid the dynamic rollup.

Matt Self
  • When viewing `render?from=-10min` it works as expected, so you're spot on there, thanks. However, in `storage-aggregation.conf` I have those lines for summing `.count` metrics, so it seems the dynamic/permanent aggregation by graphite/carbon (I'm not quite sure which does the permanent downsampling) is ignoring them. I doubt it's a bug in graphite (v0.9.10); any advice on what might be at fault? I stopped and restarted carbon-cache.py. Do I need to do the same to graphite for the changes to take effect? – AJP Jul 05 '13 at 13:11
  • 2
    If you changed the schema or aggregation settings after the metric was stored (in whisper = graphite's storage) you'll need to either delete the .wsp files for the metric (graphite will recreate them) or run whisper-resize.py. You can verify the settings by looking at some whisper data by running whisper-info.py against a .wsp file. Find the .wsp file for one of your metrics in /graphite/storage/whisper/ and validate the settings. Run: `whisper-info.py my_metric_data.wsp`. whisper-info.py output should tell you more about how the storage settings are working. – Matt Self Jul 05 '13 at 15:50
  • Please help me with http://stackoverflow.com/questions/20433697/graphite-returning-incorrect-datapoint – GJain Dec 06 '13 at 21:34
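
Following up on the comment above about verifying and resizing existing whisper files, a hedged sketch of the invocations (the .wsp path is an assumption based on the default /opt/graphite/storage/whisper layout; adjust it to wherever your files actually live):

# inspect the retention archives and aggregation method currently baked into the file
whisper-info.py /opt/graphite/storage/whisper/stats_counts/test/count.wsp

# rewrite the existing file so it picks up the retentions from storage-schemas.conf
whisper-resize.py /opt/graphite/storage/whisper/stats_counts/test/count.wsp 10s:10m 1m:1d 15m:30d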