
I'm using Promtail + Loki to collect my logs and I can't figure out how to alert on every error in my log files. I'm also using Prometheus, Alertmanager and Grafana. I've seen that some people have managed to achieve this, but none of them explained the details. Just to be clear, I'm not looking for alerts that stay in the FIRING state or for Grafana dashboards with an "Alerting" status. All I need is to know every single time an error shows up in one of my logs. If it cannot be done exactly this way, the next best solution would be to scrape every X seconds and then send an alert like: "6 new error messages".

EnTm

4 Answers


With Loki v2.0 there is a new way of alerting: https://grafana.com/docs/loki/latest/alerting/

You can now configure alerting rules directly in Loki and have it send alerts to Alertmanager.

Update:

As requested, a simple example of an alerting rule:

  groups:
  - name: NumberOfErrors
    rules:
    - alert: logs_error_count_kube_system
      # fires when the per-second rate of lines matching "error"/"Error"
      # in the kube-system namespace, measured over 5 minutes, exceeds 5
      expr: rate({namespace="kube-system"} |~ "[Ee]rror" [5m]) > 5
      for: 5m
      labels:
        severity: P4
        Source: Loki
Christian
  • This doesn't really answer the question - the Loki alerting docs don't explain how to make an alert for *every error log*, just metric queries. Have you been able to write such an alerting rule? – Isaac van Bakel Dec 18 '20 at 09:59

For alerting in Loki, add your rule files to the directory specified in the ruler section of your Loki config file.

ruler:
  storage:
    type: local
    local:
      # rule files live here, one subdirectory per tenant
      directory: /etc/loki/rules
  # scratch location where the ruler stores temporary rule files
  rule_path: /tmp/loki/rules-temp
  alertmanager_url: http://alertmanager:9093
  ring:
    kvstore:
      store: inmemory
  enable_api: true
  enable_alertmanager_v2: true

If your configuration looks like the above, add your rule files under /etc/loki/rules/ following the layout <directory>/<tenant id>/<file>.yaml, for example /etc/loki/rules/app/rules1.yaml (here app takes the place of the tenant ID).

For alerting on something like "6 new error messages", you can use count_over_time() or sum(count_over_time()).

If you have labels like job="error" and job="info", and both jobs share a common label app="myapp", then count_over_time over {app="myapp"} returns one count per job, while wrapping it in sum() returns a single total across both jobs; see the sketch below.
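
A minimal sketch of the two query shapes, reusing the app="myapp" label from this example with an illustrative error filter and a 1-minute range:

count_over_time({app="myapp"} |~ "[Ee]rror" [1m])

sum(count_over_time({app="myapp"} |~ "[Ee]rror" [1m]))

The first returns one count per matching stream (one for job="error", one for job="info"); the second collapses them into a single total, which is what the rule below alerts on.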

Sample configuration for rules1.yaml:

groups:
  - name: logs
    rules:
      - alert: ErrorInLogs
        # fires when at least one line matching "error"/"Error" appears within the last minute
        expr: sum(count_over_time({app="myapp"} |~ "[Ee]rror" [1m])) >= 1
        for: 10s
        labels:
          severity: critical
          category: logs
        annotations:
          title: "{{$value}} Errors occurred in application logs"

Here {{$value}} will give the count returned from the expr.

Sahit

You could try using the mtail exporter. mtail "watches" your logs one line at a time, so you can set up a condition that matches error log lines and increments an internal counter whenever one appears; you then scrape that counter with Prometheus and alert on it. A sketch of such a program follows.
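
For illustration only, a minimal mtail program under assumed names (the counter name log_errors and the regex are made up for this sketch, not anything mtail ships with):

# count every log line containing "error", case-insensitively
counter log_errors

/(?i)error/ {
  log_errors++
}

mtail exposes the counter on its HTTP /metrics endpoint for Prometheus to scrape, and you can then write an ordinary Prometheus alerting rule against it.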

bosowski

I had the same question.

Investigating a little, I discovered that Alertmanager just receives alerts and routes them. So if you have a service that can translate Loki searches into calls to the Alertmanager API, you are done, and you probably already have two of them.

I found this thread: https://github.com/grafana/loki/issues/1753

It contains this video: https://www.youtube.com/watch?v=GdgX46KwKqo

Option 1: Using Grafana

The video shows how to create an alert from a search in Grafana. If you add an Alert Notification Channel of type "Prometheus Alertmanager", you'll get it.

So Grafana will fire the alert and Alertmanager will manage it.

Option 2: Using promtail

There is another way: add a promtail pipeline_stage that turns your search into a Prometheus metric, then treat it like any other metric: add a Prometheus alerting rule and let Alertmanager handle it.

You can adapt the example from the thread above:

pipeline_stages:
  - match:
      # only lines matching this selector reach the nested stages
      selector: '{app="promtail"} |= "panic"'
      stages:
        - metrics:
            panic_total:
              type: Counter
              description: "total number of panic"
              config:
                # count every line that reaches this stage
                match_all: true
                action: inc

Promtail will then export that counter as a Prometheus metric, and you can alert on it like any other metric; a hedged rule example follows.
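
For illustration, a Prometheus alerting rule against that counter. The metric name promtail_custom_panic_total is an assumption (promtail prefixes custom metrics with promtail_custom_; check promtail's /metrics endpoint for the exact name in your setup):

groups:
  - name: promtail-log-alerts
    rules:
      - alert: PanicInLogs
        # at least one "panic" line was counted in the last minute
        expr: increase(promtail_custom_panic_total[1m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "{{ $value }} panic lines logged in the last minute"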

MagMax
  • Not an ideal solution, because you can't get the full text of the panic message when Grafana sends an alert. – Serko Oct 01 '20 at 12:20
  • The question was how to get the content of the alert-triggering log entry into the alert message. – Greg Z. Feb 24 '21 at 13:58