
I have a Loki server running on AWS Graviton (arm, 4 vCPU, 8 GiB) configured as follows:

```yaml
common:
  replication_factor: 1
  ring:
    kvstore:
      store: etcd
      etcd:
        endpoints: ['127.0.0.1:2379']

storage_config:
  boltdb_shipper:
    active_index_directory: /opt/loki/index
    cache_location: /opt/loki/index_cache
    shared_store: s3

  aws:
    s3: s3://ap-south-1/bucket-name

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h # 7d
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  per_stream_rate_limit: 8MB
  
ingester:
  lifecycler:
    join_after: 30s
  chunk_block_size: 10485760

compactor:
  working_directory: /opt/loki/compactor
  shared_store: s3
  compaction_interval: 5m

schema_config:
  configs:
    - from: 2022-01-01
      store: boltdb-shipper
      object_store: s3
      schema: v11
      index:
        prefix: loki_
        period: 24h

table_manager:
  retention_period: 360h #15d
  retention_deletes_enabled: true
  index_tables_provisioning: # unused
    provisioned_write_throughput: 500
    provisioned_read_throughput: 100
    inactive_write_throughput: 1
    inactive_read_throughput: 100
```

Ingestion is working fine, and I'm able to query logs over long durations from streams with smaller data volumes. I'm also able to query short durations of logs from streams with TiBs of data.

I see the following errors in Loki when I try to query 24h of data from a large stream, and Grafana times out after 5 minutes:

```
Feb 11 08:27:32 loki-01 loki[19490]: level=error ts=2022-02-11T08:27:32.186137309Z caller=retry.go:73 org_id=fake msg="error processing request" try=2 err="context canceled"
Feb 11 08:27:32 loki-01 loki[19490]: level=info ts=2022-02-11T08:27:32.186304708Z caller=metrics.go:92 org_id=fake latency=fast query="{filename=\"/var/log/server.log\",host=\"web-199\",ip=\"192.168.20.239\",name=\"web\"} |= \"attachDriver\"" query_type=filter range_type=range length=24h0m0s step=1m0s duration=0s status=499 limit=1000 returned_lines=0 throughput=0B total_bytes=0B
Feb 11 08:27:32 loki-01 loki[19490]: level=info ts=2022-02-11T08:27:32.23882892Z caller=metrics.go:92 org_id=fake latency=slow query="{filename=\"/var/log/server.log\",host=\"web-199\",ip=\"192.168.20.239\",name=\"web\"} |= \"attachDriver\"" query_type=filter range_type=range length=24h0m0s step=1m0s duration=59.813829694s status=400 limit=1000 returned_lines=153 throughput=326MB total_bytes=20GB
Feb 11 08:27:32 loki-01 loki[19490]: level=error ts=2022-02-11T08:27:32.238959314Z caller=scheduler_processor.go:199 org_id=fake msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = context canceled" frontend=192.168.5.138:9095
Feb 11 08:27:32 loki-01 loki[19490]: level=error ts=2022-02-11T08:27:32.23898877Z caller=scheduler_processor.go:154 org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=192.168.5.138:9095
```

Query: `{filename="/var/log/server.log",host="web-199",ip="192.168.20.239",name="web"} |= "attachDriver"`

Is there a way to stream the results instead of waiting for the response? Can I optimize Loki to process such queries better?

Tanmay
  • Same error here. I can return 24h of data with this query: `sum by (request_http_host) (rate({env="qa"} |= "response_status" |~ "5.." [1m]))`, just not when using the `json` filter, but a longer time period fails with this error: `level=error ts=2022-02-21T11:08:08.143775302Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"` – kiba Feb 21 '22 at 11:12
  • The `split_queries_by_interval` configuration at https://grafana.com/docs/loki/latest/configuration/ solved this issue for me. Loki was unable to start when I added this option to the configuration file for some reason, so I added it to my systemd unit file by changing `ExecStart` as follows: `ExecStart=/usr/local/bin/loki -config.file /etc/loki/loki.yml -querier.split-queries-by-interval 24h`. My Loki responses are also now much faster after adding this. – Ashley Kleynhans Dec 28 '22 at 14:02
  • Setting `split_queries_by_interval` to 24h did not solve the problem for me. – mac13k Jul 28 '23 at 13:09
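For reference, a minimal sketch of where the `split_queries_by_interval` option from the comments above lives in the config file. This assumes a Loki 2.4.x release (matching the question's timeframe), where the option sits under the `query_range` block; from Loki 2.5 onwards it moved under `limits_config`, which may explain the startup failure mentioned above. The values shown are illustrative placeholders, not tuned recommendations:

```yaml
# Sketch only: placement assumes Loki 2.4.x. From Loki 2.5 onwards,
# split_queries_by_interval belongs under limits_config instead.
query_range:
  # Break each long range query into 1h sub-queries that can be
  # processed in parallel, instead of one 24h scan.
  split_queries_by_interval: 1h

querier:
  # How many sub-queries a single querier processes concurrently;
  # kept small here to match the 4 vCPU host from the question.
  max_concurrent: 4
```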

1 Answer


Grafana Loki may work slowly when querying large log streams, since it must scan every log message in the stream to find the lines containing the requested substring. This issue can be addressed in the following ways:

  • By storing Loki data on faster disks with higher read bandwidth. This helps if query performance is limited by disk read speed.
  • By running Loki on hosts with more RAM, so more data can be served from the operating system page cache, i.e. from fast RAM instead of slow disk.
  • By running Loki on hosts with a higher number of CPU cores, if query performance is CPU-bound.
  • By manually splitting a query over a big time range into multiple queries over smaller time ranges (see the sketch after this list).
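A rough sketch of that manual splitting, using Loki's standard `/loki/api/v1/query_range` HTTP endpoint; the host, the port (Loki's default 3100), the one-hour window and the limit are assumptions to adapt:

```sh
#!/bin/sh
# Split a 24h filter query into 24 one-hour sub-queries, so each
# request scans only a slice of the stream. start/end are Unix
# nanoseconds, hence the "000000000" appended to second timestamps.
QUERY='{filename="/var/log/server.log",host="web-199"} |= "attachDriver"'
END=$(date +%s)
for _ in $(seq 1 24); do
  START=$((END - 3600))
  curl -sG 'http://localhost:3100/loki/api/v1/query_range' \
    --data-urlencode "query=${QUERY}" \
    --data-urlencode "start=${START}000000000" \
    --data-urlencode "end=${END}000000000" \
    --data-urlencode 'limit=1000'
  END=$START
done
```

The `split_queries_by_interval` option discussed in the comments makes the Loki query frontend perform the same splitting automatically.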

P.S. There is an alternative log database which may provide much faster query performance over large log streams: VictoriaLogs (I work on it). It also provides response streaming and good integration with command-line tools for log analysis and debugging such as `head`, `less`, `grep` and `awk`. See these docs.
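For illustration, a hedged sketch of that streaming workflow, assuming a VictoriaLogs instance on its default port 9428; because the `/select/logsql/query` endpoint streams matches as they are found, tools like `head` can terminate the query early:

```sh
# Stream all log lines containing "attachDriver" and stop after the
# first 10 matches; the query ends once head closes the pipe.
curl -s 'http://localhost:9428/select/logsql/query' \
    --data-urlencode 'query=attachDriver' | head -n 10
```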

valyala
  • Very generic advice, and not true when Loki uses the S3 storage backend: in that case query performance depends heavily on the throughput of the storage backend, regardless of the CPU and RAM available to the Loki hosts. – mac13k Jul 28 '23 at 13:12