I'm running fluentd in Docker (Alpine image, started via docker-compose) to collect messages from a GELF input.
For the output, I need to send the messages to a third party using a Python SDK, and the output must be synchronous, i.e. only one output script running at a time.
So I wanted to use the exec output plugin to run the Python script:
<match logs.*>
  @type exec
  command python3 /src/output.py
  format json
  <buffer>
    @type file
    path /var/log/buffer
    flush_interval 1s
  </buffer>
</match>
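For reference, with this plugin fluentd invokes the command once per flushed buffer chunk and appends the chunk file's path as the last argument, so my output.py is shaped roughly like the sketch below (the SDK call is a placeholder for the real third-party client):

```python
# output.py for the exec output plugin: invoked once per flushed buffer chunk.
# With "format json" the chunk file contains one JSON record per line, and
# fluentd appends the chunk file's path as the last command-line argument.
import json
import sys


def send_to_third_party(record):
    # Placeholder for the real Python SDK call.
    print(record)


def main(chunk_path):
    with open(chunk_path) as chunk:
        for line in chunk:
            line = line.strip()
            if line:
                send_to_third_party(json.loads(line))


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[-1])
```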
The problem is that this plugin only works asynchronously: it doesn't wait for one command invocation to finish before launching the next, so under high throughput I end up with multiple output processes running at once.
Later I found exec_filter, which is meant for filtering, but I figured I could use it for my scenario since it is synchronous: it runs only one output process, which receives the events via stdin.
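In exec_filter mode the child process is long-lived (num_children defaults to 1) and receives one event per line on stdin, so the script becomes a read loop, roughly like this sketch (the SDK call is again a placeholder; diagnostics go to stderr because fluentd parses the child's stdout as emitted events):

```python
# output.py for exec_filter mode: a single long-lived child process that
# reads one JSON event per line from stdin and forwards it synchronously.
# Note: fluentd parses this process's stdout as new events, so anything
# diagnostic should go to stderr instead.
import json
import sys


def send_to_third_party(record):
    # Placeholder for the real Python SDK call.
    print(record, file=sys.stderr)


def run(stream):
    for line in stream:
        line = line.strip()
        if not line:
            continue
        send_to_third_party(json.loads(line))


if __name__ == "__main__":
    run(sys.stdin)
```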
When I did some performance testing (with a memory limit of 128 MB on the container and swap disabled), the container got an OOM. The problem is that when it restarted after the OOM, it didn't run the output.py script, and the buffer chunks keep failing since there is no output process running. Docker logs, this is the OOM I assume:
2023-04-14 10:45:22 +0000 [warn]: #0 failed to flush the buffer. retry_times=0 next_retry_time=2023-04-14 10:45:24 +0000 chunk="5f949677e6035c6b397762e21f7ae37d" error_class=IOError error="stream closed in another thread"
2023-04-14 10:45:22 +0000 [warn]: #0 /usr/lib/ruby/gems/3.1.0/gems/fluentd-1.16.0/lib/fluent/plugin/buffer/memory_chunk.rb:86:in `write'
2023-04-14 10:45:22 +0000 [warn]: #0 /usr/lib/ruby/gems/3.1.0/gems/fluentd-1.16.0/lib/fluent/plugin/buffer/memory_chunk.rb:86:in `write_to'
2023-04-14 10:45:22 +0000 [warn]: #0 child process exits with error code code=9 status=nil signal=9
And then I get this repeatedly:
2023-04-14 10:45:23 +0000 [warn]: #0 failed to flush the buffer. retry_times=1 next_retry_time=2023-04-14 10:45:26 +0000 chunk="5f949677e6035c6b397762e21f7ae37d" error_class=RuntimeError error="no healthy child processes exist"
2023-04-14 10:45:23 +0000 [warn]: #0 /usr/lib/ruby/gems/3.1.0/gems/fluentd-1.16.0/lib/fluent/plugin/out_exec_filter.rb:282:in `write'
2023-04-14 10:45:23 +0000 [warn]: #0 /usr/lib/ruby/gems/3.1.0/gems/fluentd-1.16.0/lib/fluent/plugin/output.rb:1225:in `try_flush'
2023-04-14 10:45:23 +0000 [warn]: #0 /usr/lib/ruby/gems/3.1.0/gems/fluentd-1.16.0/lib/fluent/plugin/output.rb:1538:in `flush_thread_run'
2023-04-14 10:45:23 +0000 [warn]: #0 /usr/lib/ruby/gems/3.1.0/gems/fluentd-1.16.0/lib/fluent/plugin/output.rb:510:in `block (2 levels) in start'
2023-04-14 10:45:23 +0000 [warn]: #0 /usr/lib/ruby/gems/3.1.0/gems/fluentd-1.16.0/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
I didn't find any more logs to help me investigate.
So, two questions:

1. For my scenario (synchronous output), is there a better fit than exec_filter?
2. How can I further investigate why output.py is not running?