Debugging sidekiq worker memory leak with jemalloc

Question

So, I've got a memory leak in my Sidekiq worker. I have a worker server with one queue for this worker task only, that gets to about 10G RSS in a week.

I tried to reproduce it locally with only 1 worker thread and voila - I get from 200M to 1G in one night, processing 1 task/min. Naturally, I want to know what's leaking, so I'm also logging RSS, heap_live_slots and heap_free_slots. When I plot results I can see steady RSS growth while live and free slots fluctuate randomly but in well defined and constant boundaries, while their sum remains constant.

At this point I come to the conclusion that the leak likely occurs not in Ruby code, but rather in some native extension. So I reinstall ruby with Jemalloc support via RVM: rvm reinstall 2.4.2 --with-jemalloc

Then I set up MALLOC_CONF:

export MALLOC_CONF='prof_leak:true,lg_prof_sample:0,prof_final:true,stats_print:true'

And fire up Sidekiq. Freshly started Sidekiq with 1 worker thread is worth about 200M RSS, but when I press Ctrl+C and look at stats output of jemalloc I see something completely different:

Arenas: 32
Quantum size: 16
Page size: 4096
Maximum thread-cached size class: 32768
Allocated: 34056, active: 61440, metadata: 2949272, resident: 2981888, mapped: 6352896, retained: 2035712

What? 6M mapped? This cannot be true. So I start irb and do the following:

2.4.2 :001 > arr = []
 => [] 
2.4.2 :002 > loop do
2.4.2 :003 >   arr << 'a'*10000000
2.4.2 :004?>   sleep 1
2.4.2 :005?> end

After waiting until irb process climbs to about 1G RSS I stop the process... and see exactly same numbers. Maybe visualizing the call graph will help me understand what's going on?

jeprof --show_bytes --pdf `which ruby` jeprof.10536.0.f.heap > ruby.pdf

Using local file /home/mhi/.rvm/rubies/ruby-2.4.2/bin/ruby.
Using local file jeprof.10536.0.f.heap.
No nodes to print

So something is clearly wrong, and that's what I need help figuring out.

Here's full output from jemalloc stat: https://pastebin.com/RiMLtqA6

UPD.

So I've updated all the native-extension related gems, here's output of bundle exec ruby -e 'puts Gem.loaded_specs.values.select{ |i| !i.extensions.empty? }.map{ |i| "#{i.name} #{i.version}" }':

io-console 0.4.6
nokogiri 1.8.1
bcrypt 3.1.11
debug_inspector 0.0.3
binding_of_caller 0.7.2
json 2.1.0
capybara-webkit 1.14.0
damerau-levenshtein 1.3.0
unf_ext 0.0.7.4
eventmachine 1.2.5
ffi 1.9.18
kgio 2.11.0
msgpack 1.1.0
mysql2 0.4.9
rainbow 2.2.2
raindrops 0.18.0
rbtrace 0.4.8
stackprof 0.2.10
therubyracer 0.12.3
unicode 0.4.4.4
unicorn 5.3.0

Same result: RSS, Memory slots

We need the output of this: `bundle exec ruby -e 'puts Gem.loaded_specs.values.select{ |i| !i.extensions.empty? }.map{ |i| "#{i.name} #{i.version}" }'` — Mike Perham, Nov 10 '17 at 18:15
You are on a recent Ruby. Make sure you are using the latest versions of all native extensions. — Mike Perham, Nov 10 '17 at 18:16
@MikePerham Ok, yeah, there's a couple of outdated gems with native extensions, I'll update them and see if it helps with the situation. Thanks for advice! — Roman, Nov 13 '17 at 13:00
This contains the vast sum of my knowledge https://github.com/mperham/sidekiq/wiki/Problems-and-Troubleshooting — Mike Perham, Nov 14 '17 at 17:11
Use an application performance monitoring tool like New Relic (which has first-class support for [instrumenting Sidekiq workers](https://docs.newrelic.com/docs/agents/ruby-agent/background-jobs/sidekiq-instrumentation)) — anothermh, Nov 14 '17 at 18:55
Notice that [Ruby 2.4.2 has a known issue with `jemalloc`](https://bugs.ruby-lang.org/issues/13899)... the issue was closed about a month ago, but I don't know if the package you're using is patched. Actually, I don't think the patch was released yet. It's likely all the `jemalloc` stat data is irrelevant. — Myst, Nov 18 '17 at 03:29
Could you post a small minimal example of the task, so we could try this at home? — Myst, Nov 20 '17 at 15:12

Myst · Answer 1 · 2017-11-18T18:06:48.843

2

Ruby 2.4.2 has a known issue with jemalloc.

the issue was closed about a month ago, but I don't know if the package you're using is patched... Actually, I don't think the patch was released yet. It's likely all the jemalloc stat data is irrelevant.

Also, this seems like an X-Y question (you're asking about jemalloc, while you probably want a solution to the memory "leak").

Before assuming a memory leak in native code (though a distinct possibility), I would consider how the task's scope might affect the garbage collector and try to minimize both the scope and any long-life variables.

For example, if your task is a Proc, it might be bound to the global scope, which means that some variables might live forever...

Try enclosing the task in a function and make sure that none of the variables are referenced once the task is done.

edited Nov 18 '17 at 18:06

answered Nov 18 '17 at 03:34

Myst

18,516
2
45
67

Tried it with 2.3.0 - got same results. What makes me think this is a native code issue is the fact that live objects count in Ruby heap does not increase over time. – Roman Nov 20 '17 at 14:51
@Roman , did you try using Ruby 2.4.2 with the updated `configure` file that closed the bug report? This should allow you to use `jemalloc`. – Myst Nov 20 '17 at 15:15
@Roman , please note that having the same number of objects might not be an absolute indication. For example, if you have an ever growing String, it's just one object that might lives forever and constantly grow. Numbers aren't always objects either, so an array of numbers can be another object that can grow to a frightening size. – Myst Nov 20 '17 at 15:22

Debugging sidekiq worker memory leak with jemalloc

1 Answers1