
We have an Erlang/Elixir application (on Erlang 18/ERTS 7.3.1) that processes large JSON payloads.

Here's a typical workflow:

  1. A listener gets a token from RabbitMQ and sends it to a GenServer.

  2. The GenServer puts the token into an ETS table with a future time (current + n secs). A scheduled job in the GenServer picks up expired tokens from ETS and launches several short-lived processes with these tokens.

  3. These short-lived processes download 30-50k JSON payloads from Elasticsearch (using hackney), process them, then upload the results back to Elasticsearch, after which the process dies immediately. We keep track of these processes and have confirmed they die. We process 5-10 of these requests per second. (A rough sketch of this flow follows the list.)
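For illustration, the scheduling loop in steps 2-3 is shaped roughly like the sketch below. This is a minimal illustration with made-up names (TokenScheduler, the :tokens table, the interval, and process_token/1 are not our real code):

    defmodule TokenScheduler do
      use GenServer

      @sweep_interval 1_000

      def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)

      def init(:ok) do
        table = :ets.new(:tokens, [:set, :named_table, :public])
        Process.send_after(self(), :sweep, @sweep_interval)
        {:ok, %{table: table}}
      end

      # Step 2: store the token with a due time (now + n seconds).
      def handle_cast({:schedule, token, n_secs}, state) do
        :ets.insert(state.table, {token, :os.system_time(:seconds) + n_secs})
        {:noreply, state}
      end

      # Scheduled job: pick up expired tokens and launch short-lived workers.
      def handle_info(:sweep, state) do
        now = :os.system_time(:seconds)

        for {token, due_at} <- :ets.tab2list(state.table), due_at <= now do
          :ets.delete(state.table, token)
          # Step 3: download, process, upload, then the worker exits.
          spawn(fn -> process_token(token) end)
        end

        Process.send_after(self(), :sweep, @sweep_interval)
        {:noreply, state}
      end

      # Placeholder for the real worker logic (hackney download, Poison decode, upload).
      defp process_token(_token), do: :ok
    end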

The Problem: We see an ever-growing binary space, and within 48 hours this grows to a couple of GB (seen via observer and debug prints). A manual GC also has no impact.

We already added "recon" and ran recon:bin_leak; however, this only frees up a few KB and has no impact on the ever-growing binary space.
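The diagnostics were along these lines (illustrative commands from an attached IEx shell, not an exact transcript):

    # Total bytes currently allocated for binaries.
    :erlang.memory(:binary)

    # Force a full-sweep GC on every process, then re-check the figure above.
    for pid <- Process.list(), do: :erlang.garbage_collect(pid)

    # recon's helper: GCs all processes and reports the 10 that shed
    # the most refc binaries.
    :recon.bin_leak(10)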

Stack: Erlang 18/ERTS 7.3.1, Elixir 1.3.4, hackney 1.4.4, poison 2.2.0, timex 3.1.13, etc. None of these apps is holding the memory either.

Has anyone come across a similar issue in the past? Would appreciate any solutions.

Update 9/15/2017:

We updated our app to Erlang 19/ERTS 8.3 and the hackney and poison libs to the latest versions; still no progress. Here's a log from a GenServer which periodically sends a message to itself, using spawn/receive or send_after. At each handle_info, it looks up an ETS table and, if it finds any "eligible" entries, spawns new processes; if not, it just returns {:noreply, state}. We print the VM's binary space info (in KB) at entry to the function; the log is listed below. This is a "quiet" time of the day. You can see the gradual increase of the binary space. Once again, :recon.bin_leak(N) or :erlang.garbage_collect() had no impact on this growth.

11:40:19.896 [warn] binary 1: 3544.1328125

11:40:24.897 [warn] binary 1: 3541.9609375

11:40:29.901 [warn] binary 1: 3541.9765625

11:40:34.903 [warn] binary 1: 3546.2109375

--- some processing ---

12:00:47.307 [warn] binary 1: 7517.515625

--- some processing ---

12:20:38.033 [warn] binary 1: 15002.1328125
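(The "binary 1" number above is just the VM's binary allocation converted to KB; the call at the top of handle_info is essentially the line below - the label is arbitrary.)

    require Logger
    # :erlang.memory(:binary) returns the total bytes currently allocated for binaries.
    Logger.warn("binary 1: #{:erlang.memory(:binary) / 1024}")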

We never had a situation like this in our old Scala/Akka app, which processes 30x the volume and has run for years without an issue or restarts. I wrote both apps.

adean
  • If neither GC nor recon frees the memory, this is not what is usually referred to as a "binary leak"; it seems there's something else going on. Do you persist any part of the payloads you receive? Something similar could happen if you store a sub-binary of the payload somewhere - maybe in logs or another reporting system. – michalmuskala Sep 13 '17 at 16:07
  • Thanks for the clarification. We use values from the larger payload (JSON -> Dict) to construct a different payload, then discard the original payload. We send the resulting payload to Elasticsearch. The entire process that handles this terminates. We do persist the rabbit event to a DB, but it's pretty small. – adean Sep 13 '17 at 19:30
  • Don't you convert tokens _to atoms_ at some point? It could be the memory allocated for atoms, which is never freed back to the OS. Also, if it runs indefinitely, the VM will eventually crash, but you may not have reached that point yet. – Aleksei Matiushkin Sep 14 '17 at 08:17
  • You can try this: after you parse the JSON, transform the structure so that every binary is copied with `binary:copy/1` (a sketch of this follows the comments). You will see a performance drop, of course, but observe whether the leak is still there. If it disappears, you know the problem is keeping sub-binaries of the original JSON, and you can investigate further. It's a good test for your hypothesis. – Hynek -Pichi- Vychodil Sep 14 '17 at 09:24
  • The JSON -> Dict conversion is handled by Poison, which I guess uses Jiffy. The point is, the entire process which retrieves the binary and does all the conversion dies. I get what you are saying, but I don't understand how copying every key and value within the dict would make any difference - would it? I just updated the question with more info. – adean Sep 15 '17 at 12:27
  • @adean He is saying that because any binary over a certain size (I forget exactly how large, but not so big - 64 bytes or so) is *referenced* instead of *copied* between processes. If ProcA receives a binary, extracts a part of it to send to ProcB, and then ProcA dies, *the whole binary is still in memory*, with just that segment *referenced* by ProcB actually being used. That is the reason we have the function `binary:copy/1`. – zxq9 Sep 16 '17 at 00:56
  • Did you do ":erlang.garbage_collect()" from the gen_server, or did you issue a ":erlang.garbage_collect(Pid)"? Unlike on the JVM, garbage_collect() only collects the process you run it from, so if you do it in iex it won't help you debug this. – Joel Borggrén-Franck Sep 16 '17 at 19:20
  • @zxq9 A single process retrieves the binary (via hackney), processes it, and uploads the resulting binary to Elasticsearch. It does not pass it around from process to process. I do understand the ref counting etc. - I've read a lot about it because of this issue :-). – adean Sep 17 '17 at 18:17
  • @joel I did run it by getting all the pids in the system and then running GC on each of them. I believe even recon:bin_leak(n) would do the same. – adean Sep 17 '17 at 18:22
  • I came across https://github.com/benoitc/hackney/issues/378, would this be an issue here? – adean Sep 17 '17 at 18:22
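To make the `binary:copy/1` test from the comments concrete, one possible shape is the sketch below (`DeepCopy.copy/1` is a hypothetical helper, not part of Poison or recon):

    defmodule DeepCopy do
      # Walk a decoded JSON structure and copy every binary with :binary.copy/1,
      # so nothing retains a sub-binary reference into the original payload.
      def copy(bin) when is_binary(bin), do: :binary.copy(bin)

      def copy(map) when is_map(map) do
        for {k, v} <- map, into: %{}, do: {copy(k), copy(v)}
      end

      def copy(list) when is_list(list), do: Enum.map(list, &copy/1)
      def copy(other), do: other
    end

    # Usage (sketch): if the growth disappears with this in place,
    # sub-binaries of the original payload were being retained.
    # payload |> Poison.decode!() |> DeepCopy.copy() |> process()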

1 Answer


We found that the memory leak came from a private reusable library that sends messages to Graylog and uses the function below to compress the data before sending it via gen_udp.

defp compress(data) do
    zip = :zlib.open()
    :zlib.deflateInit(zip)
    output = :zlib.deflate(zip, data, :finish)
    :zlib.deflateEnd(zip)
    :zlib.close(zip)  #<--- was missing, hence the slow memory leak.
    output
end

Had I used term_to_binary(data, [:compressed]) instead, I could have saved myself some headaches.
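For completeness, a safer shape for the zlib version is to wrap the stream in try/after so the port is always closed, even if deflate raises (a sketch, not what we shipped):

    defp compress(data) do
      zip = :zlib.open()

      try do
        :ok = :zlib.deflateInit(zip)
        output = :zlib.deflate(zip, data, :finish)
        :zlib.deflateEnd(zip)
        output
      after
        # Always release the zlib port, even when compression fails.
        :zlib.close(zip)
      end
    end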

Thanks for all the inputs and comments. Much appreciated!

adean