4

We're running Docker containers of NiFi 1.6.0 in production and have come across a memory leak.

Once started, the app runs just fine, but after 4-5 days the memory consumption on the host keeps increasing. The NiFi cluster UI shows JVM heap usage at barely around 30%, yet memory usage at the OS level climbs to 80-90%.

On running the docker stats command, we found that the NiFi docker container is the one consuming the memory.
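
Roughly, the checks looked like this (the container name and PID below are placeholders, not our exact setup):

docker stats --no-stream nifi                 # host-level view of the container's memory
ps -o pid,rss,vsz,cmd -p <nifi-jvm-pid>       # RSS of the NiFi JVM process itself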

After collecting the JMX metrics, we found that the RSS memory keeps growing. What could be the potential cause of this? In the JVM tab of the cluster dialog, young GC also seems to be happening in a timely manner, with old GC counts shown as 0.

How do we go about identifying what's causing the RSS memory to grow?

Rahul Bhanushali
  • Do you have a heap dump you've started to look at? Anything in the application source you have reason to suspect is leaking memory? Can you provide an [mcve] with a useful subset of the application source? – David Maze Oct 30 '18 at 10:40
  • In addition to the items listed above, are you using the Apache NiFi provided convenience binary (e.g. https://hub.docker.com/r/apache/nifi) or a custom image? – apiri Oct 30 '18 at 17:06
  • @DavidMaze, I don't have a heap dump right now but I can surely extract one. The docker container is running https://github.com/apache/nifi using its prebuilt processors and ExecuteScript processors. – Rahul Bhanushali Oct 31 '18 at 02:06
  • @apiri We've written our own Dockerfile, which is almost the same as the one on Docker Hub with one change to the entrypoint. – Rahul Bhanushali Oct 31 '18 at 02:08
  • More details concerning your ExecuteScript processors would be of interest. If you are able to share what they are doing, there could be something there. Heap dumps would be helpful, especially taken periodically as the condition decays. – apiri Oct 31 '18 at 02:53
  • Assuming that ExecuteScript could be causing it, I tried deploying a fresh container and used the flow-file generator along with the log-attribute processor. I was able to replicate the issue with that as well. – Rahul Bhanushali Oct 31 '18 at 03:35

2 Answers

3

You need to replicate that in a non-Docker environment, because with Docker the reported memory is known to rise.
As I explained in "Difference between Resident Set Size (RSS) and Java total committed memory (NMT) for a JVM running in Docker container", Docker has some bugs (like issue 10824 and issue 15020) which prevent an accurate report of the memory consumed by a Java process within a Docker container.

That is why a plugin like signalfx/docker-collectd-plugin mentioned (two weeks ago, at the time of writing) in its PR (Pull Request) 35 that it would "deduct the cache figure from the memory usage percentage metric":

Currently the calculation for memory usage of a container/cgroup being returned to SignalFX includes the Linux page cache.
This is generally considered to be incorrect, and may lead people to chase phantom memory leaks in their application.

For a demonstration of why the current calculation is incorrect, you can run the following to see how I/O usage influences the overall memory usage in a cgroup:

docker run --rm -ti alpine                          # start a throw-away Alpine container with a shell
cat /sys/fs/cgroup/memory/memory.stat               # inside the container: baseline cgroup figures
cat /sys/fs/cgroup/memory/memory.usage_in_bytes
dd if=/dev/zero of=/tmp/myfile bs=1M count=100      # write a 100MB file (it lands in the page cache)
cat /sys/fs/cgroup/memory/memory.stat               # usage now includes the cached file
cat /sys/fs/cgroup/memory/memory.usage_in_bytes

You should see that the usage_in_bytes value rises by 100MB just from creating a 100MB file. That file hasn't been loaded into anonymous memory by an application, but because it's now in the page cache, the container memory usage appears higher.
Deducting the cache figure in memory.stat from the usage_in_bytes shows that the genuine use of anonymous memory hasn't risen.

The SignalFx metric now differs from what you see when you run docker stats, which uses the calculation I have here.
It seems like knowing the page cache use for a container could be useful (though I am struggling to think of when), but knowing it as part of an overall percentage usage of the cgroup isn't useful, since it then disguises your actual RSS memory use.
In a garbage-collected application with a max heap size as large as, or larger than, the cgroup memory limit (e.g. the -Xmx parameter for Java, or .NET Core in server mode), the tendency will be for the percentage to get close to 100% and then just hover there, assuming the runtime can see the cgroup memory limit properly.
If you are using the Smart Agent, I would recommend using the docker-container-stats monitor (to which I will make the same modification to exclude cache memory).
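
In practical terms, a minimal sketch of the check that PR describes, run inside the container (cgroup v1 paths assumed; the variable names are just for illustration):

CGROUP=/sys/fs/cgroup/memory
usage=$(cat $CGROUP/memory.usage_in_bytes)                # what docker stats effectively reports
cache=$(awk '/^cache /{print $2}' $CGROUP/memory.stat)    # page cache counted into that figure
echo "usage excluding cache: $((usage - cache))"          # approximates the anonymous (RSS-like) memory

If the "usage excluding cache" figure stays flat while usage_in_bytes keeps climbing, the growth is page cache rather than a leak in the NiFi JVM.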

VonC
  • To test whether the memory consumption stays at 100%, I added the --memory parameter to the docker run command. The memory consumption does hover around 100%; however, I now see swap memory being used and the CPU load average has increased. Is moving the application out of the container the best option here? – Rahul Bhanushali Dec 01 '18 at 19:56
0

Yes, NiFi in Docker has memory issues: memory usage shoots up after a while and the container restarts on its own. The non-Docker deployment, on the other hand, works absolutely fine.

Details: Docker: run it with a 3 GB heap size and immediately after startup it consumes around 2 GB. Run some processors and the machine's fan spins up heavily; NiFi restarts after a while.

Non-Docker: run it with a 3 GB heap size and it takes about 900 MB and runs smoothly (observed via jconsole).
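
A minimal sketch of how such a comparison can be reproduced, assuming the heap is set in NiFi's conf/bootstrap.conf and the JVM PID is known (illustrative values, not the exact setup used here):

# conf/bootstrap.conf (same settings inside and outside Docker)
#   java.arg.2=-Xms3g
#   java.arg.3=-Xmx3g

jstat -gcutil <nifi-jvm-pid> 5000    # heap occupancy and GC activity every 5 seconds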

Maqbool Ahmed