
Here's a simple case:

let html = `<<some huge html file>>`
var libxmljs = require("libxmljs");

class MyObject{
  constructor(html){
    this.doc = libxmljs.parseHtml(html);
    this.node = this.doc.root()
  }
}

let obj

for(var i = 0; i < 100000; i++){
  obj = new MyObject(html)
  // if I uncomment the next line it works fine
  // obj.node = null
  console.log(i)
}

When I run it the script quickly runs out of memory, apparently because obj.node isn't getting garbage collected properly. How can I make sure that happens without explicitly setting it to null when I think I'm done with it?

pguardiario

2 Answers


The object `.root()` returns seems to be GC'd more readily if you don't store the reference in a class instance. The memory usage still looks fairly leaky, as the full amount of allocated heap is never reclaimed. Node itself seems to use about twice as much memory as lives on the JS heap to account for the native libxml code. Maybe raise an issue on libxmljs, as this quacks like a bug.

Not storing the node on the class instance, but exposing it via a getter, works better.

class MyObject{
  constructor(html){
    this.doc = libxmljs.parseHtml(html)
  }
  get node(){
    return this.doc.root()
  }
}

Using a plain object works better too.

function myObject(html){
  let doc = libxmljs.parseHtml(html)
  let node = doc.root()
  return {
    doc: doc,
    node: node,
  }
}

As an alternative maybe try one of the JS based parsers.

Matt
  • Why isn't Node gc'ing it though? This feels to me like a bug in Node. I feel like having destructors would solve my problem, but the Node team thinks that it doesn't need destructors while my example shows that maybe it does. I'm new to Node though so I'm hoping I'm overlooking something obvious. – pguardiario Oct 20 '17 at 08:27
  • Node would normally, if you do that with plain JS objects it will be gc'd fine. libxmljs is based on the native libxml2 library, which opens up the possibility for more memory problems. A JS based parser might be slower but it's less likely to exhibit issues like this. – Matt Oct 20 '17 at 09:48
  • It's still not clear why node isn't gc'ing my object properly. It sounds like node really needs destructors even though it claims not to. A destructor would solve my problem nicely. – pguardiario Oct 20 '17 at 10:14
  • The idea is you don't need destructors in normal operation though... like I said it looks like a bug in the way the `libxmljs` module keeps references between its native C data and the JS objects. – Matt Oct 20 '17 at 10:29
  • Hmm, your use of "normal operation" is problematic to me. I feel like this is a normal operation and that there should be a simple way to ensure that objects are properly garbage collected. I want to wait for others to weigh in on this, no offense and I thank you for your answer. – pguardiario Oct 20 '17 at 11:30

TL;DR: It's the library, not Node, that is the issue.

Long answer

Here is slightly modified code:

var heapdump = require('heapdump');
const fs = require('fs');
var libxmljs = require("libxmljs");

const content = fs.readFileSync('./html2.htm');
let id = 0;

class MyObject{
  constructor(content){
    this.doc = libxmljs.parseHtml(content);
    this.node = this.doc.root()
  }
}

let obj;

function createObject () {
  obj = new MyObject(content);
}


try {
  for(var i = 0; i < 3000; i++){
    createObject();
    // if I uncomment the next line it works fine
    // obj.node = null
    console.log(i);
    if (i === 50) {
      heapdump.writeSnapshot('/Users/me/3.heapsnapshot');
    }
    if (i === 100) {
      heapdump.writeSnapshot('/Users/me/4.heapsnapshot');
    }
    if (i === 150) {
      heapdump.writeSnapshot('/Users/me/5.heapsnapshot');
    }

  }
  console.log('done');
}
catch(e) {
  console.log(e);
}

Below is the relevant section of the diff between the heapdumps taken in the code (snapshots 3 and 4):

(screenshot: heapdump diff between snapshots 3 and 4)

And it's even clearer when we look at the diff between heapdumps 4 and 5:

(screenshot: heapdump diff between snapshots 4 and 5)

A few things we can conclude from these heapdumps:

  • There is no memory leak in the JS part.
  • The size of the heapdump does not match the size of the process we see in htop/top/Activity Monitor, depending on your OS (12 MB of heapdump versus a few GB of RAM).

Heapdump will only show us memory leaks that are in JS. Since this library wraps C code, heapdump will not capture leaks that live there.

I am not sure how we can capture a dump from that library, or why setting the property to null allows the memory to be freed, but it should be safe to assume that Node's GC is doing everything it can.

Hope this helps

AbhinavD
  • Thanks, that's helpful information. I'm still not sure I agree that node is doing everything it can. – pguardiario Oct 21 '17 at 23:04
  • That's interesting, `node --trace_gc` does track some object retention in the JS heap. The old space grows at about 1/2 the speed of the actual process memory on the OS. `process.memoryUsage()` does show the same, on top of the `external` tracked memory. – Matt Oct 24 '17 at 07:49