I have a Clojure program that consumes a large amount of heap while running (I once measured it at around 2.8 GiB), and I'm trying to reduce its memory footprint. My current plan is to force garbage collection every so often, but I'm wondering if this is a good idea. I've read "How to force garbage collection in Java?" and "Can I Force Garbage Collection in Java?" and understand how to do it: just call (System/gc). What I don't know is whether it's worth doing, or even whether it's needed.
Here's how the program works. I have a large number of documents in a legacy format that I'm trying to convert to HTML. The legacy format consists of several XML files: a metadata file that describes the document, and contains links to any number of content files (usually one, but it can be several — for example, some documents have "main" content and footnotes in separate files). The conversion takes anywhere from a few milliseconds for the smallest documents, to about 58 seconds for the largest document. Basically, I'm writing a glorified XSLT processor, though in a much nicer language than XSLT.
My current (rather naïve) approach, written when I was just starting out in Clojure, builds a list of all the metadata files, then does the following:
    (let [parsed-trees (map parse metadata-files)]
      (dorun (map work-func parsed-trees)))
work-func converts the files to HTML and writes the result to disk, returning nil. (I was trying to throw away the parsed XML tree for each document, which is quite large, after each pass through a single document.) I now realize that although map is lazy and dorun throws away the head of the sequence it's iterating over, I was still holding onto the head of the seq in parsed-trees, which is why that didn't work.
My new plan is to move the parsing into work-func, so that it will look like:
    (defn work-func [metadata-filename]
      ;; Parse, convert, and write one document
      (-> metadata-filename
          e/parse
          xml-to-html
          write-html-file)
      ;; Ask the JVM to collect the now-unreferenced XML tree
      (System/gc))
Then I can call work-func with map, or possibly pmap since I have two dual-core CPUs, and hopefully throw away the large XML trees after each document is processed.
My question, though, is: is it a good idea to tell Java "please clean up after me" so often? Or should I skip the (System/gc) call in work-func and let the Java garbage collector run when it feels the need to? My gut says to keep the call in, because I know (as Java can't) that at that point in work-func there will be a large amount of data on the heap that can be discarded, but I would welcome input from more experienced Java and/or Clojure coders.
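One way I could check whether the explicit call makes any difference is to log heap use around it. This is only a rough sketch: used-heap-mib and with-heap-report are names I made up for the example, the figures from Runtime are approximate, and (System/gc) is only a request that the JVM is free to ignore:

```clojure
(defn used-heap-mib
  "Approximate heap currently in use, in MiB (total minus free)."
  []
  (let [rt (Runtime/getRuntime)]
    (/ (- (.totalMemory rt) (.freeMemory rt)) 1048576.0)))

(defn with-heap-report
  "Calls (f arg), then prints approximate heap use before and after
  requesting a collection. Returns whatever f returned."
  [f arg]
  (let [result (f arg)
        before (used-heap-mib)]
    (System/gc) ; a request, not a guarantee
    (println (format "%s: %.1f MiB before gc, %.1f MiB after"
                     arg before (used-heap-mib)))
    result))
```

Calling something like (with-heap-report work-func "some-metadata.xml") on a few large documents (the file name is a placeholder) would show how much each explicit collection reclaims, which could then be compared with a run that has no (System/gc) call at all.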