
I have a Clojure program that is consuming a large amount of heap while running (I once measured it at somewhere around 2.8GiB), and I'm trying to find a way to reduce its memory footprint. My current plan is to force garbage collection every so often, but I'm wondering if this is a good idea. I've read How to force garbage collection in Java? and Can I Force Garbage Collection in Java? and understand how to do it — just call (System/gc) — but I don't know if it's a good idea, or even if it's needed.

Here's how the program works. I have a large number of documents in a legacy format that I'm trying to convert to HTML. The legacy format consists of several XML files: a metadata file that describes the document and contains links to any number of content files (usually one, but it can be several — for example, some documents have "main" content and footnotes in separate files). The conversion takes anywhere from a few milliseconds for the smallest documents to about 58 seconds for the largest. Basically, I'm writing a glorified XSLT processor, though in a much nicer language than XSLT.

My current (rather naïve) approach, written when I was just starting out in Clojure, builds a list of all the metadata files, then does the following:

(let [parsed-trees (map parse metadata-files)]
  (dorun (map work-func parsed-trees)))

work-func converts the files to HTML and writes the result to disk, returning nil. (I was trying to throw away the parsed XML tree for each document, which is quite large, after each pass through a single document.) I now realize that although map is lazy and dorun throws away the head of the sequence it's iterating over, the fact that I was holding onto the head of the seq in parsed-trees is why I was failing.

My new plan is to move the parsing into work-func, so that it will look like:

(defn work-func [metadata-filename]
  (-> metadata-filename
      e/parse
      xml-to-html
      write-html-file)
  (System/gc))

Then I can call work-func with map, or possibly pmap since I have two dual-core CPUs, and hopefully throw away the large XML trees after each document is processed.
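
In other words, the driving loop becomes something like this (an untested sketch of my plan; `metadata-files` is the seq of metadata filenames I build at startup):

;; Sequential version: work-func now parses, converts, and writes each
;; document itself, so nothing holds onto the parsed trees between files.
(dorun (map work-func metadata-files))

;; Or, to use both cores (pmap is lazy, so dorun is still needed):
(dorun (pmap work-func metadata-files))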

My question, though, is: is it a good idea to be telling Java "please clean up after me" so often? Or should I just skip the (System/gc) call in work-func, and let the Java garbage collector run when it feels the need to? My gut says to keep the call in, because I know (as Java can't) that at that point in work-func, there is going to be a large amount of data on the heap that can be gotten rid of, but I would welcome input from more experienced Java and/or Clojure coders.

rmunn
  • Short answer: If you're *needing* to call `gc` that often, you may need to reorganize your data structure. I recommend retitling your question to be more broad, since you may be approaching the question with blinders on. – chrylis -cautiouslyoptimistic- Feb 26 '14 at 10:31
  • For clarity: `e/parse` in my work function is the XML parser from Enlive, and I'm using Enlive transformations to convert the XML to HTML. – rmunn Feb 26 '14 at 10:36
  • @chrylis - Hmm, you may be right. I seem to really be asking "How can I reduce my code's memory footprint?" rather than "How often should I call System/gc?". I'll update the question to reflect that. – rmunn Feb 26 '14 at 10:37
  • Manually garbage collecting will not help to reduce the memory footprint of your application; you need to solve it at the source; why is the application creating so much garbage to begin with? Is that even a problem, or just a result of the characteristics of your application? Let the JVM manage the garbage collector, it is much better at it than you. – Gimby Feb 26 '14 at 10:38
  • 2
  • `parsed-trees` does not hold onto the head of the seq. This word of caution about holding the head of a seq is a bit dated – it's still true, but this problem is not as frequent as it used to be. The compiler releases unneeded references right after their last usage. So since you don't use `parsed-trees` after the call to `map`, there's no head retention issue. I think the leak is somewhere else. Which libraries are you using? Do you use some memoization? – cgrand Feb 26 '14 at 11:27
  • @cgrand - I'm using your Enlive library, as a matter of fact. The xml-to-html function is basically one big call to (e/at), with about ten different selectors and helper functions. (None of which use `:lockstep` yet, though that's the next optimization I'm going to try.) I used to have some memoization in there as there was one big footnotes file that was shared by a whole series of journals, but recently one of my colleagues updated the XML source files to split that huge footnotes file into one file per article, at which point I removed the memoization as it was no longer helping. – rmunn Feb 26 '14 at 13:52
  • @rmunn that's what I feared :-) Without seeing your code it's hard to tell but maybe it's the (sometimes) memoized state machine behind selectors which goes out of hand. If you can't show the code publicly, drop me a mail. – cgrand Feb 26 '14 at 14:37
  • Are you actually consuming java heap or is your OS just using a bunch of memory during the file read (e.g. to cache)? Similar was observed in [this question](http://stackoverflow.com/questions/21332187/what-will-the-behaviour-of-line-seq-be/21333924#comment32198382_21333924). As with that question, it is not obvious that the finger should be wagged at the let until more is known about what you are doing and where (OS or Java heap) memory is being used. – A. Webb Feb 26 '14 at 14:49
  • @cgrand - actually, by moving the parse-trees step into work-func as I planned, my total memory consumption dropped from a peak of 2.8 GiB to about one gigabyte or so (I measured by running `top` and watching the mem% of my Java process, so that's a very rough estimate). So I don't think it was Enlive leaking; it was something else I was doing. Let me double-check that I was really running `dorun` and not `doall`... I often forget which one holds on to the seq head, and if I was running `doall` that would explain my heap problems, I think. – rmunn Feb 26 '14 at 15:22
  • 2
  • Well, it looks like I was using `doall` after all, not `dorun`. Hence I *was* holding on to the head of the seq, and keeping all 800 or so trees in heap until the **entire document set** was finished processing. Just switching to `dorun` probably saved me a lot of heap, and when you combine that with the other steps I took (moving the parsing inside my work function, and using `pmap` instead of `map`), I've not only saved heap but also time. My runtimes are down from 50-60 minutes to about 15-20 minutes, in very unscientific benchmarks. Thanks for the help, everybody. – rmunn Feb 26 '14 at 15:28
  • 1
  • @rmunn, ok but you should use `doseq` and not `dorun`. – cgrand Feb 26 '14 at 15:56
  • @cgrand What, why? `dorun` is fine here. He could use `doseq` instead, but I don't know of anything wrong with the `dorun`/`map` combo. – amalloy Feb 27 '14 at 02:11
  • @cgrand - What's the difference (besides the surface semantics of `doseq` using bindings that look like `for`) between how `doseq` and `dorun` operate? I've seen this advice before — that if you're calling `(dorun (map f coll))` you should use `doseq` instead — but I don't understand why it makes a difference. – rmunn Feb 27 '14 at 03:06
  • @cgrand - Also, is there `doseq` equivalent of `pmap`? One where I can just drop in a new function, `parallel-doseq` or whatever it's called, and get parallel execution of my work function across the collection? Because right now I'm getting a lot of benefit from `pmap`. – rmunn Feb 27 '14 at 04:47
  • @amalloy @rmunn To me `dorun` conveys the idea that it's ok to perform side effects in functions passed to `map` (and others). `doseq` on the other hand communicates that side-effects are going to occur inside its body and the reader should expect the sequence generation to be pure. This is the social argument and it's the strongest. The technical argument is that the laziness contract in Clojure is intentionally weak (chunked for example) and thus they should not be used for any kind of control flow, hence not for side-effects. – cgrand Feb 27 '14 at 08:59
  • @rmunn re: `pmap`, no `pdoseq` I'm aware of. I would look into `r/fold` over a vector of filenames. – cgrand Feb 27 '14 at 09:03
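
In code, cgrand's two suggestions above would look something like this (a sketch; it assumes `metadata-files` is a vector of filename strings, since r/fold only parallelizes vectors and maps):

;; doseq: the side effects are explicit, and no sequence head is retained.
(doseq [f metadata-files]
  (work-func f))

;; clojure.core.reducers' fold over a vector runs on fork/join threads.
;; The accumulated value is ignored here; only work-func's side effect
;; (writing the HTML file) matters.
(require '[clojure.core.reducers :as r])
(r/fold (fn combinef ([] nil) ([_ _] nil))   ; identity element and combiner
        (fn reducef [_ f] (work-func f))     ; process one filename
        metadata-files)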

1 Answer


Calling System/gc is not a helpful strategy. Assuming for now that you can't reduce the actual memory footprint of your code, what you should aim for is avoiding major GC cycles. That will either happen automatically (the JVM resizes the Young Generation until all your temporary data fits), or you can tune it with explicit JVM options to make the Young Generation exceptionally large.

As long as you keep your short-lived objects from spilling into the Old Generation for lack of space, you'll experience very short GC pauses. You also won't have to worry about explicitly invoking GC: a minor collection happens as soon as the Eden space fills up.
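
For example, with Leiningen you could set the young-generation size via `:jvm-opts` (a minimal sketch; the project name is hypothetical and the sizes are illustrative guesses you'd have to tune for your workload):

;; project.clj -- assuming a Leiningen project. Grow -Xmn until one
;; document's worth of temporary objects fits in the young generation.
(defproject doc-converter "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.5.1"]]
  :jvm-opts ["-Xmx2g"    ; cap on the total heap
             "-Xmn1g"])  ; large young gen, so garbage dies in minor GCs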

Marko Topolnik
  • This helped a lot: I refactored my work function to make sure it didn't hold on to any temporary objects after it was done with each document. And without doing any `(System/gc)` calls, my runtime dropped by a factor of 3, more or less: from 50-60 minutes on average to 15-20 minutes on average. I'm accepting this answer since it helped me, though what really helped me the most was the comments on my question. – rmunn Feb 26 '14 at 15:30
  • If you're interested, I'll release this answer as a Community Wiki and you can edit in the gist of what helped you the most from the comments. – Marko Topolnik Feb 26 '14 at 21:17
  • I've read through the comments and the one I agree with the most is the very last one, from Christophe; in fact, that's what I was going to write myself: don't model the entire thing as a mapping transformation which you only use for the side effect. That's what `doseq` is for. – Marko Topolnik Feb 26 '14 at 21:23
  • What's the difference between `doseq` and `dorun`, besides `doseq`'s `for`-like semantics? What inefficiencies am I creating by using `(dorun (map f coll))` instead of `doseq`? – rmunn Feb 27 '14 at 03:09
  • 1
  • It's not about efficiency, but about abusing sequence transformation for side effects. Each time you catch yourself performing a `map` operation on a sequence without ever considering its result, you know you're doing something wrong. Further, it is wrong to rely on `map` to perform the operation only once, in order, and only at the time you access a sequence member. `doseq` gives you all those guarantees. – Marko Topolnik Feb 27 '14 at 06:31
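
The weak laziness contract mentioned in these comments is easy to demonstrate at the REPL: Clojure sequences are often chunked, so a lazy `map` can run its function far more times than the number of elements you actually consume.

;; range produces a chunked seq: realizing the first element of the
;; mapped seq forces a whole 32-element chunk, printing 0 through 31.
(first (map println (range 100)))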