
So I have an application that creates around 2000 objects.

For each object, it downloads a web page (a String of approximately 75 kB), builds a DOM (Document Object Model) of the entire HTML tree, and discards the String (it goes out of scope).

It then extracts some text and links from the DOM, and discards the DOM (by setting it to null).

After about 1000 objects (depending on how many other applications I have open, it can be as few as 50) I get an OutOfMemoryError, and in Process Explorer I can see the memory footprint increasing throughout, in logarithmic steps.

I tried inserting a System.gc(); after setting the DOM to null. Memory usage still keeps increasing, but now in steps of approximately 0.5 MB per processed object instead of logarithmic steps. Furthermore, while debugging, whenever I step over System.gc() the footprint increases by this amount, and it stays the same until the instruction pointer reaches the same System.gc() again.

[edit]

I ran the profiler on a dump as suggested in an answer, and found that each of those objects still stores a 150 kB string (75k chars). This totals 242 MB. So the question becomes: how do I keep the substrings without keeping the original string? Apparently the String constructor does this.
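For reference, a minimal sketch of that workaround, assuming a JVM from before Java 7u6, where `substring()` shares the parent string's backing `char[]` (the 75,000-character string here is just a stand-in for the page source):

```java
public class SubstringCopy {
    // Copy a substring so it no longer pins the original's backing array.
    // On pre-7u6 JVMs, substring() alone would keep all 75k chars alive.
    static String compactCopy(String original, int from, int to) {
        return new String(original.substring(from, to));
    }

    public static void main(String[] args) {
        // Stand-in for the ~150 kB (75k-char) page source.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 75000; i++) sb.append('x');
        String page = sb.toString();

        // 'leaky' may share page's char[]; 'compact' allocates a fresh,
        // right-sized array with no reference back to 'page'.
        String leaky = page.substring(0, 10);
        String compact = compactCopy(page, 0, 10);

        System.out.println(leaky.equals(compact)); // prints: true
    }
}
```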

Mark Jeronimus
  • Can you at least post the code that extracts the Strings and stores the substrings? Be sure to intern() any strings that could be duplicates. In most cases, you will never need to explicitly call the `new String(...)` constructor. – rob May 14 '12 at 07:44
  • There is something different about the string obtained from a `StringBuilder`. When doing `intern()` on that string, it just returns the original! When I make a new string, either from a literal or by `new String(string)`, and then do `intern()`, I DO get another object (even if the new and old strings are unique and equal!) – Mark Jeronimus May 14 '12 at 08:12

4 Answers


This looks like a memory leak. My guess would be that you are not closing the HTTP connection, or not cleaning up after parsing the HTML, but that is only a guess. You have two options to diagnose the problem:

  • dump memory on OutOfMemoryError (-XX:+HeapDumpOnOutOfMemoryError) and open the dump in a memory profiler. It will tell you what occupies most of the memory.

  • try removing processing steps one at a time (fetching data via HTTP, parsing HTML, extracting data) and see without which step the memory growth stops. That step is causing the memory leak.

Also, calling System.gc() won't help you, ever.
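If the unclosed-HTTP-connection guess is right, the fix is making sure the stream and connection are released in a `finally` block, so nothing lingers between documents. A hedged sketch (the `fetch` helper, its charset, and the URL handling are illustrative, not the asker's actual code):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class FetchPage {
    // Download a page as a String, guaranteeing cleanup even if
    // reading throws: the reader is closed and the connection released.
    static String fetch(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        BufferedReader in = null;
        try {
            in = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
            return sb.toString();
        } finally {
            if (in != null) in.close(); // release the stream
            conn.disconnect();          // release the connection
        }
    }
}
```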

Tomasz Nurkiewicz

First things first: you cannot force the JVM to do garbage collection; you can only suggest it via the API. Further, setting something to null does not guarantee that ALL references to the object have been removed. My guess is you have forgotten about the String pool. Without seeing any code, these are the assumptions we have to work from. Also, you should look at caching the results instead of discarding them every time, as re-creating them is a colossal waste of time and resources within the JVM.
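A minimal illustration of what the String pool point means in practice: `intern()` maps equal strings onto one canonical instance, so duplicates across 2000 documents are stored once (the URLs here are just placeholders):

```java
public class InternDemo {
    // Two equal strings are the same pooled object after intern().
    static boolean sameAfterIntern(String a, String b) {
        return a.intern() == b.intern();
    }

    public static void main(String[] args) {
        // Built at runtime, so these are two distinct objects...
        String a = new StringBuilder("http://example.com/").append("page").toString();
        String b = new StringBuilder("http://example.com/").append("page").toString();
        System.out.println(a == b);               // prints: false (two copies)
        // ...but interning collapses them into one canonical instance.
        System.out.println(sameAfterIntern(a, b)); // prints: true (pooled)
    }
}
```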

Woot4Moo
  • I don't know what you mean by 'caching resources' in this context. I cannot possibly store all data for every page in memory as it will surely fill my memory. – Mark Jeronimus May 14 '12 at 06:13

One problem when extracting substrings could be that the long original string is still referenced: good if you want to make many substrings from one original, bad if the original is very long and you only need the single substring.

Try making a dump of the memory to see which objects are retained and where they are referenced. A dump can be obtained with -XX:+HeapDumpOnOutOfMemoryError when the memory is full. You can also use jmap -dump:format=b,file=heap.bin to get dumps on demand. With this you can take a dump after each processed document and then compare the dumps in the Eclipse Memory Analyzer Tool (MAT) to see which new objects were created and retained.
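If you would rather trigger the per-document dumps from code than from jmap, HotSpot exposes a diagnostic MXBean for it. A sketch, assuming a HotSpot JVM (this is not a portable API, and the file name is arbitrary):

```java
import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

public class DumpHelper {
    // Write a heap dump to 'path', e.g. after each processed document,
    // so successive dumps can be compared in Eclipse MAT.
    static void dumpHeap(String path) throws Exception {
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(path, true); // true = dump only live objects
    }
}
```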

Roger Lindsjö
  • Ah, I didn't think of that (I should have). I use these substrings in URL objects that I keep, but those string objects ought to be about 75 kB (or 150 kB as chars), right — not ~500 kB. – Mark Jeronimus May 14 '12 at 07:17

There is rarely a good reason to explicitly invoke the garbage collector except for diagnostic purposes.

When you extract Strings from the DOM, be sure to intern() them or implement your own object pooling if another part of your program retains references to anything that directly comes from the DOM.

Use your profiler to confirm that nothing else is retaining references to the DOM or other objects that you think you're throwing away. Also keep in mind that Java's built-in DOM implementation can have about a 5x memory overhead, and make sure your maximum heap size (-Xmx) is large enough.

rob
  • `intern()` does not create a copy. Because the string is unique already, it just returns the original. – Mark Jeronimus May 14 '12 at 07:56
  • The intention isn't to create a copy; it's to merge equivalent Strings so you're only storing one object. – rob May 14 '12 at 08:09