
I wrote a small crawler and found that it was running out of heap space (even though I currently limit the number of URLs in my list to 300).

With the Java Memory Analyzer I found that the main consumer is char[] (45 MB out of 64 MB, or even more if I increase the allowed size; it just grows constantly).

The analyzer also gives me the content of these char[] arrays: they contain the HTML pages that were read by the crawlers.

After some deeper analysis with different settings for -Xmx[...]m, I found that Java uses almost all the space it has available and then runs out of heap as soon as I try to download an image of about 3 MB.

When I give Java 16 MB, it uses 14 MB and fails; when I give it 64 MB, it uses 59 MB and fails when trying to download a large image.

Reading pages is done with this piece of code (edited: added .close()):

private String readPage(Website url) throws CrawlerException {
    StringBuffer sourceCodeBuffer = new StringBuffer();
    try {
        URLConnection con = url.getUrl().openConnection();
        con.setConnectTimeout(2000);
        con.setReadTimeout(2000);

        BufferedReader br = new BufferedReader(new InputStreamReader(con.getInputStream()));
        String strTemp = "";
        try {
            while (null != (strTemp = br.readLine())) {
                sourceCodeBuffer.append(strTemp);
            }
        } finally {
            br.close();
        }
    } catch (IOException e) {
        throw new CrawlerException();
    }

    return sourceCodeBuffer.toString();
}

Another function uses the returned string in a while loop, but to my knowledge the space should be freed as soon as the string is overwritten with the next page.

public void run() {
    boolean stop = false;

    while (stop == false) {
        try {
            Website nextPage = getNextPage();

            String source = visitAndReadPage(nextPage);
            List<Website> links = new LinkExtractor(nextPage).extract(source);
            List<Website> images = new ImageExtractor(nextPage).extract(source);

            // do something with links and images, source is not used anymore
        } catch (CrawlerException e) {
            logger.warning("could not crawl a url");
        }
    }
}

Below is an example of the output the analyzer gives me. When I try to see where these char[] arrays are still referenced, the analyzer cannot tell me, so I assume they are no longer needed and should be garbage collected. Since usage always stays slightly below the maximum heap size, it also seems that Java does garbage collect, but only as much as is necessary to keep the program running right now (not anticipating that large input might be coming).

Also, explicitly calling System.gc() every 5 seconds, or even after setting source = null;, did not help.

The website source code just seems to be kept around for as long as possible.

Am I using something similar to ObjectOutputStream, which forces the strings it has read to be retained forever? Or how can it be that Java keeps these website strings in char[] arrays for so long?

Class Name                                                                                                                                                                                                                                                                                   | Shallow Heap | Retained Heap | Percentage
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
char[60750] @ 0xb02c3ee0  <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><title>Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen</title><link rel="shortcut icon" href="http://img.e-wallp...|      121.512 |       121.512 |      1,06%
char[60716] @ 0xb017c9b8  <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><title>Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen</title><link rel="shortcut icon" href="http://img.e-wallp...|      121.448 |       121.448 |      1,06%
char[60686] @ 0xb01f3c88  <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><title>Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen</title><link rel="shortcut icon" href="http://img.e-wallp...|      121.384 |       121.384 |      1,06%
char[60670] @ 0xb015ec48  <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><title>Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen</title><link rel="shortcut icon" href="http://img.e-wallp...|      121.352 |       121.352 |      1,06%
char[60655] @ 0xb01d5d08  <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><title>Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen</title><link rel="shortcut icon" href="http://img.e-wallp...|      121.328 |       121.328 |      1,06%
char[60651] @ 0xb009d9c0  <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><title>Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen</title><link rel="shortcut icon" href="http://img.e-wallp...|      121.320 |       121.320 |      1,06%
char[60637] @ 0xb022f418  <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><title>Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen</title><link rel="shortcut icon" href="http://img.e-wallp...|      121.288 |       121.288 |      1,06%

Edit

After testing with even more memory, I found the following occurrence of a URL in the dominator tree:

Class Name                                                                                                                                                                                                                                                                                              | Shallow Heap | Retained Heap | Percentage

crawling.Website @ 0xa8d28cb0                                                                                                                                                                                                                                                                           |           16 |       759.776 |      0,15%
|- java.net.URL @ 0xa8d289c0  https://www.google.com/recaptcha/api/image?c=03AHJ_VuuT4CmbxjAoKzWEKOqLaTCyhT-89l3WOeVjekKWW81tdZsnCvpIrQ52aLTw92rP-EUP9ThnzwBwHcRLXG6A0Bpwu11cGttRAUtarmWXhdcTVRoUMLNnJNZeuuA7LedgfTou76nl8ULyuIR3tgo7_lQ21tzzBhpaTSqwYHWyuZGfuRK3z9pgmqRqvI7gE4_4lexjYbkpd62kN...       |           56 |       759.736 |      0,15%
|  |- char[379486] @ 0xa8c6f4f8  <!DOCTYPE html><html lang="en">  <head>  <meta charset="utf-8">  <meta http-equiv="X-UA-Compatible" content="IE=EmulateIE9">  <title>Google Accounts</title><style type="text/css">  html, body, div, h1, h2, h3, h4, h5, h6, p, img, dl,  dt, dd, ol, ul, li, t...    |      758.984 |       758.984 |      0,15%
|  |- java.lang.String @ 0xa8d28a40  /recaptcha/api/image?c=03AHJ_VuuT4CmbxjAoKzWEKOqLaTCyhT-89l3WOeVjekKWW81tdZsnCvpIrQ52aLTw92rP-EUP9ThnzwBwHcRLXG6A0Bpwu11cGttRAUtarmWXhdcTVRoUMLNnJNZeuuA7LedgfTou76nl8ULyuIR3tgo7_lQ21tzzBhpaTSqwYHWyuZGfuRK3z9pgmqRqvI7gE4_4lexjYbkpd62kNBZ7UIDccO5bx6TqFpf-7Sl...|           24 |           624 |      0,00%
|  |  '- char[293] @ 0xa8d28a58  /recaptcha/api/image?c=03AHJ_VuuT4CmbxjAoKzWEKOqLaTCyhT-89l3WOeVjekKWW81tdZsnCvpIrQ52aLTw92rP-EUP9ThnzwBwHcRLXG6A0Bpwu11cGttRAUtarmWXhdcTVRoUMLNnJNZeuuA7LedgfTou76nl8ULyuIR3tgo7_lQ21tzzBhpaTSqwYHWyuZGfuRK3z9pgmqRqvI7gE4_4lexjYbkpd62kNBZ7UIDccO5bx6TqFpf-7Sl...    |          600 |           600 |      0,00%
|  |- java.lang.String @ 0xa8d289f8  c=03AHJ_VuuT4CmbxjAoKzWEKOqLaTCyhT-89l3WOeVjekKWW81tdZsnCvpIrQ52aLTw92rP-EUP9ThnzwBwHcRLXG6A0Bpwu11cGttRAUtarmWXhdcTVRoUMLNnJNZeuuA7LedgfTou76nl8ULyuIR3tgo7_lQ21tzzBhpaTSqwYHWyuZGfuRK3z9pgmqRqvI7gE4_4lexjYbkpd62kNBZ7UIDccO5bx6TqFpf-7Sl6YmMgFC77kWZR7vvZIPkS...|           24 |            24 |      0,00%
|  |- java.lang.String @ 0xa8d28a10  www.google.com                                                                                                                                                                                                                                                     |           24 |            24 |      0,00%
|  |- java.lang.String @ 0xa8d28a28  /recaptcha/api/image                                                                                                                                                                                                                                               |           24 |            24 |      0,00%

From the indentation I am really wondering: why is the HTML source part of java.net.URL? Does this come from the URLConnection I opened?

aufziehvogel

  • First point: calling a `StringBuffer` variable `string` is *really* confusing... – Jon Skeet Jul 19 '12 at 16:07
  • Sorry, will change it immediately. – aufziehvogel Jul 19 '12 at 16:08
  • Java's memory management is pretty retarded sometimes. – Wug Jul 19 '12 at 16:08
  • Do you use `source.substring()` or regular expressions (they use it to return group matches) in `LinkExtractor` or `ImageExtractor`? `substring()` doesn't create new strings but just views, keeping the whole character array in memory. – pingw33n Jul 19 '12 at 16:19
  • @Aufziehvogel List the path to GC root of one or more of those char arrays. Either (1) you have too many workers going at the same time, or (2) the arrays aren't being released from their GC root and you have a memory leak. – John Vint Jul 19 '12 at 16:24
  • @pingw33n I use regular expressions in `LinkExtractor` and `ImageExtractor`. So is this bad, because it creates new strings? – aufziehvogel Jul 19 '12 at 17:19
  • @JohnVint I uploaded the "Path to gc-root" output for one such occurrence here: http://pastebin.com/Cx93QZUS Does this tell you anything? I also included some new information (found after allowing 512 MB) in the post. – aufziehvogel Jul 19 '12 at 17:21
  • Btw, it's correct that the `HashMap` has 76 MB, I know that (and will correct it). The bigger problem I cannot solve is the 400 MB of `char[]`. – aufziehvogel Jul 19 '12 at 17:33
  • @Aufziehvogel Can you link a few more of the char[] path to gc root? May help to see others – John Vint Jul 19 '12 at 18:27
  • FYI: With modern JVM GC capabilities, calling `System.gc()` really is not necessary. I understand that you have a memory problem but it is most likely not helping. – Gray Jul 19 '12 at 21:28

6 Answers


I would first try closing the readers, and the URL connection as well, at the end of the readPage method. It is best to put this logic in a finally clause.

Connections that are kept open will use memory, and depending on the internals, the GC might not be able to reclaim it even if you no longer reference the connection in your code.

Update (based on comments): the connection itself has no close() method and will be closed when all readers attached to it are closed.
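
As a rough sketch of what that could look like (reusing the `Website` and `CrawlerException` types from the question; the `HttpURLConnection` cast and `disconnect()` call are just one possible way to release the connection explicitly, as discussed in the comments below, not something the `URLConnection` API requires):

// Sketch only: readPage restructured so cleanup always runs in a finally clause.
private String readPage(Website url) throws CrawlerException {
    StringBuilder sourceCode = new StringBuilder();
    BufferedReader br = null;
    URLConnection con = null;
    try {
        con = url.getUrl().openConnection();
        con.setConnectTimeout(2000);
        con.setReadTimeout(2000);

        br = new BufferedReader(new InputStreamReader(con.getInputStream()));
        String line;
        while ((line = br.readLine()) != null) {
            sourceCode.append(line);
        }
        return sourceCode.toString();
    } catch (IOException e) {
        throw new CrawlerException();
    } finally {
        if (br != null) {
            try {
                br.close(); // also releases the underlying input stream
            } catch (IOException ignored) {
                // nothing sensible to do here
            }
        }
        if (con instanceof HttpURLConnection) {
            ((HttpURLConnection) con).disconnect(); // frees the socket for HTTP connections
        }
    }
}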

Attila

  • +1. I was just about to post the same thing. Anytime you have an OOME, the first thing you should look for is anything with a `close()` method that isn't getting its `close()` method called when you're done with it. – Daniel Pryden Jul 19 '12 at 16:17
  • Certainly right for the readers, but the [`URLConnection`](http://docs.oracle.com/javase/7/docs/api/index.html) has no `close()` method, it is instead automatically closed after [all readers are closed](http://stackoverflow.com/questions/272910/in-java-when-does-a-url-connection-close). – Jan Gerlinger Jul 19 '12 at 16:32
  • I added a `close()` command to my code above (first code fragment). But there still seem to be references from `java.net.URL` instances to the HTML sources. Have I done something wrong? When I saw that the Memory Analyzer found a dependency between `java.net.URL` and the many `char[]` arrays, I immediately thought it might be because of the `Url.openConnection()`, but these connections are still there (and memory is still going up). – aufziehvogel Jul 19 '12 at 18:05
  • @Aufziehvogel - if it's the connection, you could try James Schek's suggestion on the SO thread listed by Jan (try casting to `HttpURLConnection` and calling `disconnect()` on that). Otherwise you might be storing all the websites' contents in a `static` container and never purging it -- the latter I cannot tell from the code you posted. – Attila Jul 19 '12 at 18:28
  • @Attila: Already tried `HttpURLConnection`, **but** what about the `static` thing? I have something like this: `private static Set knownWebsites = Collections.synchronizedSet(new HashSet());` (the `HashMap` is not static). Is this a problem? I thought it was easier using `static` than passing everything via constructor to all crawlers. – aufziehvogel Jul 19 '12 at 19:22
  • That static set might be your problem: static means it is not tied to (the lifetime of) a particular object, so the contents will be retained until the program finishes (or you explicitly remove them). I would suggest removing the `Website` elements once you are done with processing them (if you want to avoid re-processing ones you already have, consider storing the URL instead of the full content of the webpage) -- this of course assumes that `Website` _does_ contain the full content of the site, and thus is the source of running out of memory (see the sketch after this comment thread). – Attila Jul 19 '12 at 20:24
  • @Attila: `Website` does not contain any HTML code, but one of these static variables seems to have been the source of all evil. I removed them for a try and only used my `queue` (which is passed by reference to all crawlers, because I need to fill it with initial values first), and now it even runs on 10-30 MB with 15 crawlers at the same time. – aufziehvogel Jul 21 '12 at 17:31
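
To make the point from this comment thread concrete, here is a hypothetical sketch (the `CrawlContext`/`markVisited`/`knownUrls` names are made up, not from the question's code): instead of a static, ever-growing collection of `Website` objects, a per-crawl context holding only URL strings can be passed to every crawler, so its contents live no longer than the crawl itself.

import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class CrawlContext {
    // Only the URLs are remembered, not Website objects or page content.
    private final Set<String> knownUrls =
            Collections.synchronizedSet(new HashSet<String>());

    // Returns true if the URL was not seen before and has been added now.
    public boolean markVisited(String url) {
        return knownUrls.add(url);
    }
}

Each crawler thread would receive the same CrawlContext instance via its constructor (much like the queue the OP ended up passing by reference), so nothing is pinned to a static field for the lifetime of the class.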

I'm not sure your info leads to the conclusion that garbage collection isn't working. You're simply running out of memory when allocating more memory. You say that you think there are objects that are eligible for GC, but the JVM doesn't think so. I'm pretty sure I'd trust the JVM over a guess!

You have a memory leak somewhere (else) in your app. You're holding on to a reference to the whole content of a web page somewhere in some object, and that is filling up your free memory.

Sean Owen

When I give Java 16MB, it uses 14MB and fails, when I give it 64MB it used 59MB and fails when trying to download a large image.

This is not surprising, as you are so close to your limit. A 3 MB image can unpack to 60 MB or more when loaded (decompressed). Can you increase the maximum to 1 GB?
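
If you do raise -Xmx, a quick sanity check (just an illustrative helper, not part of the question's code) is to print the limits the JVM is actually running with:

public class HeapInfo {
    public static void main(String[] args) {
        // Prints the heap limits of the running JVM, e.g. after starting with -Xmx1024m.
        Runtime rt = Runtime.getRuntime();
        System.out.println("max heap:   " + rt.maxMemory() / (1024 * 1024) + " MB");
        System.out.println("total heap: " + rt.totalMemory() / (1024 * 1024) + " MB");
        System.out.println("free heap:  " + rt.freeMemory() / (1024 * 1024) + " MB");
    }
}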

Peter Lawrey
  • I allowed it 512 MB and it got up to there steadily (in about half an hour or an hour); the Analyzer tells me it's 380 MB of char[]. So at least the images do not seem to be the problem. 73 MB is used by a HashMap of crawling times for domains (which is not yet limited), and the remaining 40 MB by the rest of the program. – aufziehvogel Jul 19 '12 at 17:16
  • The reason it fails when it gets close is that it needs a contiguous block of free memory of the required size, not just that much free memory in total. – Rob Trickey Jul 19 '12 at 18:23

It is likely that a reference is kept somewhere, preventing garbage collection. This always takes some mucking around to correct. I usually start with a profiler with heap analysis. If possible, write a small test program that loads a page and not much else. It can simply work off a list of 3-4 URLs that contain some large pictures. If a page contains a large picture, like 10+ MB, it should be easy to find in the profiler. The worst case is that a library you use holds the reference. A small test program would be the best way to debug.
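
A hypothetical skeleton of such a test program (the URLs are placeholders): it fetches a handful of pages in a loop and prints the used heap after each one, so a leak shows up as steadily growing numbers in the log or the profiler.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class PageLoadTest {
    public static void main(String[] args) throws Exception {
        String[] urls = { "http://example.org/", "http://example.com/" };
        for (int round = 0; round < 100; round++) {
            for (String u : urls) {
                String page = readPage(u);
                long usedMb = (Runtime.getRuntime().totalMemory()
                        - Runtime.getRuntime().freeMemory()) / (1024 * 1024);
                System.out.println("round " + round + ", " + u + ": "
                        + page.length() + " chars, used heap " + usedMb + " MB");
            }
        }
    }

    private static String readPage(String url) throws Exception {
        BufferedReader br = new BufferedReader(
                new InputStreamReader(new URL(url).openStream()));
        try {
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
            return sb.toString();
        } finally {
            br.close(); // always release the stream, even on errors
        }
    }
}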


How many threads do you have running at any particular time? It appears the char array you sent in the pastebin is thread-local (implying no leak). What you may be seeing is that if you run too many threads concurrently, you will naturally run out of memory. Try running with 2 threads but the same number of URLs.
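
As a sketch of capping the concurrency (the crawler tasks here are plain Runnables; the two-thread limit simply mirrors the suggestion above):

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BoundedCrawl {
    // Runs all crawler tasks, but only two at a time, so at most two
    // page buffers are held in memory concurrently.
    public static void crawl(List<Runnable> crawlerTasks) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        for (Runnable task : crawlerTasks) {
            pool.submit(task);
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}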

John Vint

Another possible reason I found is that substring() uses the same large char array that backs the original string. So if you retain a substring, the complete string's character array is retained as well.
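
On the JVMs of that era (before Java 7u6), `substring()` and regex group matches returned strings that shared the original backing array, so keeping a match alive kept the whole page alive. A common workaround, sketched below with a made-up `extractLinks` helper, is to copy the matched text with `new String(...)` before storing it:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkScan {
    private static final Pattern HREF =
            Pattern.compile("href=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    public static List<String> extractLinks(String source) {
        List<String> links = new ArrayList<String>();
        Matcher m = HREF.matcher(source);
        while (m.find()) {
            // new String(...) copies only the matched characters instead of
            // keeping a view into the whole page's char[] (pre-Java-7u6 behaviour).
            links.add(new String(m.group(1)));
        }
        return links;
    }
}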

aufziehvogel