I wrote a small crawler and found that it was running out of heap space, even though I currently limit the number of URLs in my list to 300.
With the Java Memory Analyzer I found that the biggest consumer is char[]
(45 MB out of 64 MB, or more if I increase the allowed size; it just grows constantly).
The analyzer also shows me the contents of these char[]
arrays: they contain the HTML pages that the crawler has read.
After some deeper analysis with different settings for -Xmx[...]m
, I found that Java uses almost all the space it is given and then runs out of heap
as soon as I try to download an image of about 3 MB.
When I give Java 16 MB, it uses 14 MB and fails; when I give it 64 MB, it uses 59 MB and fails while trying to download a large image.
Pages are read with this piece of code (edited to add .close()
):
private String readPage(Website url) throws CrawlerException {
    StringBuffer sourceCodeBuffer = new StringBuffer();
    try {
        URLConnection con = url.getUrl().openConnection();
        con.setConnectTimeout(2000);
        con.setReadTimeout(2000);
        BufferedReader br = new BufferedReader(new InputStreamReader(con.getInputStream()));
        String strTemp = "";
        try {
            while (null != (strTemp = br.readLine())) {
                sourceCodeBuffer.append(strTemp);
            }
        } finally {
            br.close();
        }
    } catch (IOException e) {
        throw new CrawlerException();
    }
    return sourceCodeBuffer.toString();
}
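For comparison, here is the same reading logic written with try-with-resources (Java 7+), which guarantees the reader is closed even when an exception is thrown. This is only a sketch with the network part factored out into a hypothetical readAll(Reader) helper, so it is not my actual crawler code:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class PageReader {
    // Reads everything from the given Reader into one String, line by line.
    // Same logic as readPage above, minus the URLConnection setup.
    static String readAll(Reader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(in)) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line);
            }
        } // br (and the wrapped Reader) are closed here, even on exceptions
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readAll(new StringReader("<html>\n</html>")));
    }
}
```

Note that readLine() strips the line terminators, so the result is the page with newlines removed, just like in my original code.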
Another function uses the returned string in a while loop, but as far as I know, the memory should be freed as soon as the variable is overwritten with the next page:
public void run() {
    boolean stop = false;
    while (!stop) {
        try {
            Website nextPage = getNextPage();
            String source = visitAndReadPage(nextPage);
            List<Website> links = new LinkExtractor(nextPage).extract(source);
            List<Website> images = new ImageExtractor(nextPage).extract(source);
            // do something with links and images, source is not used anymore
        } catch (CrawlerException e) {
            logger.warning("could not crawl a url");
        }
    }
}
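One thing I am unsure about: if the extractors build their link and image strings with String.substring, then on older JDKs (before 7u6) each substring shares the backing char[] of the whole page, so keeping the extracted links alive would keep the entire HTML char[] alive. A minimal sketch of what I mean, with a hypothetical extract method standing in for whatever LinkExtractor really does:

```java
public class SubstringDemo {
    // Hypothetical stand-in for what a link extractor might do:
    // cut a small piece out of a very large page string.
    static String extractNaive(String page, int from, int to) {
        // On JDKs before 7u6, substring() shares the char[] of 'page',
        // so this small String can pin the whole page in memory.
        return page.substring(from, to);
    }

    static String extractCopied(String page, int from, int to) {
        // new String(...) makes a trimmed copy with its own char[],
        // so 'page' can be garbage collected independently of the result.
        return new String(page.substring(from, to));
    }

    public static void main(String[] args) {
        String page = "<html><a href=\"http://example.com\">x</a></html>";
        System.out.println(extractNaive(page, 15, 33));   // http://example.com
        System.out.println(extractCopied(page, 15, 33));  // http://example.com
    }
}
```

Both calls return the same text; the difference is only which char[] backs the result, which is exactly the kind of thing the analyzer's retained-heap column would show.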
Below is an example of the output the analyzer gives me. When I ask where these char[]
arrays are still referenced, the analyzer cannot tell me. So I assume they are no longer needed and should be garbage collected. Since usage always stays slightly below the maximum, Java does seem to garbage collect, but only as much as necessary to keep the program running for now (not anticipating that large input might be coming).
Also, explicitly calling System.gc()
every 5 seconds, or even after setting source = null;
, did not help.
The website sources just seem to be stored for as long as possible.
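To check whether source is actually unreachable after I null it, I could watch it through a WeakReference: if the referent disappears after a GC, nothing else holds it; if it survives, something (the URLConnection?) still references it. A small standalone sketch of that check, not crawler code:

```java
import java.lang.ref.WeakReference;

public class ReachabilityCheck {
    // Returns true if a String with no remaining strong references
    // actually gets collected after a few GC nudges.
    static boolean collectibleAfterNulling() throws InterruptedException {
        String page = new String(new char[100_000]); // stands in for one page's source
        WeakReference<String> ref = new WeakReference<>(page);
        page = null; // drop the only strong reference, like 'source = null;'

        // System.gc() is only a hint, so retry a few times.
        for (int i = 0; i < 50 && ref.get() != null; i++) {
            System.gc();
            Thread.sleep(10);
        }
        return ref.get() == null;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("collected: " + collectibleAfterNulling());
    }
}
```

If the same trick applied to source inside the crawler reported false, that would confirm some hidden strong reference is keeping the pages alive.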
Am I using something similar to an ObjectOutputStream
, which forces the strings it reads to be retained forever? Or how is it possible that Java keeps these website Strings
in char[]
arrays for so long?
Class Name | Shallow Heap | Retained Heap | Percentage
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
char[60750] @ 0xb02c3ee0 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><title>Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen</title><link rel="shortcut icon" href="http://img.e-wallp...| 121.512 | 121.512 | 1,06%
char[60716] @ 0xb017c9b8 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><title>Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen</title><link rel="shortcut icon" href="http://img.e-wallp...| 121.448 | 121.448 | 1,06%
char[60686] @ 0xb01f3c88 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><title>Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen</title><link rel="shortcut icon" href="http://img.e-wallp...| 121.384 | 121.384 | 1,06%
char[60670] @ 0xb015ec48 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><title>Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen</title><link rel="shortcut icon" href="http://img.e-wallp...| 121.352 | 121.352 | 1,06%
char[60655] @ 0xb01d5d08 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><title>Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen</title><link rel="shortcut icon" href="http://img.e-wallp...| 121.328 | 121.328 | 1,06%
char[60651] @ 0xb009d9c0 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><title>Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen</title><link rel="shortcut icon" href="http://img.e-wallp...| 121.320 | 121.320 | 1,06%
char[60637] @ 0xb022f418 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><title>Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen</title><link rel="shortcut icon" href="http://img.e-wallp...| 121.288 | 121.288 | 1,06%
Edit
After testing with even more memory, I found this occurrence of a URL in the dominator tree:
Class Name | Shallow Heap | Retained Heap | Percentage
crawling.Website @ 0xa8d28cb0 | 16 | 759.776 | 0,15%
|- java.net.URL @ 0xa8d289c0 https://www.google.com/recaptcha/api/image?c=03AHJ_VuuT4CmbxjAoKzWEKOqLaTCyhT-89l3WOeVjekKWW81tdZsnCvpIrQ52aLTw92rP-EUP9ThnzwBwHcRLXG6A0Bpwu11cGttRAUtarmWXhdcTVRoUMLNnJNZeuuA7LedgfTou76nl8ULyuIR3tgo7_lQ21tzzBhpaTSqwYHWyuZGfuRK3z9pgmqRqvI7gE4_4lexjYbkpd62kN... | 56 | 759.736 | 0,15%
| |- char[379486] @ 0xa8c6f4f8 <!DOCTYPE html><html lang="en"> <head> <meta charset="utf-8"> <meta http-equiv="X-UA-Compatible" content="IE=EmulateIE9"> <title>Google Accounts</title><style type="text/css"> html, body, div, h1, h2, h3, h4, h5, h6, p, img, dl, dt, dd, ol, ul, li, t... | 758.984 | 758.984 | 0,15%
| |- java.lang.String @ 0xa8d28a40 /recaptcha/api/image?c=03AHJ_VuuT4CmbxjAoKzWEKOqLaTCyhT-89l3WOeVjekKWW81tdZsnCvpIrQ52aLTw92rP-EUP9ThnzwBwHcRLXG6A0Bpwu11cGttRAUtarmWXhdcTVRoUMLNnJNZeuuA7LedgfTou76nl8ULyuIR3tgo7_lQ21tzzBhpaTSqwYHWyuZGfuRK3z9pgmqRqvI7gE4_4lexjYbkpd62kNBZ7UIDccO5bx6TqFpf-7Sl...| 24 | 624 | 0,00%
| | '- char[293] @ 0xa8d28a58 /recaptcha/api/image?c=03AHJ_VuuT4CmbxjAoKzWEKOqLaTCyhT-89l3WOeVjekKWW81tdZsnCvpIrQ52aLTw92rP-EUP9ThnzwBwHcRLXG6A0Bpwu11cGttRAUtarmWXhdcTVRoUMLNnJNZeuuA7LedgfTou76nl8ULyuIR3tgo7_lQ21tzzBhpaTSqwYHWyuZGfuRK3z9pgmqRqvI7gE4_4lexjYbkpd62kNBZ7UIDccO5bx6TqFpf-7Sl... | 600 | 600 | 0,00%
| |- java.lang.String @ 0xa8d289f8 c=03AHJ_VuuT4CmbxjAoKzWEKOqLaTCyhT-89l3WOeVjekKWW81tdZsnCvpIrQ52aLTw92rP-EUP9ThnzwBwHcRLXG6A0Bpwu11cGttRAUtarmWXhdcTVRoUMLNnJNZeuuA7LedgfTou76nl8ULyuIR3tgo7_lQ21tzzBhpaTSqwYHWyuZGfuRK3z9pgmqRqvI7gE4_4lexjYbkpd62kNBZ7UIDccO5bx6TqFpf-7Sl6YmMgFC77kWZR7vvZIPkS...| 24 | 24 | 0,00%
| |- java.lang.String @ 0xa8d28a10 www.google.com | 24 | 24 | 0,00%
| |- java.lang.String @ 0xa8d28a28 /recaptcha/api/image | 24 | 24 | 0,00%
From the indentation I am really wondering: why is the HTML source part of the java.net.URL
object? Does this come from the URLConnection I opened?