
I'm trying to crawl a website and insert every href that I find into a HashSet. After about 650 links have been inserted I get the exception java.lang.OutOfMemoryError: GC overhead limit exceeded. How can I get it to work?

Here is my code:

public void getPageLinks(String URL, String otherlinksSelector ) {
    if (!links.contains(URL)) {
        try {
            Document document = Jsoup.connect(URL).userAgent("Mozilla").get();
            Elements otherLinks = document.select(otherlinksSelector); 
            for (Element page : otherLinks) {
                if (links.add(URL)) {
                    System.out.println(URL);
                }
                getPageLinks(page.attr("abs:href"),otherlinksSelector);
            }
        } catch (Exception e) {
            System.err.println(e.getMessage());
        }
    }
}
Kirito Zack

2 Answers


First and foremost, a crawler that manages all URLs purely in memory has to be rather picky about which URLs to keep and which to discard, as memory is a limiting factor for crawlers unless you externalize this information or have a cluster with a near-infinite amount of memory available. That said, 650 URLs before running out of memory is a very tiny amount. The exception at least states that the garbage collector is spending too much time trying to free memory, which indicates that the maximum available memory is simply not enough. A sketch of what "being picky" can look like follows below.
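
To illustrate the "be picky" part, here is a minimal sketch that caps how many URLs are kept in memory and crawls with an explicit queue instead of recursion. It reuses the Jsoup calls from your question; the class name and the MAX_URLS cap are arbitrary and only meant as an example:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BoundedCrawler {

    // Arbitrary cap; tune it to whatever your heap can actually hold.
    private static final int MAX_URLS = 10_000;

    private final Set<String> links = new HashSet<>();

    public void getPageLinks(String startUrl, String otherlinksSelector) {
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(startUrl);

        while (!frontier.isEmpty() && links.size() < MAX_URLS) {
            String url = frontier.poll();
            if (!links.add(url)) {
                continue; // already crawled
            }
            System.out.println(url);
            try {
                Document document = Jsoup.connect(url).userAgent("Mozilla").get();
                for (Element page : document.select(otherlinksSelector)) {
                    String next = page.attr("abs:href");
                    if (!next.isEmpty() && !links.contains(next)
                            && frontier.size() < MAX_URLS) {
                        frontier.add(next);
                    }
                }
                // The parsed Document goes out of scope after each iteration,
                // so only one page is held in memory at a time.
            } catch (Exception e) {
                System.err.println(e.getMessage());
            }
        }
    }
}

This is only a starting point; a real crawler would also normalize URLs, respect robots.txt and restrict itself to the host it started on.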

One approach to learn what fills your memory is to use a profiler, take a heap dump at certain time intervals, and then check the dump for which objects are alive, how much memory they occupy and which objects are referring to them. Also, try to force a GC before taking the heap dump to learn what actually stays in memory. This way you might see what is preventing the collector from freeing more memory.
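
If you prefer triggering such a dump from inside your crawler rather than from a profiler UI, a minimal sketch for a HotSpot JVM looks like this (class name and file path are just examples; the resulting .hprof file can then be opened in VisualVM or Eclipse MAT):

import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

public class HeapDumpHelper {

    // Writes a .hprof heap dump to the given path.
    // The target file must not exist yet, otherwise dumpHeap throws an IOException.
    public static void dump(String filePath, boolean liveObjectsOnly) throws Exception {
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        // liveObjectsOnly = true keeps only objects that are still reachable,
        // which has a similar effect to forcing a GC before the dump.
        bean.dumpHeap(filePath, liveObjectsOnly);
    }
}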

Next, there are a couple of scientific papers (DRUM, VEUNIQ, ...) that research how to persist visited URLs, including a uniqueness check, in a performant way. There are a couple of open-source implementations in the works, though most of them are not yet finished (including my own attempt); DRUMS is maybe the most notable one.
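
As a very rough sketch of the underlying idea (nothing like DRUM's batched, sorted disk structures; all names here are purely illustrative): keep only a compact fingerprint per URL in memory and write the full URL to disk, so the heap cost per URL is a single long instead of a whole String:

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.Set;

public class UrlStore implements AutoCloseable {

    private final Set<Long> seenFingerprints = new HashSet<>();
    private final BufferedWriter out;

    public UrlStore(Path urlFile) throws IOException {
        this.out = Files.newBufferedWriter(urlFile, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // Returns true if the URL was not seen before; the full URL only lives on disk.
    public boolean addIfUnseen(String url) throws IOException {
        if (!seenFingerprints.add(fingerprint(url))) {
            return false;
        }
        out.write(url);
        out.newLine();
        return true;
    }

    private static long fingerprint(String url) {
        // FNV-1a 64-bit hash; collisions are possible but rare for a small crawl.
        long hash = 0xcbf29ce484222325L;
        for (int i = 0; i < url.length(); i++) {
            hash ^= url.charAt(i);
            hash *= 0x100000001b3L;
        }
        return hash;
    }

    @Override
    public void close() throws IOException {
        out.close();
    }
}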

Roman Vottner

You can keep writing links to a file instead of keeping them in memory. That way you'll have less data in memory. You can read from the same file if you later want to go back to some link that you found in the past.
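
A minimal sketch of that idea (file name and method names are just examples; the linear scan is slow for large crawls, but it keeps almost nothing on the heap):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.stream.Stream;

public class LinkFile {

    private final Path file = Paths.get("links.txt");

    // Appends one link per line instead of adding it to a HashSet.
    public void saveLink(String url) throws IOException {
        Files.write(file, (url + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // Re-reads the file to check whether a link was already found earlier.
    public boolean alreadySeen(String url) throws IOException {
        if (!Files.exists(file)) {
            return false;
        }
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            return lines.anyMatch(url::equals);
        }
    }
}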

Sandeep Kaul