I have 10 web crawlers which share a LinkedBlockingQueue. From my debug view in Eclipse I was able to find out that once I have fetched several URLs (about 1000), the queue.take() call takes very long.
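Each crawler thread basically loops over the queue like this (a rough sketch, not my exact code; fetchAndParse stands in for the actual download-and-parse step):

public void run() {
    try {
        while (true) {
            URL url = getNextPage();              // blocks in queue.take()
            List<URL> links = fetchAndParse(url); // placeholder for downloading the page and extracting its links
            visited.add(url);
            appendLinksToQueue(links);
        }
    } catch (CrawlerException e) {
        // the crawler stops when something goes wrong
    }
}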
Taking the next URL from the queue works like this:
private synchronized URL getNextPage() throws CrawlerException {
    URL url;
    try {
        System.out.println(queue.size());
        url = queue.take();
    } catch (InterruptedException e) {
        throw new CrawlerException();
    }
    return url;
}
I only added synchronized and the queue.size() call for debugging purposes, to see whether the queue is really filled when take() gets called. Yes, it is (1350 elements in this run).
queue.put(), on the other hand, only gets called when a URL is really new:
private void appendLinksToQueue(List<URL> links) throws CrawlerException {
    for (URL url : links) {
        try {
            if (!visited.contains(url) && !queue.contains(url)) {
                queue.put(url);
            }
        } catch (InterruptedException e) {
            throw new CrawlerException();
        }
    }
}
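For completeness, the two shared collections are declared more or less like this (visited is shown as a synchronized Set here only for illustration; the exact implementation is not the point):

// the queue shared by all 10 crawlers
private final LinkedBlockingQueue<URL> queue = new LinkedBlockingQueue<URL>();
// URLs that have already been fetched
private final Set<URL> visited = Collections.synchronizedSet(new HashSet<URL>());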
However, none of the other crawlers seem to produce too many new URLs either, so the queue should not really block. This is how many URLs we have in the queue (sampled at a 5-second interval):
Currently we have sites: 1354
Currently we have sites: 1354
Currently we have sites: 1354
Currently we have sites: 1354
Currently we have sites: 1355
Currently we have sites: 1355
Currently we have sites: 1355
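Those numbers come from a small monitoring thread that just prints the queue size every 5 seconds, roughly like this:

// debug monitor: print the queue size every 5 seconds
new Thread(new Runnable() {
    public void run() {
        try {
            while (true) {
                System.out.println("Currently we have sites: " + queue.size());
                Thread.sleep(5000);
            }
        } catch (InterruptedException e) {
            // stop monitoring
        }
    }
}).start();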
According to the Javadoc, contains() is inherited from AbstractCollection, so I guess it at least has nothing to do with multithreading and thus cannot be the reason for the blocking either.
The point is, from my debugging I can also see that the other threads seem to be blocked in queue.take() as well. However, it is not an eternal block: sometimes one of the crawlers can go on, but they are stuck for more than a minute. Currently, I cannot see any of them going on.
Do you know how this could happen?