
I have 10 web crawlers which share a LinkedBlockingQueue.

From my debug view in Eclipse I was able to find out that once I have fetched several URLs (about 1000), the queue.take() call takes a very long time.

This is how it works:

private synchronized URL getNextPage() throws CrawlerException {
    URL url;
    try {
        System.out.println(queue.size());
        url = queue.take();
    } catch (InterruptedException e) {
        throw new CrawlerException();
    }
    return url;
}

I only added synchronized and queue.size() for debugging purposes, to see if the queue is really filled when take() gets called. Yes, it is (1350 elements in this run).

queue.put(), on the other hand, only gets called when a URL is really new:

private void appendLinksToQueue(List<URL> links) throws CrawlerException {
    for (URL url : links) {
        try {
            if (!visited.contains(url) && !queue.contains(url)) {
                queue.put(url);
            }
        } catch (InterruptedException e) {
            throw new CrawlerException();
        }
    }
}

However, the other crawlers do not seem to produce too many new URLs either, so the queue should not really block. This is how many URLs we have in the queue (logged at 5-second intervals):

Currently we have sites: 1354
Currently we have sites: 1354
Currently we have sites: 1354
Currently we have sites: 1354
Currently we have sites: 1355
Currently we have sites: 1355
Currently we have sites: 1355

According to the Javadoc, contains() is inherited from AbstractCollection, so I guess this does not have anything to do with multithreading and thus cannot be the reason for blocking either.

The point is, from my debugging I can also see that the other threads also seem to be blocked in queue.take(). However, it is not an eternal block. Sometimes one of the crawlers can go on, but they are stuck for more than a minute. Currently, I cannot see any of them going on.

Do you know how this could happen?

aufziehvogel
  • Can you make a thread dump when threads are blocked in `take()` and post it here? – Tomasz Nurkiewicz Jul 14 '12 at 16:40
  • That `contains` call on the LinkedBlockingQueue is going to be slow - it has to search the whole queue - and will prevent any access in the meantime. Can you instead put urls into the visited set before putting them into the queue? (This also avoids a race condition where another thread adds the url to the queue after you call contains on it.) – Alan Stokes Jul 14 '12 at 16:44
  • Why not use a thread pool? It seems like you're doing a producer-consumer pattern. – Tudor Jul 14 '12 at 16:46
  • @Tomasz How can I do this? I tried right-clicking on the thread + *Copy stack*, but I cannot paste it anywhere (all that gets copied is the title of the thread). – aufziehvogel Jul 14 '12 at 16:52
  • @Alan Yeah, that’s right. I also just saw this in the debugger; it’s also very slow. But I did not know that contains() would block take(). The docs only refer to AbstractCollection and do not explain contains() any further. – aufziehvogel Jul 14 '12 at 16:52
  • It's also a race condition. Two threads could both see that the URL isn't there, and then both put it in. A checks, B checks, A puts, B puts. – yshavit Jul 14 '12 at 17:03
  • That `synchronized` has no effect if you don't synchronize the `put`s as well. – Marko Topolnik Jul 14 '12 at 19:00
  • Update: Adding a new set for already-known websites helped a lot for the `take` process, but the sets are also pretty slow. According to a friend, that’s due to the synchronization effort (I call `contains` for pretty many URLs per second). So I guess I have to tune this somehow. – aufziehvogel Jul 14 '12 at 19:16
  • You could use a `ConcurrentHashMap` as a concurrent set (use dummy values); that should help (a sketch of this follows the comments). – Alan Stokes Jul 14 '12 at 20:17
  • `ConcurrentHashMap` also breaks down at about 1000 URLs inside (using `containsKey()`, but also using `contains()`, though that was clear). So the problem really seems to be the synchronization overhead, but then I am wondering why it does not happen from the start (because I always find hundreds of new URLs per second). And `containsKey()` should be O(1). – aufziehvogel Jul 14 '12 at 20:47
  • It's not the synchronization overhead: it is unmeasurable on this scale. It's your algorithm. Searching the queue for contains() is O(N). – user207421 Jul 15 '12 at 00:38
  • @EJP: For a Queue, yes, but not for a HashSet or keys in a HashMap; there it’s O(1). Still, the `.contains()` and `.containsKey()` operations are really slow with HashSet and HashMap respectively. – aufziehvogel Jul 15 '12 at 07:48
  • Actually, I do not know why, but now it seems to work well with HashSet, both with 1 and 10 parallel threads. However, I do not see anything that I changed. I had commented out a lot of additional stuff before, but now I have uncommented it again and it still works. – aufziehvogel Jul 16 '12 at 17:40
  • Sorry, what data structure are you using for visited? – vitaly Sep 06 '12 at 16:02
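
A minimal sketch of the approach suggested in the comments (the class name UrlFrontier and the method names offerIfNew and next are illustrative, not taken from the original crawler): register each URL in a concurrent "seen" set before putting it on the queue, so the expensive queue.contains() scan is never needed and the check-then-put race cannot occur.

import java.net.URL;
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative helper, not part of the original crawler code.
class UrlFrontier {
    // Concurrent set backed by a ConcurrentHashMap, as suggested in the comments.
    private final Set<URL> seen =
            Collections.newSetFromMap(new ConcurrentHashMap<URL, Boolean>());
    private final LinkedBlockingQueue<URL> queue = new LinkedBlockingQueue<URL>();

    // add() returns false if another thread has already registered this URL,
    // so checking for duplicates and enqueueing become one atomic step.
    void offerIfNew(URL url) throws InterruptedException {
        if (seen.add(url)) {
            queue.put(url);
        }
    }

    URL next() throws InterruptedException {
        return queue.take();
    }
}

Only the thread whose seen.add(url) call returns true ever calls queue.put(url), so duplicates cannot be enqueued even without extra synchronization, and take() no longer competes with an O(N) contains() scan of the queue.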

0 Answers