I am writing a multithreaded webcrawler, where there is one WebCrawler object which uses an ExecutorService to process WebPages and extract anchors from each page. I have a method defined in the WebCrawler class which can be called by WebPages to add extracted sublinks to the WebCrawler's Set of nextPagestoVisit, and the method currently looks like this:

public synchronized void addSublinks(Set<WebPage> sublinks) {
    this.nextPagestoVisit.addAll(sublinks);
}

Currently I am using a synchronized method. However, I am considering other possible options.

  1. Making the Set a synchronizedSet:

    public Set<WebPage> nextPagestoVisit = Collections.synchronizedSet(new HashSet<WebPage>());
    
  2. Making the Set volatile:

    public volatile Set<WebPage> nextPagestoVisit = new HashSet<WebPage>();
    

Is each of these alternatives sufficient on its own? (I am assuming that the synchronized method approach is sufficient.) Or would I have to combine them with other safety measures? If they all work, which one would be the best approach? If one or both do not work, please provide a short explanation of why (i.e. what kind of scenario would cause problems). Thanks.

Edit: To be clear, my goal is to ensure that if two WebPages both try to add their sublinks at the same time, one write will not be overwritten by the other (i.e. all sublinks will successfully be added to the Set).

b_pcakes

2 Answers

I am not sure that you know what the volatile keyword actually does. It does not ensure mutual exclusion. Quoting from here:

"Using volatile, on the other hand, forces all accesses (read or write) to the volatile variable to occur to main memory, effectively keeping the volatile variable out of CPU caches. This can be useful for some actions where it is simply required that visibility of the variable be correct and order of accesses is not important."

You do have however several alternatives:

  • Using a synchronized block

    synchronized (this) {
        // synchronized code
    }
    
  • Using alternatives like semaphores

    Semaphore semaphore = new Semaphore(1);
    semaphore.acquire(); // throws InterruptedException
    try {
        ...
    } finally {
        semaphore.release();
    }
    

Again, note that what you are describing is synchronized access. If all you need is for every thread to always see the most recent value of the variable, then volatile is a fairly simple solution.
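As a rough sketch of the semaphore alternative applied to the question's method: the class name `SemaphoreCrawler`, the `size()` helper, and the use of plain `String` URLs in place of `WebPage` are all assumptions made for illustration, not the asker's actual code. A semaphore with one permit is used here as a simple mutual-exclusion lock around `addAll`.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.Semaphore;

public class SemaphoreCrawler {
    // Hypothetical stand-in for the question's WebCrawler; pages are plain URL strings.
    private final Set<String> nextPagesToVisit = new HashSet<>();
    // A semaphore with a single permit behaves like a mutual-exclusion lock.
    private final Semaphore permit = new Semaphore(1);

    public void addSublinks(Set<String> sublinks) {
        permit.acquireUninterruptibly(); // blocks until the permit is free
        try {
            nextPagesToVisit.addAll(sublinks);
        } finally {
            permit.release(); // always hand the permit back, even if addAll throws
        }
    }

    public int size() {
        permit.acquireUninterruptibly(); // reads need the same guard as writes
        try {
            return nextPagesToVisit.size();
        } finally {
            permit.release();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        SemaphoreCrawler crawler = new SemaphoreCrawler();
        Thread[] workers = new Thread[4];
        for (int t = 0; t < workers.length; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                // Each worker collects 1000 distinct links, then adds them in one call.
                Set<String> found = new HashSet<>();
                for (int i = 0; i < 1000; i++) {
                    found.add("http://example.com/" + id + "/" + i);
                }
                crawler.addSublinks(found);
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        System.out.println(crawler.size()); // 4 workers x 1000 distinct links = 4000
    }
}
```

The `try`/`finally` is the part that is easy to forget: without it, an exception inside the guarded section would leave the permit taken and block every later caller.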

Rafael Saraiva
  • Please see my edit above. Basically, I wish to ensure that if two threads both write to the Set simultaneously, both will successfully write all of their sublinks to the set. I guess the question in this case is whether adding an element to a set requires multiple operations (i.e. reading the set and then adding to it). If it didn't, it seems like volatile would be sufficient, correct? – b_pcakes Nov 12 '15 at 20:22
Making the variable that holds the set volatile will do nothing for you. For a start, volatile only affects the reference (the "pointer") to the set, not the set itself. All it guarantees is that writes to that reference are seen by all threads. It does nothing for the contents of the Set, so concurrent add calls on a plain HashSet are still unsafe.
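A minimal, single-threaded sketch of that distinction (the class name `VolatileReferenceDemo` and the `String` elements are assumptions for illustration). It shows that calling `add` never writes the volatile field; only reassigning the field does.

```java
import java.util.HashSet;
import java.util.Set;

public class VolatileReferenceDemo {
    // volatile covers reads and writes of this field (the reference) only.
    public static volatile Set<String> pages = new HashSet<>();

    public static void main(String[] args) {
        Set<String> before = pages;          // volatile read of the reference
        pages.add("http://example.com");     // volatile read, then an ordinary unsynchronized HashSet.add
        System.out.println(before == pages); // true: the field itself was never written

        pages = new HashSet<>(pages);        // only an assignment like this is a volatile write
        System.out.println(before == pages); // false: the field now points to a new set
    }
}
```

So with several threads calling `add` or `addAll`, every thread sees the same `HashSet` object, but nothing coordinates their changes to its internals.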

Making the Set a synchronizedSet does what you want. Synchronized blocks or Semaphores would also work, but both add more boilerplate than simply using synchronizedSet and are an additional vector for bugs.
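For illustration, here is a small sketch of the synchronizedSet approach driven by an ExecutorService, as in the question. The class name `SynchronizedSetCrawler`, the `pendingCount()` helper, and the `String` URLs standing in for `WebPage` are assumptions, not the asker's code.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SynchronizedSetCrawler {
    // Every method call on the wrapper, including addAll, holds the wrapper's lock.
    private final Set<String> nextPagesToVisit =
            Collections.synchronizedSet(new HashSet<>());

    // No synchronized keyword needed here: the set guards itself.
    public void addSublinks(Set<String> sublinks) {
        nextPagesToVisit.addAll(sublinks);
    }

    public int pendingCount() {
        return nextPagesToVisit.size();
    }

    public static void main(String[] args) throws InterruptedException {
        SynchronizedSetCrawler crawler = new SynchronizedSetCrawler();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int t = 0; t < 8; t++) {
            final int page = t;
            pool.execute(() -> {
                // Each task simulates one page yielding 500 distinct links.
                Set<String> found = new HashSet<>();
                for (int i = 0; i < 500; i++) {
                    found.add("http://example.com/" + page + "/" + i);
                }
                crawler.addSublinks(found);
            });
        }
        pool.shutdown();
        if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
            throw new IllegalStateException("workers did not finish in time");
        }
        System.out.println(crawler.pendingCount()); // 8 tasks x 500 distinct links = 4000
    }
}
```

One caveat from the `Collections.synchronizedSet` documentation: iterating over the set is not covered by the wrapper, so any loop over it must be placed in a `synchronized` block on the set itself.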

Michael Lloyd Lee mlk
  • To clarify, I am assuming you meant to say "Making the variable that holds the set volatile will do nothing for you"? – b_pcakes Nov 12 '15 at 20:16