I am scraping content from a website. I have an async method that visits pages recursively and scrapes their content. I pass a HashSet and a List to this recursive function: the List collects the content of all pages, and the HashSet stores already visited links so that we don't visit them again. The relevant portion of the function is as follows:

public async Task ScrapeContentRecAsync(string uri, List<Content> allContent, HashSet<string> alreadyVisited) {
  ...
  var pageHtml = await httpClient.GetStringAsync(uri);
  alreadyVisited.Add(uri);
  ...
  allContent.Add(someContent);
  ...
  var newLinks = FindAllCrawlableLinks(pageHtml);
  foreach(var newLink in newLinks) {
    await ScrapeContentRecAsync(newLink, allContent, alreadyVisited);
  }
}

As you can see, I am awaiting each new link that can be scraped (please don't suggest optimising by launching parallel tasks/parallel calls, because I have been asked not to do that). So basically, as soon as we find a new link, we recurse into it. The new call adds the newly scraped data to the allContent list, and the new link is also added to alreadyVisited. In simple terms, it is a preorder DFS of the tree of web pages.

The application is a console application, so there is no SynchronizationContext and the default TaskScheduler is used, i.e. the code after the await will be executed on a thread pool thread.

Now, in the old-school way, whenever multiple threads added to a list, we used a lock, which ensured that only one thread added to the list at a time and also made sure that any changes to the guarded variable were visible to other threads.

Since my continuations can be executed on any thread pool thread, there is a chance that different threads are handling the recursive calls and adding to both the List and the HashSet.

  1. Will the changes made to the collections on one thread pool thread be visible to other threads?

  2. Can there be concurrency issues in the above scenario?

  3. If I had launched multiple recursive calls in parallel (as an optimisation), would I definitely have needed a thread-safe collection?

Navjot Singh

2 Answers

Since my continuations can be executed on any thread pool thread, there is a chance that different threads are handling the recursive calls and adding to both the List and the HashSet.

Yes.

Will the changes made to the collections on one thread pool thread be visible to other threads?

Yes. await inserts the appropriate memory barriers for you.

Can there be concurrency issues in the above scenario?

No. The code as-is is asynchronous but serial.
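
To make "asynchronous but serial" concrete, here is a minimal sketch, not the asker's actual code. It assumes a hypothetical in-memory link table (`Links`) and uses `Task.Delay` as a stand-in for `httpClient.GetStringAsync`; the names `SerialCrawlDemo`, `CrawlAsync`, `RunAsync` and `threadIds` are made up for illustration, and the visited check is placed before the await for brevity. Continuations may resume on different thread pool threads, but only one call is ever in flight, so a plain List and HashSet are enough.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class SerialCrawlDemo
{
    // Hypothetical in-memory "site": page -> links found on that page.
    private static readonly Dictionary<string, string[]> Links = new Dictionary<string, string[]>
    {
        ["/"] = new[] { "/a", "/b" },
        ["/a"] = new[] { "/c", "/" },
        ["/b"] = new[] { "/c" },
        ["/c"] = new string[0],
    };

    public static async Task CrawlAsync(string uri, List<string> allContent,
        HashSet<string> alreadyVisited, HashSet<int> threadIds)
    {
        if (!alreadyVisited.Add(uri)) return;   // skip pages seen before

        await Task.Delay(1);                    // stand-in for httpClient.GetStringAsync(uri)

        // The continuation may run on any thread pool thread; record which one.
        threadIds.Add(Environment.CurrentManagedThreadId);

        allContent.Add("content of " + uri);
        foreach (var link in Links[uri])
        {
            // Awaited one at a time: the next call starts only after this one completes.
            await CrawlAsync(link, allContent, alreadyVisited, threadIds);
        }
    }

    public static async Task<List<string>> RunAsync()
    {
        var allContent = new List<string>();
        var threadIds = new HashSet<int>();
        await CrawlAsync("/", allContent, new HashSet<string>(), threadIds);
        Console.WriteLine($"pages: {allContent.Count}, distinct threads: {threadIds.Count}");
        return allContent;
    }
}
```

The number of distinct threads reported can vary from run to run, but the content always comes out in the same preorder sequence, because no two calls ever touch the collections at the same time.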

If I would have launched multiple recursive calls in parallel (optimisation), then would I have surely needed a thread safe collection?

Yes. Asynchronous concurrency would require a thread safe collection or a lock.
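
Purely to illustrate the hypothetical in this last question (the asker is not doing this), here is a rough sketch of the lock option. It assumes the same kind of hypothetical in-memory link table and a `Task.Delay` stand-in for the HTTP call; `LockedCrawlDemo`, `Gate`, `CrawlAsync` and `RunAsync` are illustrative names. Once the child calls are started together with `Task.WhenAll`, several continuations can touch the collections at once, so every access to them goes through one lock. Note that nothing is awaited while the lock is held.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class LockedCrawlDemo
{
    // Hypothetical in-memory "site"; only read after initialisation.
    private static readonly Dictionary<string, string[]> Links = new Dictionary<string, string[]>
    {
        ["/"] = new[] { "/a", "/b" },
        ["/a"] = new[] { "/c", "/" },
        ["/b"] = new[] { "/c" },
        ["/c"] = new string[0],
    };

    // One lock guards both the List and the HashSet.
    private static readonly object Gate = new object();

    public static async Task CrawlAsync(string uri, List<string> allContent,
        HashSet<string> alreadyVisited)
    {
        lock (Gate)
        {
            // Check-and-add must be a single atomic step under concurrency.
            if (!alreadyVisited.Add(uri)) return;
        }

        await Task.Delay(1);   // stand-in for httpClient.GetStringAsync(uri)

        lock (Gate)
        {
            allContent.Add("content of " + uri);
        }

        // Child calls now run concurrently instead of one after another.
        await Task.WhenAll(Links[uri].Select(
            link => CrawlAsync(link, allContent, alreadyVisited)));
    }

    public static async Task<List<string>> RunAsync()
    {
        var allContent = new List<string>();
        await CrawlAsync("/", allContent, new HashSet<string>());
        return allContent;
    }
}
```

With concurrent child calls, the order of items in the list is no longer a guaranteed preorder traversal; only the set of pages is the same.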

Stephen Cleary

Will the changes made to the collections on one thread pool thread be visible to other threads?

Can there be concurrency issues in the above scenario?

Asynchrony is different from parallelism. Here your execution flow is sequential, because each recursive call is awaited before the next one starts, so there is no race condition on your collections and you have no thread-safety concerns. All changes are also visible to all threads.

If I would have launched multiple recursive calls in parallel (optimisation), then would I have surely needed a thread safe collection?

In that case, yes: you would need ConcurrentBag or another thread-safe collection.
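
As a sketch of what that could look like for the hypothetical parallel variant, the example below swaps the List for a `ConcurrentBag<string>` and, since .NET has no built-in concurrent hash set, uses a `ConcurrentDictionary<string, byte>` as the visited set (a common workaround, not something from the question). The link table, the `Task.Delay` stand-in for the HTTP call, and the names `ConcurrentCrawlDemo`, `CrawlAsync` and `RunAsync` are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class ConcurrentCrawlDemo
{
    // Hypothetical in-memory "site"; only read after initialisation.
    private static readonly Dictionary<string, string[]> Links = new Dictionary<string, string[]>
    {
        ["/"] = new[] { "/a", "/b" },
        ["/a"] = new[] { "/c", "/" },
        ["/b"] = new[] { "/c" },
        ["/c"] = new string[0],
    };

    public static async Task CrawlAsync(string uri, ConcurrentBag<string> allContent,
        ConcurrentDictionary<string, byte> alreadyVisited)
    {
        // TryAdd is atomic: exactly one caller wins for a given uri.
        if (!alreadyVisited.TryAdd(uri, 0)) return;

        await Task.Delay(1);   // stand-in for httpClient.GetStringAsync(uri)

        allContent.Add("content of " + uri);   // safe to call from several threads

        // Child calls run concurrently.
        await Task.WhenAll(Links[uri].Select(
            link => CrawlAsync(link, allContent, alreadyVisited)));
    }

    public static async Task<List<string>> RunAsync()
    {
        var allContent = new ConcurrentBag<string>();
        await CrawlAsync("/", allContent, new ConcurrentDictionary<string, byte>());
        return allContent.ToList();
    }
}
```

`ConcurrentBag<T>` is unordered, so if the traversal order of the results matters, a lock around an ordinary List or a `ConcurrentQueue<T>` may be a better fit.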

Arman Ebrahimpour