I am scraping content from a website. I have an async method that visits the pages recursively and scrapes the content from them. In this recursive function I am passing a HashSet and a List. The List is to collect the content of all pages and the HashSet is to store already visited links so that we don't visit them again. The relevant portion of this function is as follows:
public async Task ScrapeContentRecAsync(string uri, List<Content> allContent, HashSet<string> alreadyVisited) {
    ...
    var pageHtml = await httpClient.GetStringAsync(uri);
    alreadyVisited.Add(uri);
    ...
    allContent.Add(someContent);
    ...
    var newLinks = FindAllCrawlableLinks(pageHtml);
    foreach (var newLink in newLinks) {
        await ScrapeContentRecAsync(newLink, allContent, alreadyVisited);
    }
}
As you can see, I await each new link that can be scraped (please don't suggest optimising this by launching parallel tasks or parallel calls, because I have been asked not to do that). So as soon as we find a new link, we recurse into it. The recursive call adds the newly scraped data to the allContent list, and the new link is added to alreadyVisited. In simple terms, it is a preorder DFS of the tree of web pages.
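To make the scenario concrete, here is a minimal, self-contained sketch of the sequential traversal. It is not the real scraper: the in-memory `Links` graph and `FetchLinksAsync` are hypothetical stand-ins for `httpClient.GetStringAsync` and `FindAllCrawlableLinks`, and the visited check is placed before the fetch, which is an assumption about the elided parts of the method.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class SequentialCrawlSketch
{
    // Hypothetical in-memory "site": page URI -> links found on that page.
    private static readonly Dictionary<string, string[]> Links =
        new Dictionary<string, string[]>
        {
            ["/"] = new[] { "/a", "/b" },
            ["/a"] = new[] { "/c", "/" },
            ["/b"] = new[] { "/c" },
            ["/c"] = Array.Empty<string>(),
        };

    // Stand-in for the HTTP call. Task.Yield forces a real asynchronous hop,
    // so the continuation may resume on a different thread pool thread.
    private static async Task<string[]> FetchLinksAsync(string uri)
    {
        await Task.Yield();
        return Links[uri];
    }

    public static async Task CrawlAsync(
        string uri, List<string> allContent, HashSet<string> alreadyVisited)
    {
        // HashSet.Add returns false when the item is already present.
        if (!alreadyVisited.Add(uri)) return;

        var newLinks = await FetchLinksAsync(uri);
        allContent.Add("content of " + uri);

        // Each child is awaited before the next one starts, so at most one
        // call is touching the collections at any point in time.
        foreach (var newLink in newLinks)
        {
            await CrawlAsync(newLink, allContent, alreadyVisited);
        }
    }

    public static async Task<List<string>> RunAsync()
    {
        var allContent = new List<string>();
        await CrawlAsync("/", allContent, new HashSet<string>());
        return allContent;
    }
}
```

Calling `await SequentialCrawlSketch.RunAsync()` on this toy graph visits the pages in the preorder `/`, `/a`, `/c`, `/b`.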
The application is a console application, so there is no SynchronizationContext and the default TaskScheduler is used, i.e. the code after the await will be executed on a thread pool thread.
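A small sketch for observing this behaviour, assuming a plain console host: it records the managed thread ID before and after an await. `Task.Delay` is only a stand-in for the HTTP call, and the class and method names are hypothetical. The IDs vary from run to run, so no particular output should be expected.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class ContinuationThreadSketch
{
    // Returns the managed thread IDs seen before and after an await, plus
    // whether the code after the await ran on a thread pool thread.
    public static async Task<(int Before, int After, bool AfterIsPoolThread)> ObserveAsync()
    {
        int before = Environment.CurrentManagedThreadId;
        await Task.Delay(10); // stand-in for the awaited HTTP request
        int after = Environment.CurrentManagedThreadId;
        return (before, after, Thread.CurrentThread.IsThreadPoolThread);
    }

    // Call this from Main in a console application.
    public static async Task RunAsync()
    {
        Console.WriteLine(
            "SynchronizationContext is null: " + (SynchronizationContext.Current == null));
        var (before, after, isPool) = await ObserveAsync();
        Console.WriteLine(
            $"before await: thread {before}, after await: thread {after}, pool thread: {isPool}");
    }
}
```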
Now, in the old-school way, whenever multiple threads added to a list, we used a lock, which ensured that only one thread at a time added to the list and also that any changes to the guarded variable were visible to other threads.
Since my continuations can be executed on any thread pool thread, there is a chance that different threads handle the recursive calls and add to both the List and the HashSet.
Will the changes made to the collections on one thread pool thread be visible to other threads?
Can there be concurrency issues in the above scenario?
If I had launched multiple recursive calls in parallel (as an optimisation), would I definitely have needed a thread-safe collection?
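For comparison only, this is roughly what the hypothetical parallel variant could look like. It is a sketch under assumptions, not something I intend to use: the in-memory `Links` graph and `FetchLinksAsync` are made-up stand-ins for the real HTTP and link-extraction code. Because several calls can now run at the same time, it swaps List and HashSet (which are not safe for concurrent writes) for `ConcurrentBag<T>` and a `ConcurrentDictionary<TKey, TValue>` used as a set.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class ParallelCrawlSketch
{
    // Hypothetical in-memory "site": page URI -> links found on that page.
    private static readonly Dictionary<string, string[]> Links =
        new Dictionary<string, string[]>
        {
            ["/"] = new[] { "/a", "/b" },
            ["/a"] = new[] { "/c", "/" },
            ["/b"] = new[] { "/c" },
            ["/c"] = Array.Empty<string>(),
        };

    // Stand-in for the HTTP call; only reads the dictionary, which is safe.
    private static async Task<string[]> FetchLinksAsync(string uri)
    {
        await Task.Yield();
        return Links[uri];
    }

    public static async Task CrawlAsync(
        string uri,
        ConcurrentBag<string> allContent,
        ConcurrentDictionary<string, byte> alreadyVisited)
    {
        // TryAdd is atomic: exactly one caller wins for a given URI.
        if (!alreadyVisited.TryAdd(uri, 0)) return;

        var newLinks = await FetchLinksAsync(uri);
        allContent.Add("content of " + uri);

        // All children are started before any of them is awaited, so their
        // continuations can run concurrently on different pool threads.
        await Task.WhenAll(
            newLinks.Select(link => CrawlAsync(link, allContent, alreadyVisited)));
    }

    public static async Task<List<string>> RunAsync()
    {
        var allContent = new ConcurrentBag<string>();
        await CrawlAsync("/", allContent, new ConcurrentDictionary<string, byte>());
        // ConcurrentBag is unordered, so sort for a stable result.
        return allContent.OrderBy(s => s, StringComparer.Ordinal).ToList();
    }
}
```

Note that the preorder guarantee of the sequential version is lost here: each page is still visited once, but the order in which content is collected is no longer deterministic.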