I want to crawl data from multiple URLs and store it in SQLite. Should I use Parallel.Invoke or a parallel foreach loop to crawl the URLs and fetch the data? I am confused about how to execute this part of my project. I am also struggling with how to start the part that crawls articles in different languages from a website.
- Start by working out how to crawl one URL at a time, and then once you've got that down you can branch out to parallel. – ProgrammingLlama Apr 07 '22 at 08:58
- I used HttpClient; is using Html Agility Pack to parse the HTML page the way forward? – Vedant iyer Apr 07 '22 at 09:00
- What version of .NET are you using? .NET 6? – Theodor Zoulias Apr 07 '22 at 09:01
- Are all the URLs known upfront, or might more URLs be added on the go? – Theodor Zoulias Apr 07 '22 at 09:02
- Yes, .NET 6 is what I am using. – Vedant iyer Apr 07 '22 at 09:03
- @TheodorZoulias Can we connect on LinkedIn, if you don't mind? – Vedant iyer Apr 07 '22 at 09:05
- Is this question helpful? [Parallel foreach with asynchronous lambda](https://stackoverflow.com/questions/15136542/parallel-foreach-with-asynchronous-lambda). Particularly Majid Shahabfar's answer. – Theodor Zoulias Apr 07 '22 at 09:06
- @TheodorZoulias Is there any other way I can reach out to you? If you can help me out with this, it would be very helpful. – Vedant iyer Apr 07 '22 at 09:07
- @TheodorZoulias Yes, the article is helpful for sure, but if I may ask: to store the data from these URLs I will obviously use SQLite and create two methods, one to insert data and one to query data, and using Entity Framework I will create a schema and store the articles from these URLs? – Vedant iyer Apr 07 '22 at 09:13
- Vedant, we are here to help you solve specific problems. This is a Q&A site. If you are in a situation where you don't know what to do or where to start, it will be difficult to get help here. You might want to search for tutorials on the internet, or find a suitable forum and start a discussion there. – Theodor Zoulias Apr 07 '22 at 09:14
- Bear in mind that you are making requests to a server. Be respectful or you will probably be banned. Don't make too many requests in a short time. – Victor Apr 07 '22 at 09:15
- Please provide enough code so others can better understand or reproduce the problem. – Community Apr 07 '22 at 10:46
1 Answer
The choice between the TPL (Task Parallel Library) and async/await comes down to whether your work is CPU-bound (computing multiple things in parallel) or I/O-bound (interacting with files or making network requests).
Since you want to crawl multiple URLs, your job is I/O-bound, which makes it a good candidate for async/await. You could then request all (or a subset) of your list concurrently. Some example code would look something like this:
// A single shared HttpClient instance, reused across requests.
private static readonly HttpClient httpClient = new();

public async Task<IReadOnlyList<string>> GetContent(IEnumerable<string> urls)
{
    // Start one download task per URL and await them all together.
    var tasks = urls.Select(GetContent);
    return await Task.WhenAll(tasks);
}

private async Task<string> GetContent(string url)
{
    var content = await httpClient.GetStringAsync(url);
    return content;
}
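Since you are on .NET 6, another option is `Parallel.ForEachAsync`, which lets you cap the number of concurrent requests via `MaxDegreeOfParallelism` — useful given the warning in the comments about not hammering the server. The sketch below assumes nothing about your actual site; the `fetch` delegate is a hypothetical stand-in for the real HTTP call (e.g. `(url, ct) => httpClient.GetStringAsync(url, ct)`), and `Crawler`/`CrawlAsync` are illustrative names:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class Crawler
{
    // Processes every URL with a bounded number of concurrent operations.
    public static async Task<IReadOnlyDictionary<string, string>> CrawlAsync(
        IEnumerable<string> urls,
        Func<string, CancellationToken, Task<string>> fetch,
        int maxParallel = 4)
    {
        var results = new ConcurrentDictionary<string, string>();
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };

        await Parallel.ForEachAsync(urls, options, async (url, ct) =>
        {
            // At most maxParallel of these bodies are in flight at once.
            results[url] = await fetch(url, ct);
        });

        return results;
    }
}
```

Keeping the degree of parallelism small (here 4) throttles how quickly you hit the remote server while still overlapping the I/O waits.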

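For the storage side asked about in the comments, a minimal sketch using the `Microsoft.Data.Sqlite` NuGet package could look like the following. The `Articles` table and its columns (`Url`, `Language`, `Content`) are purely illustrative assumptions, not a schema from the question; you could equally define the model with Entity Framework:

```csharp
using Microsoft.Data.Sqlite; // NuGet package: Microsoft.Data.Sqlite

public static class ArticleStore
{
    // Creates the table if needed, then inserts or updates one crawled article.
    public static void SaveArticle(string dbPath, string url, string language, string content)
    {
        using var conn = new SqliteConnection($"Data Source={dbPath}");
        conn.Open();

        using (var create = conn.CreateCommand())
        {
            create.CommandText =
                "CREATE TABLE IF NOT EXISTS Articles (" +
                "Url TEXT PRIMARY KEY, Language TEXT, Content TEXT)";
            create.ExecuteNonQuery();
        }

        using var insert = conn.CreateCommand();
        // Upsert keyed on the URL, so re-crawling a page replaces its content.
        insert.CommandText =
            "INSERT INTO Articles (Url, Language, Content) VALUES ($url, $lang, $content) " +
            "ON CONFLICT(Url) DO UPDATE SET Language = $lang, Content = $content";
        insert.Parameters.AddWithValue("$url", url);
        insert.Parameters.AddWithValue("$lang", language);
        insert.Parameters.AddWithValue("$content", content);
        insert.ExecuteNonQuery();
    }
}
```

Using the URL as the primary key means crawling the same page twice doesn't create duplicate rows; always pass values through parameters rather than string concatenation.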
Oliver