Can I use multithreading and parallel programming for web scraping?

Question

I am having a hard time understanding multithreading and parallel programming. I have a small application (a scraper) that uses Selenium with C# .NET. I have a file that contains business addresses. My scraper looks up each company's name and website, then does a second scrape for a generic email address on the company's site.

Here is the issue. If I did this manually, it would take me 3 years to complete 50,000 records. I did the math. Lol. That's why I created the scraper. A normal console application took 5 to 6 days to complete. Then I decided that maybe multithreading and parallel programming could reduce the time.

So, I did a small sample test. I noticed that 1 record took 10 seconds to finish, and 10 records took 100 seconds. My question is: why did multithreading take the same time?

I am not sure if my expectations and understanding of multithreading are wrong. I thought that Parallel.ForEach would launch all ten records and finish in 10 seconds, saving me 90 seconds. Is this the correct assumption? Can someone please clarify how multithreading and parallel programming actually work?

private static List<GoogleList> MultiTreadMain(List<FileStructure> values)
{
    List<GoogleList> ListGInfo = new List<GoogleList>();
    var threads = new List<Thread>();
    Parallel.ForEach(values, value =>
    {
        if (value.ID <= 10)
        {
            List<GoogleList> SingleListGInfo = new List<GoogleList>();
            var threadDesc = new Thread(() =>
            {
                lock (lockObjDec)
                {
                    SingleListGInfo = LoadBrowser("https://www.google.com", value.Address, value.City, value.State,
                                                  value.FirstName, value.LastName,
                                                  "USA", value.ZipCode, value.ID);
                    SingleListGInfo.ForEach(p => ListGInfo.Add(p));
                }
            });
            threadDesc.Name = value.ID.ToString();
            threadDesc.Start();
            threads.Add(threadDesc);
        }
    });

    while (threads.Count > 0)
    {
        for (var x = (threads.Count - 1); x > -1; x--)
        {
            if (((Thread)threads[x]).ThreadState == System.Threading.ThreadState.Stopped)
            {
                ((Thread)threads[x]).Abort();
                threads.RemoveAt(x);
            }
        }
        Thread.Sleep(1);
    }

    return ListGInfo;
}

Multithreading is not always faster. First, your network latency doesn't get any shorter. It can actually get worse, because you're increasing traffic on your network connection. Second, multithreading doesn't reduce the time the server takes to respond to a request. It can actually increase it because of the added load on the server. Third, Google *CPU context switching*. — Ken White, Oct 03 '21 at 23:01
If you have CPU-intensive work, use Parallel.ForEach. If you have I/O (reading/writing HTTP, files, or any other async source), use Tasks. Assuming you are just scraping websites, you should use the async + Task paradigm, because there is no need to wait 10 seconds on a full-fledged thread of the kind Parallel spawns for CPU-intensive work. Tasks are light, and they process async responses from websites by being signaled back rather than spin-lock waiting. In my experience, your main concerns in scraping are async, memory pooling where possible, and many IPs. — eocron, Oct 03 '21 at 23:01
> I thought by using parallel.Foreach will launch all ten record and finish at 10 sec saving me 90 sec. Yes, that assumption is correct. If your code behaves differently, the problem lies somewhere else. — Avo Nappo, Oct 03 '21 at 23:02
`So, I did a small sample test.` We can't comment on code that we can't see. — mjwills, Oct 03 '21 at 23:20
Is this .NET Core or Framework? Which version? Console or web app (yes, it makes a difference)? — mjwills, Oct 03 '21 at 23:21
Last time I checked (quite a long time ago), Selenium did not react well to multithreading. Some libraries and components are designed to be called by one thread only, and Selenium may be one of them. — Theodor Zoulias, Oct 03 '21 at 23:24
Your first mistake was to use multi-threading with I/O bound operations. — , Oct 03 '21 at 23:27
Thanks, all, for responding. Your insight is helping me understand better. — SANOSUKE, Oct 04 '21 at 00:52

score -1 · Answer 1 · answered Oct 03 '21 at 23:50

This is probably not the answer to the specific problem you are facing, but it might be a hint toward the general question "why isn't multithreading faster". Let's say that Selenium has a public class EdgeDriver implemented like this:

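public class EdgeDriver
{
    // Static, so one lock object is shared by every EdgeDriver instance.
    private static readonly object _locker = new();

    public void GoToUrl(string url)
    {
        // Only one thread at a time, across all instances, can get past this point.
        lock (_locker)
        {
            GoToUrlInternal(url);
        }
    }

    internal void GoToUrlInternal(string url) { /* ... */ }
}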

You, as a consumer of the class, cannot see the private _locker field or the internal methods. These are implementation details, hidden from you, and the only way to know what this class is doing is by reading the documentation. So if the implementation looks like the contrived example above, any attempt to speed up your program by creating multiple EdgeDriver instances and invoking their GoToUrl method in a Parallel.ForEach loop will be for naught. The lock on a static object ensures that only one thread at a time is allowed to invoke GoToUrlInternal, and all the other threads have to wait for their turn. In this situation the calls are said to be "serialized". And that's just one of the many possible reasons why multithreaded code may not be faster than code running on a single thread.
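The effect of a shared lock can be reproduced without Selenium. The following is a minimal sketch, not anything taken from the Selenium library: `SerializedCallsDemo`, `Measure`, and `SharedLock` are hypothetical names, and `Thread.Sleep` stands in for the real work. Each worker thread "works" for a fixed delay, either inside a lock shared by all threads or outside any lock.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;

public static class SerializedCallsDemo
{
    // Hypothetical shared lock, playing the role of the static _locker above.
    private static readonly object SharedLock = new object();

    // Starts `workers` threads that each sleep for delayMs (simulated work)
    // and returns the total elapsed time in milliseconds.
    public static long Measure(bool useSharedLock, int workers, int delayMs)
    {
        var threads = new List<Thread>();
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < workers; i++)
        {
            var thread = new Thread(() =>
            {
                if (useSharedLock)
                {
                    // Every thread queues up here, so the sleeps run one after another.
                    lock (SharedLock) { Thread.Sleep(delayMs); }
                }
                else
                {
                    // No shared lock: the sleeps overlap.
                    Thread.Sleep(delayMs);
                }
            });
            thread.Start();
            threads.Add(thread);
        }
        foreach (var thread in threads) thread.Join();
        stopwatch.Stop();
        return stopwatch.ElapsedMilliseconds;
    }

    public static void Main()
    {
        Console.WriteLine($"with shared lock:    {Measure(true, 5, 200)} ms");
        Console.WriteLine($"without shared lock: {Measure(false, 5, 200)} ms");
    }
}
```

With the shared lock, the total time should grow roughly with the number of workers times the delay, because only one thread can hold the lock at a time. Without it, the total should stay close to a single delay. A `lock` that wraps the whole unit of work in your own code has the same serializing effect as one hidden inside a library.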

Thank you for your answer. Does that mean that if I remove "Lock" it will not become a sequential process? It will speed up the process? — SANOSUKE, Oct 04 '21 at 00:51
@SANOSUKE the `lock` is hypothetical. I have no specific knowledge about the internals of the Selenium library. If there is indeed a `lock` there, for whatever reason, you won't be able to do anything about it yourself. You'll have to contact the library authors and ask for guidance. One possible answer you might get is: "this behavior is by design". — Theodor Zoulias, Oct 04 '21 at 01:14

score -3 · Answer 2 · answered Oct 05 '21 at 21:09

I hope the code snippet below gives you some direction. I am dividing the work between the records in the List of FileStructure. Based on the problem statement, I don't think a lock is necessary here.

private static List<GoogleList> MultiTreadMain(List<FileStructure> values)
{
    var tasks = new List<Task<List<GoogleList>>>();
    // Same 10-record sample as in the question (the property is ID, not Id).
    var toBeScraped = values.Where(p => p.ID <= 10);
    Parallel.ForEach(toBeScraped, value =>
    {
        Task<List<GoogleList>> task = Task<List<GoogleList>>.Factory.StartNew(() =>
        {
            return ProcessRequestAsync(value);
        });
        // Caution: List<T> is not thread-safe, and this Add runs on several threads.
        tasks.Add(task);
    });

    // Wait for every task; the result is one List<GoogleList> per task.
    var mergedTask = Task.WhenAll(tasks);
    List<GoogleList> ListGInfo = new List<GoogleList>();

    foreach (var item in mergedTask.GetAwaiter().GetResult())
    {
        ListGInfo.AddRange(item);
    }

    return ListGInfo;
}

// Synchronous despite the name: it blocks until LoadBrowser returns.
public static List<GoogleList> ProcessRequestAsync(FileStructure value)
{
    // The caller merges the per-record results, so nothing shared is touched here.
    List<GoogleList> SingleListGInfo = LoadBrowser("https://www.google.com", value.Address, value.City, value.State,
                                                   value.FirstName, value.LastName,
                                                   "USA", value.ZipCode, value.ID);
    return SingleListGInfo;
}

Why are you using a parallel loop in order to create some tasks? Creating tasks with the `Task.Factory.StartNew` method is blazingly fast, and creating them in a simple `foreach` loop will be nearly instantaneous. By using a parallel loop your code now has thread-safety issues. The `List` class [is not thread safe](https://stackoverflow.com/questions/5020486/listt-thread-safety). — Theodor Zoulias, Oct 05 '21 at 22:00
The problem I have is that a "foreach" goes through the records one by one. With something as small as 10 records, that is fast, but the target is to go over 50,000 records. Those records come from the file, which is why I call it a List of "FileStructure" in the MultiTreadMain method. The idea is to send 10 or 100 records at the same time so that the process finishes in less than the 6 days it takes on a single thread. I am not very familiar with multithreading. I looked on the internet and it gave me ideas on how to do it, but I noticed the examples were always related to file writing or reading. This case is not. — SANOSUKE, Oct 07 '21 at 15:02
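One way to "send 10 or 100 records at the same time" is to cap the number of in-flight operations with a `SemaphoreSlim` and await them all with `Task.WhenAll`. The following is a minimal sketch under stated assumptions: `ThrottledScraper`, `ScrapeAllAsync`, and `ScrapeOneAsync` are hypothetical names, `Task.Delay` simulates the network wait, and it assumes each scrape can be expressed as an asynchronous operation, which the thread does not confirm for the Selenium-based `LoadBrowser`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledScraper
{
    // Hypothetical stand-in for one I/O-bound scrape; Task.Delay simulates the network wait.
    public static async Task<string> ScrapeOneAsync(int id)
    {
        await Task.Delay(100);
        return $"result-{id}";
    }

    // Runs scrapeOne for every id, with at most maxConcurrency operations in flight.
    public static async Task<List<string>> ScrapeAllAsync(
        IEnumerable<int> ids, int maxConcurrency, Func<int, Task<string>> scrapeOne)
    {
        using var gate = new SemaphoreSlim(maxConcurrency);
        var tasks = ids.Select(async id =>
        {
            await gate.WaitAsync();          // wait for a free slot without blocking a thread
            try { return await scrapeOne(id); }
            finally { gate.Release(); }      // free the slot even if the scrape throws
        }).ToList();                         // ToList starts all the tasks

        // Task.WhenAll returns the results in the same order as the input tasks.
        string[] results = await Task.WhenAll(tasks);
        return results.ToList();
    }

    public static async Task Main()
    {
        var results = await ScrapeAllAsync(Enumerable.Range(1, 20), 10, ScrapeOneAsync);
        Console.WriteLine($"{results.Count} records scraped");
    }
}
```

Each task returns its own result and nothing is added to a shared list from multiple threads, so no lock is needed around the work itself. The concurrency limit is a tuning knob: raising it only helps until the network, the target server, or the browser automation becomes the bottleneck.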
