
I have a working process that scrapes 20,000 URLs, but it currently takes around 20 minutes to complete. I am looking for a way to improve this processing time.

I have experimented with Thread.Start(), but have settled on ThreadPool.QueueUserWorkItem.

StartScraping() contains a loop that iterates through many SKUs; each SKU resolves to the URL of a product page whose stock status we want to check. We are concerned only with whether the stock is still available.

GetUrlHtmlText() is the method that does the actual HTML scrape using HttpWebRequest. We determine whether the product is available by checking for the presence of a button with the content "Tükendi", which indicates that the stock has been depleted (sold out).

Main Loop

public void StartScraping()
    {
        try
        {
            int counter = 0;
            Random rnd = new Random();
            Debug.WriteLine("Starting => " + BrandName);
            foreach (var item in competitionItemList.competitionItems)
            {
                //Thread thread = new Thread(() => StartPerSku(BrandID, item, validProxyList[rnd.Next(0, validProxyList.Count - 1)], StockCheckStatus));
                //thread.Start();
                
                ThreadPool.QueueUserWorkItem(state => StartPerSku(BrandID, item, validProxyList[rnd.Next(0, validProxyList.Count)], StockCheckStatus)); // Random.Next's upper bound is exclusive, so use Count (not Count - 1) or the last proxy is never picked
                request++;
                Debug.WriteLine(request);
                counter++;
                if (counter == 10)
                {
                    validProxy = false;
                    proxyIndex++;
                    counter = 0;
                    Thread.Sleep(100);
                    if (validProxyList.Count - 1 <= proxyIndex)
                    {
                        proxyIndex = 0;
                    }
                }
                // tasks.Add(thread); 
            }
        }
        catch (System.Exception ex)
        {
            db.InsertRekabetWinServError(90002, "StartScraping() ==> " + ex.ToString());
            Debug.WriteLine("LetStartAllTask => " + ex.ToString());
        }
    }
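The loop above queues an unbounded number of blocking work items onto the thread pool. As the comments below suggest, one common alternative is to make the per-SKU work asynchronous and cap the number of requests in flight with a `SemaphoreSlim`. A minimal sketch under assumptions: `StartPerSkuAsync` is a hypothetical async counterpart of `StartPerSku`, and `PickProxy()` stands in for the proxy-rotation logic.

```csharp
// Sketch only: throttle async scraping instead of unbounded QueueUserWorkItem.
// Assumes using System.Linq; System.Threading; System.Threading.Tasks;
// StartPerSkuAsync and PickProxy are hypothetical stand-ins for the code above.
private readonly SemaphoreSlim throttle = new SemaphoreSlim(50); // at most 50 requests in flight

public async Task StartScrapingAsync()
{
    var tasks = competitionItemList.competitionItems.Select(async item =>
    {
        await throttle.WaitAsync();
        try
        {
            await StartPerSkuAsync(BrandID, item, PickProxy(), StockCheckStatus);
        }
        finally
        {
            throttle.Release(); // always free the slot, even if the request failed
        }
    }).ToList();

    await Task.WhenAll(tasks); // completes once every SKU has been checked
}
```

With this shape the concurrency limit (50 here, a guess) can be tuned against your bandwidth and proxy capacity, rather than being dictated by thread-pool growth.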

Scrape logic

private async Task<bool> GetUrlHtmlText()
    {
        if (errorStatus)
        {
            htmlInnerText = "";
            try
            {
                ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12 | SecurityProtocolType.Ssl3;
                ServicePointManager.ServerCertificateValidationCallback += (s, cert, ch, sec) => { return true; };
                ServicePointManager.DefaultConnectionLimit = 10000;

                HttpWebRequest httpRequest = WebRequest.CreateHttp(uri);
                //byte[] bytes = System.Text.Encoding.ASCII.GetBytes(requestXml);

                httpRequest.CookieContainer = new CookieContainer();
                httpRequest.Timeout = 30000;
                httpRequest.AllowAutoRedirect = true;
                httpRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
                httpRequest.ServicePoint.Expect100Continue = false;
                httpRequest.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36";
                httpRequest.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
                httpRequest.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip, deflate;q=0.8");
                httpRequest.Headers.Add(HttpRequestHeader.CacheControl, "no-cache");
                if (proxy != "0")
                {
                    httpRequest.Proxy = new WebProxy(proxy);
                }
                try
                {
                    using (HttpWebResponse httpResponse = (HttpWebResponse)httpRequest.GetResponse())
                    {
                        if (httpResponse != null)
                        {
                            using (var reader = new System.IO.StreamReader(httpResponse.GetResponseStream(), Encoding.UTF8))
                            {
                                // Read the response first, then inspect it; the original code
                                // tested htmlInnerText (still empty at this point) before
                                // populating it, so the "barcode" branch could never be reached.
                                htmlInnerText = reader.ReadToEnd();
                                if (htmlInnerText.Contains("barcode"))
                                {
                                    if (htmlInnerText.Contains(">Tükendi</button>"))
                                    {
                                        tApiResult = "Tükendi";
                                        return await Task.FromResult(false);
                                    }
                                    return await Task.FromResult(true);
                                }
                                else
                                {
                                    return await Task.FromResult(false);
                                }
                            }
                        }
                        else
                        {
                            tApiResult = "404 Page";
                            return await Task.FromResult(false);
                        }
                    }
                }
                catch (WebException ex)
                {
                    tApiResult = "404 Page";
                    Debug.WriteLine("GetUrlHtmlText => (" + barcode + "| Link : " + uri + " ) " + ex.ToString());
                    if (/*!ex.ToString().Contains("402") || !ex.ToString().Contains("502") ||*/ !ex.ToString().Contains("404"))
                    {
                        tApiResult = proxy;
                    }
                    return await Task.FromResult(false);
                }
            }
            catch (Exception ex)
            {
                tApiResult = "404 Page";
                Debug.WriteLine(ex.ToString());
                db.InsertRekabetWinServError(OBJ_CompetitionItem.ID, "GetUrlHtmlText ==> " + ex.ToString());
                //MultiFuncComp.Manager.RekabetBrandManager.mobileNotification.SendNotification("<=A little Exception=>" + barcode, "ex str :>>  " + ex.ToString());
                tApiResult = proxy;
                return await Task.FromResult(false);
            }
        }
        else
        {
            return await Task.FromResult(false);
        }
    }
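As several comments point out, `HttpWebRequest` wrapped in `Task.FromResult` is synchronous code in async clothing: each request still blocks a thread while waiting on the network. A rough `HttpClient` sketch of the same sold-out check (the "barcode" and "Tükendi" markers are taken from the code above; `IsInStockAsync` is an illustrative name, not a drop-in replacement):

```csharp
// Sketch: one shared HttpClient, reused across all requests, doing the same check asynchronously.
private static readonly HttpClient client = new HttpClient(new HttpClientHandler
{
    AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate
});

private async Task<bool> IsInStockAsync(Uri uri)
{
    try
    {
        string html = await client.GetStringAsync(uri); // truly async: no thread blocked while waiting
        if (!html.Contains("barcode"))
            return false;                               // page layout not recognised
        return !html.Contains(">Tükendi</button>");     // "Tükendi" = sold out
    }
    catch (HttpRequestException)
    {
        return false;                                   // treat network failures as unavailable
    }
}
```

One caveat for this design: `HttpClientHandler.Proxy` is fixed per handler, so rotating proxies would mean keeping one `HttpClient` per proxy rather than one per request.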

I have almost 400 Mbps downstream and upstream speed.

Computer features: (screenshot of system specifications not included)

EDIT:

This question was originally titled "How can I scrape 20.000 url same time in c#" and was closed with the reason: "Opinion-based - discussions focused on diverse opinions are great, but they just don't fit our format well."

I'm changing my question because I already know how to scrape; I am looking for advice specifically on how to improve the speed of this process.

Chris Schaller
uayazzz

  • Maybe this would be better off asked on [Code Review](https://codereview.stackexchange.com/)? – Trevor Apr 27 '21 at 14:59
  • There are a lot of things here that cause delays: using raw threads to make I/O calls, using WebRequest, and faked "async" calls when all the code is synchronous. – Panagiotis Kanavos Apr 27 '21 at 15:00
  • You can try to query a few dozen URLs at the same time, but if you try to query thousands, then your internet connection will become congested. More parallelism does not always mean more speed. – Olivier Jacot-Descombes Apr 27 '21 at 16:04
  • You may find this interesting: [How to limit the amount of concurrent async I/O operations?](https://stackoverflow.com/questions/10806951/how-to-limit-the-amount-of-concurrent-async-i-o-operations) – Theodor Zoulias Apr 27 '21 at 16:53
  • @PanagiotisKanavos thank you so much for your answer, Could you have a chance to give some detail so that I can fully understand? – uayazzz Apr 27 '21 at 20:41
  • @OlivierJacot-Descombes Do you have any suggestions so that I can progress without clogging? Can a bigger bandwidth solve my problem then? – uayazzz Apr 27 '21 at 20:42
  • With `HttpClient` + `async/await` + .NET Core 3.1 or newer you can at least double the throughput. Also, you're killing TLS security; that's not good. – aepot Apr 27 '21 at 20:42
  • A tip: you can remove `async` and all the `await`s in `GetUrlHtmlText()` and make it just `private Task<bool> GetUrlHtmlText()` with `return Task.FromResult(...)`. The method is synchronous anyway, but currently spends resources on the `async` state machine. Ultimately you could return a plain `bool` without wrapping it in a `Task` at all. – aepot Apr 27 '21 at 21:06
  • Linked question: https://stackoverflow.com/a/63206351/12888024 – aepot Apr 27 '21 at 21:11
  • @uayazzz what is the *actual* logic? Are you simply trying to find a page with the string `">Tükendi"` ? Or do you want to do something with the pages? – Panagiotis Kanavos Apr 28 '21 at 07:36
  • @PanagiotisKanavos Hello, I'm scanning the HTML page for the "correct" return value. Pages that return "Tükendi" (sold out) get "false", and I take no action for those. So I'm trying to weed out pages that don't contain "barcode" and ">Tükendi<". – uayazzz May 03 '21 at 11:30
  • @aepot I will try it; if I get a positive result I will write it up here. – uayazzz May 03 '21 at 11:32

0 Answers