I am trying to download URL content using the following method:

public static async Task<string> getURL(string link)
{
    string result = "";
    using (HttpClient client = new HttpClient())
    using (HttpResponseMessage response = await client.GetAsync(link))
    using (HttpContent content = response.Content)
    {
        result = await content.ReadAsStringAsync();
    }
    return result;
}

Previously I loaded the URL directly with the HtmlWeb.Load() method of HtmlAgilityPack, but it takes a lot of time, and I cannot put the code in a parallel for or foreach loop (a number of exceptions are thrown, and the program ends after a few hundred iterations; I tried even 3 parallel threads, with no improvement).

After searching on the internet, I found that writing my own URL download method might be a good idea. But I am not sure how to pass one URL to the above method, which I copied from here, and get the URL content back. Any ideas?

Edit: The caller method is as follows:

public static void Download(string link)
{
    HtmlWeb htmlWeb = new HtmlWeb();
    htmlWeb.OverrideEncoding = Encoding.UTF8;
    HtmlAgilityPack.HtmlDocument document = new HtmlDocument();
    document = htmlWeb.Load(getURL(link));
    if(document != null)
    {
        if(document.DocumentNode.SelectSingleNode("//div[@class='urdu_results']") != null)
            Console.WriteLine(link);
            Console.WriteLine(count--);
        {
            if(document.DocumentNode.SelectNodes(".//div[@class='u']") != null && document.DocumentNode.SelectNodes(".//div[@class='r']") != null)
            {
                var uNodes = document.DocumentNode.SelectNodes(".//div[@class='u']");
                var rNodes = document.DocumentNode.SelectNodes(".//div[@class='r']");
                if(uNodes.Count == rNodes.Count)
                {
                    for(int i=0;i<uNodes.Count;i++)
                    {
                        string u = uNodes[i].InnerText.Trim();
                        string r = rNodes[i].InnerText.Trim();
                        string word = u+"\t"+r;
                        if(!words.Contains(word))
                        {
                            File.AppendAllText(output, word+Environment.NewLine);
                            Console.WriteLine(r);
                            words.Add(word);
                        }
                    }
                }
            }
        }
    }
}
Shakir
  • Reuse a single shared static HttpClient in your class. **Do not** instantiate a new HttpClient or wrap it in a using statement. As for your other performance issues, it's hard to tell since you are posting a single method and not the caller. – maccettura Aug 09 '17 at 16:35
  • Added the caller method. It heavily uses Html Agility Pack, but I want to improve at least the URL download. – Shakir Aug 09 '17 at 16:43
  • You have a couple of problems off the bat: you aren't using a single shared static HttpClient like I mentioned earlier, your `Download` method is not async, and you are not awaiting the `getUrl()` method. – maccettura Aug 09 '17 at 16:45
  • Yup. I have never worked with such a method before, so I have no idea how to correct it. I didn't understand the code snippet in `getURL()`; it was probably a bad idea to copy and paste it into my code. – Shakir Aug 09 '17 at 16:49
  • I would look into some tutorials on async/await. A few good resources are [here](https://blog.stephencleary.com/2012/02/async-and-await.html), [here](https://msdn.microsoft.com/library/hh191443(vs.110).aspx) and [here](https://stackoverflow.com/questions/14455293/how-and-when-to-use-async-and-await). Basically you will need to change your Download method to: `public static async Task Download(string link)` and then put the keyword `await` in front of `getUrl()`: `document = htmlWeb.Load(await getURL(link));` – maccettura Aug 09 '17 at 16:53
  • Created a shared HttpClient and removed the using statements. There is still a compiler problem. `public static string getURL(string link) { string result = ""; HttpResponseMessage response = await Client.GetAsync(link); HttpContent content = response.Content; { result = await content.ReadAsStringAsync(); } return result; }` – Shakir Aug 09 '17 at 17:05
  • `public static async Task<string> getURL(string link)` solved the problem. Not sure if this download method works, as nothing happens in the output. – Shakir Aug 09 '17 at 17:13
  • You don't have to remove all the using statements, just the one for HttpClient – maccettura Aug 09 '17 at 17:51
  • I did it. There is something wrong with htmlWeb.Load(): it takes a URL string, but I need a method that accepts the HTML string I have already downloaded. – Shakir Aug 09 '17 at 17:53
  • You probably want `HtmlDocument` instead of `HtmlWeb`: `var doc = new HtmlDocument(); doc.LoadHtml(html);` – maccettura Aug 09 '17 at 17:58
  • Yes, but it does not enforce UTF-8, and all Unicode text gets corrupted in the HttpClient download. Now looking for an encoding solution. – Shakir Aug 09 '17 at 18:06
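
Pulling the suggestions from the comments together, a minimal sketch of the download step might look like the following. It is not a tested fix for this scraper: the class name `PageDownloader`, the method name `GetUrlAsync`, and the fallback URL are hypothetical, and it assumes the target pages are actually served as UTF-8. It keeps one shared `HttpClient`, awaits the request, and decodes the raw bytes as UTF-8 explicitly instead of relying on whatever charset the response headers declare.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

public static class PageDownloader
{
    // One HttpClient shared by every request, as suggested in the comments.
    private static readonly HttpClient Client = new HttpClient();

    // Hypothetical helper: downloads a page and returns its body as a string.
    public static async Task<string> GetUrlAsync(string link)
    {
        // Only the response is disposed; the shared client stays alive.
        using (HttpResponseMessage response = await Client.GetAsync(link))
        {
            // Throws HttpRequestException on a non-success status code.
            response.EnsureSuccessStatusCode();

            // Assumption: the page is UTF-8. Decode the bytes explicitly
            // rather than trusting the charset in the response headers.
            byte[] bytes = await response.Content.ReadAsByteArrayAsync();
            return Encoding.UTF8.GetString(bytes);
        }
    }

    public static async Task Main(string[] args)
    {
        // Placeholder URL for illustration only.
        string link = args.Length > 0 ? args[0] : "https://example.com/";
        string html = await GetUrlAsync(link);
        Console.WriteLine(html.Length);
    }
}
```

The returned string can then be handed to HtmlAgilityPack with `var doc = new HtmlDocument(); doc.LoadHtml(html);`, as mentioned in the comments, so that `HtmlWeb.Load()` is no longer needed. The caller would have to be `async Task` as well and `await` this method.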
