I want to get the HTML of a web page. I have the XPaths of two elements in that HTML that I want to read. I have little to no knowledge of this topic.

When searching, I keep seeing examples that load the URL and put the HTML into a string. Since I have the two XPaths, I believe it would be better to download the page as an HTML document rather than as a string. Or am I wrong?

using (WebClient client = new WebClient()) {
    string s = client.DownloadString(url);
}

So how do I download the HTML of a web page into an HTML document that I can search?

Hakan Fıstık
mHelpMe
  • Possible duplicate of [What is the best way to parse html in C#?](http://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c) – mason Apr 06 '17 at 12:19
  • @mason I should have added that I would like to do this without using any 3rd party code. I can't download 3rd party libraries at my workplace – mHelpMe Apr 06 '17 at 12:31
  • Could you give some information about XPath queries? – levent Apr 06 '17 at 12:36
  • @levent I got the XPath idea from this question http://stackoverflow.com/questions/18065526/pulling-data-from-a-webpage-parsing-it-for-specific-pieces-and-displaying-it – mHelpMe Apr 06 '17 at 12:37
  • That's silly. Why not use a library dedicated to the task? – mason Apr 06 '17 at 12:47
  • :) What I actually want to know is what kind of search you will do in the HTML. The answer at the link you mentioned is already clear enough. – levent Apr 06 '17 at 12:48
  • @mason sorry, what is silly? – mHelpMe Apr 06 '17 at 13:16
  • Not being able to use 3rd party code. Why do they do that? Security? You'll introduce far more security bugs by writing your own implementation. You'll also be far less productive. There's a reason 3rd party libraries exist - it's far more efficient to reuse what already exists than to reinvent the wheel. – mason Apr 06 '17 at 13:18
  • @mason preaching to the converted! Sadly they disagree & quote security at me – mHelpMe Apr 06 '17 at 13:24
  • @mHelpMe Open source 3rd party components should be secure enough. – Vojtěch Dohnal Apr 06 '17 at 13:46

2 Answers

This is how I do it.

  1. First, define your URL in a string variable.
  2. Then download the page content as a string with the HttpWebRequest class.
  3. I use HtmlAgilityPack, so you should add it to your project (using NuGet, for example).
  4. Create an HtmlDocument object and load the data into it.
  5. Now you can navigate your HtmlDocument.

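     // Requires: using System.IO; using System.Net; using System.Text;
     // plus a reference to the HtmlAgilityPack package.
     // WebRequest.Create needs an absolute URI, so the scheme is included.
     string urlAddress = "http://url.com";

     HttpWebRequest request = (HttpWebRequest)WebRequest.Create(urlAddress);
     string data = "";

     // The using blocks close the response and the reader even if reading throws.
     using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
     {
         if (response.StatusCode == HttpStatusCode.OK)
         {
             // Use UTF-8 (the StreamReader default) when the server reports no charset.
             Encoding encoding = string.IsNullOrEmpty(response.CharacterSet)
                 ? Encoding.UTF8
                 : Encoding.GetEncoding(response.CharacterSet);

             using (Stream receiveStream = response.GetResponseStream())
             using (StreamReader readStream = new StreamReader(receiveStream, encoding))
             {
                 data = readStream.ReadToEnd();
             }
         }
     }

     // Fully qualified to avoid a clash with System.Windows.Forms.HtmlDocument.
     HtmlAgilityPack.HtmlDocument document2 = new HtmlAgilityPack.HtmlDocument();
     document2.LoadHtml(data);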
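
For the last step, the two elements can then be looked up by XPath on the loaded document. This is only a minimal sketch, assuming HtmlAgilityPack is referenced; the `XPathLookup` class, the `ReadText` helper, the sample markup and both XPath strings are placeholders for illustration, not the asker's real page or paths.

```csharp
using System;
using HtmlAgilityPack;

public static class XPathLookup
{
    // Hypothetical helper: returns the trimmed inner text of the first node
    // matching the XPath, or null when nothing matches
    // (SelectSingleNode returns null in that case).
    public static string ReadText(string html, string xpath)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        HtmlNode node = document.DocumentNode.SelectSingleNode(xpath);
        return node == null ? null : node.InnerText.Trim();
    }

    public static void Main()
    {
        // Placeholder markup standing in for the downloaded page.
        string html =
            "<html><body><div id='price'>42</div>" +
            "<span class='name'>Widget</span></body></html>";

        Console.WriteLine(ReadText(html, "//div[@id='price']"));
        Console.WriteLine(ReadText(html, "//span[@class='name']"));
    }
}
```

Checking for null before reading `InnerText` matters here, because an XPath copied from a browser's developer tools may not match the raw HTML the server actually sends.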
    
tadej

You can use StreamWriter to write downloaded data into a file:

string s = string.Empty;
using (WebClient client = new WebClient())
{
    // Assign to the outer variable; declaring "string s" again here does not compile.
    s = client.DownloadString(url);
}

// Write the downloaded markup to test.html as UTF-8.
using (FileStream fs = new FileStream("test.html", FileMode.Create))
using (StreamWriter w = new StreamWriter(fs, Encoding.UTF8))
{
    w.WriteLine(s);
}
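
Saving the file does not by itself make the markup searchable by XPath. If external libraries are not allowed and the page happens to be well-formed XML (XHTML), the built-in `System.Xml.Linq` classes may be enough. The sketch below rests on that assumption: the `XhtmlXPath` class, the `ReadText` helper, the sample markup and the XPath strings are hypothetical, and the sample has no DOCTYPE and no XML namespace. Ordinary real-world HTML is often not well-formed, in which case `XDocument.Parse` throws an `XmlException`.

```csharp
using System;
using System.Xml.Linq;
using System.Xml.XPath; // provides the XPathSelectElement extension method

public static class XhtmlXPath
{
    // Hypothetical helper: parses well-formed markup and returns the trimmed
    // text of the first element matching the XPath, or null when none matches.
    public static string ReadText(string xhtml, string xpath)
    {
        // Throws XmlException when the markup is not well-formed XML.
        XDocument document = XDocument.Parse(xhtml);

        XElement element = document.XPathSelectElement(xpath);
        return element == null ? null : element.Value.Trim();
    }

    public static void Main()
    {
        // Placeholder markup; a real page would come from DownloadString.
        string xhtml =
            "<html><body><div id=\"price\">42</div>" +
            "<span class=\"name\">Widget</span></body></html>";

        Console.WriteLine(ReadText(xhtml, "//div[@id='price']"));
        Console.WriteLine(ReadText(xhtml, "//span[@class='name']"));
    }
}
```

If the page declares the XHTML namespace, un-prefixed XPath names will not match its elements, so the query would need a namespace-aware variant.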
Doğa Gençer
  • You've asked how to download html into a file and then downvoted my answer which contains the exact information even without using any external libraries? Not really cool. – Doğa Gençer Apr 06 '17 at 12:53
  • **it would be better to download the html of the web page as a html document** - he tries to say he needs to find out how to parse the html document as structured document searchable by xpath. https://www.w3schools.com/xml/xpath_intro.asp. I agree that the question is poorly formulated. – Vojtěch Dohnal Apr 06 '17 at 13:40