I'm fetching the HTML document by URL using WebClient.DownloadString(url), but then it's very hard to find the element content that I'm looking for. Whilst reading around I've spotted HtmlDocument and that it has neat things like GetElementById. How can I populate an HtmlDocument with the HTML returned by url?

-
+1 for not trying regex. – SLaks Feb 08 '11 at 16:22
-
@SLaks Why is that? – corei11 Nov 07 '16 at 18:20
-
@corei11: http://stackoverflow.com/a/1732454/34397 – SLaks Nov 07 '16 at 18:27
7 Answers
Using Html Agility Pack as suggested by SLaks, this becomes very easy:
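// Requires the HtmlAgilityPack NuGet package and `using HtmlAgilityPack;`
string html = webClient.DownloadString(url);
var doc = new HtmlDocument();
doc.LoadHtml(html);
// Note the lowercase "b": the Agility Pack method is GetElementbyId.
HtmlNode specificNode = doc.GetElementbyId("nodeId");
// SelectNodes returns null (not an empty collection) when nothing matches.
HtmlNodeCollection nodesMatchingXPath = doc.DocumentNode.SelectNodes("x/path/nodes");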
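As a rough, self-contained sketch of the same idea, the snippet below parses a hard-coded HTML string instead of a downloaded one, so it can be tried without a network call. It assumes the HtmlAgilityPack NuGet package is installed; the `HtmlLookup` class, its `TextById` and `TextsByXPath` helpers, and the sample markup are hypothetical names made up for illustration, not part of the library.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

// Hypothetical helper class for illustration only.
static class HtmlLookup
{
    // Returns the inner text of the element with the given id, or null if there is none.
    public static string TextById(string html, string id)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // The Agility Pack spells this method GetElementbyId (lowercase "b").
        HtmlNode node = doc.GetElementbyId(id);
        return node?.InnerText;
    }

    // Returns the inner text of every node matching the XPath expression.
    public static List<string> TextsByXPath(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var results = new List<string>();
        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes != null)
        {
            foreach (HtmlNode node in nodes)
            {
                results.Add(node.InnerText);
            }
        }
        return results;
    }
}

static class Program
{
    static void Main()
    {
        // Sample markup standing in for the string returned by DownloadString(url).
        string html = "<html><body><h1 id=\"title\">Hello</h1>"
                    + "<p class=\"note\">first</p><p class=\"note\">second</p></body></html>";

        Console.WriteLine(HtmlLookup.TextById(html, "title"));
        foreach (string text in HtmlLookup.TextsByXPath(html, "//p[@class='note']"))
        {
            Console.WriteLine(text);
        }
    }
}
```

The null check around SelectNodes is the part most worth copying: forgetting it is a common source of NullReferenceException when the XPath matches nothing.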
The HtmlDocument class is a wrapper around the native IHTMLDocument2 COM interface.
You cannot easily create it from a string.
You should use the HTML Agility Pack.
-
Since @dhsto has given the accurate answer to this question, I cannot see how this answer can be correct. – ThunderGr Feb 22 '14 at 10:25
To answer the original question:
// Requires a reference to Microsoft.mshtml and `using mshtml;`
HTMLDocument doc = new HTMLDocument();
IHTMLDocument2 doc2 = (IHTMLDocument2)doc;
doc2.write(fileText);
// now use doc
Then to convert back to a string:
string result = doc.documentElement.outerHTML;

-
It seems like it is not possible to instantiate the `HTMLDocument` like that. – Steinfeld Nov 05 '14 at 09:08
-
@Steinfeld I just did another test and it works for me. Make sure you are `using mshtml;`. It's `Microsoft.mshtml` in the references dialogue. I'm using version `7.0.3300` – David Sherret Nov 05 '14 at 14:43
-
Thanks, I did it a few hours ago and it certainly worked. However, I tried the Agility Pack and it seems much more "user friendly" =] – Steinfeld Nov 05 '14 at 14:46
-
@Steinfeld yeah, it definitely is! The mshtml library is a huge pain, but it can be good enough for doing simple things. – David Sherret Nov 05 '14 at 14:52
-
This works, but it will try to open an external `about:blank` page in my environment. – hillin Dec 12 '14 at 06:03
For those who don't want to use the HTML Agility Pack and want to get an HtmlDocument from a string using native .NET code only, here is a good article on how to convert a string to an HtmlDocument.
Here is the code block to use:
public System.Windows.Forms.HtmlDocument GetHtmlDocument(string html)
{
    WebBrowser browser = new WebBrowser();
    browser.ScriptErrorsSuppressed = true;
    browser.DocumentText = html;
    browser.Document.OpenNew(true);
    browser.Document.Write(html);
    browser.Refresh();
    return browser.Document;
}

-
I no longer work in .NET environments so can't test to see if this works. However, I'll happily accept it as the answer if someone else in the community can verify this for me. Thanks for picking this up so many years on! XD – lappy Mar 01 '16 at 15:20
-
Actually, I was searching for this solution but couldn't find any way of doing this without third-party libraries. In the end this code worked for me and I am using it in my app. I hope this will help some guys like me :) – Nikhil Gaur Mar 01 '16 at 15:27
I've adapted Nikhil's answer somewhat to simplify it. Admittedly, I have not run it through a .NET compiler, and there are likely very good reasons for the lines Nikhil put in which I have omitted. However, at least in my use case (a very simple page) they were unnecessary.
My use case was for a quick PowerShell script:
$htmlText = $(New-Object System.Net.WebClient).DownloadString("<URI HERE>") # Get the HTML document from a web server
$browser = New-Object System.Windows.Forms.WebBrowser
$browser.DocumentText = $htmlText
$browser.Document.Write($htmlText)
$response = $browser.document
For my case, this returned an HTMLDocument object with HTMLElement objects in it, instead of the __ComObject object types (which are a challenge to use in PowerShell class code) returned by a call to Invoke-WebRequest in PS 5.1.14393.1944.
I believe the equivalent C# code is:
public System.Windows.Forms.HtmlDocument GetHtmlDocument(string html)
{
    WebBrowser browser = new WebBrowser();
    browser.DocumentText = html;
    browser.Document.Write(html);
    return browser.Document;
}

-
This is good, but you need to run: [void][reflection.assembly]::LoadWithPartialName("System.Windows.Forms") before you can create a System.Windows.Forms.WebBrowser object – Frank Lesniak Aug 08 '18 at 20:58
-
@Frank Lesniak Are you sure that's not version dependent? I haven't needed to use LoadWithPartialName unless I'm calling a DLL which isn't part of the .NET assembly cache. I certainly didn't use it with this code (the PowerShell version). Or are you saying that call is necessary for the C# code? I would have expected a project library and a using statement. – Takophiliac Aug 10 '18 at 18:20
-
Hey Takophiliac, you might be onto something. On PowerShell 5.1, I too can create a new System.Net.WebClient object. Unfortunately I do not remember what version I was testing on... but I do a lot of downlevel / backward-compatible work, so it very well could have been PowerShell v1 or v2. – Frank Lesniak Oct 10 '18 at 03:22
You could get an HtmlDocument with:
System.Net.WebClient wc = new System.Net.WebClient();
System.IO.Stream stream = wc.OpenRead(url);
System.IO.StreamReader reader = new System.IO.StreamReader(stream);
string s = reader.ReadToEnd();
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(s);
That gives you lookups by ID and by name, but for anything further you'd be better off with the HTML Agility Pack, as suggested. For example, you can do: doc.DocumentNode.SelectNodes(xpathselector)
or use regex to parse the document.
By the way, why not regex? It's really cool if you can use it right, but XPath is also very powerful, so "choose your poison".
cu

-
@C4u what namespace is your `HtmlDocument` in? I'm using `System.Windows.Forms.HtmlDocument` and there's no `LoadHtml()`. – Scott Baker Jan 13 '17 at 18:40
-
You could try OpenNew followed by Write, but that's a somewhat strange use of that class. More info on MSDN.

-
But you can't create an instance at all. That requires an existing instance. – SLaks Feb 08 '11 at 16:32
-
I put this in the form's Load handler: webBrowser1.DocumentText = Properties.Resources.HtmlContent; – radsdau Jul 14 '15 at 04:12
-
@SLaks wb = new WebBrowser(); wb.DocumentText = ""; htmldoc = wb.Document.OpenNew(true); htmldoc.Write(""); <= this works for me just fine – MattyMatt Nov 20 '19 at 21:42