String to HtmlDocument

Question

I'm fetching the html document by URL using WebClient.DownloadString(url) but then its very hard to find the element content that I'm looking for. Whilst reading around I've spotted HtmlDocument and that it has neat things like GetElementById. How can I populate an HtmlDocument with the html returned by url?

@corei11: http://stackoverflow.com/a/1732454/34397 – SLaks Nov 07 '16 at 18:27

score 33 · Answer 1 · edited May 23 '17 at 12:34

Using Html Agility Pack as suggested by SLaks, this becomes very easy:

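// using HtmlAgilityPack;  (NuGet package: HtmlAgilityPack)
string html = webClient.DownloadString(url);
var doc = new HtmlDocument();
doc.LoadHtml(html);

// Note the spelling: the library's method is GetElementbyId (lowercase "b").
HtmlNode specificNode = doc.GetElementbyId("nodeId");
// SelectNodes returns null when nothing matches the XPath expression.
HtmlNodeCollection nodesMatchingXPath = doc.DocumentNode.SelectNodes("x/path/nodes");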
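
As a minimal sketch of the id lookup with a null check, assuming the HtmlAgilityPack NuGet package is referenced: the class name `AgilityPackSketch`, the helper `TextById`, and the sample markup are made up for illustration. Note that the library spells the method `GetElementbyId`, with a lowercase "b".

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

static class AgilityPackSketch
{
    // Hypothetical helper: returns the inner text of the element with the
    // given id, or null when no such element exists.
    public static string TextById(string html, string id)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // GetElementbyId returns null when the id is not present.
        HtmlNode node = doc.GetElementbyId(id);
        return node?.InnerText;
    }

    static void Main()
    {
        // Illustrative markup only; in the question this string would come
        // from WebClient.DownloadString(url).
        const string html = "<html><body><p id=\"greeting\">Hello</p></body></html>";

        Console.WriteLine(TextById(html, "greeting"));
        Console.WriteLine(TextById(html, "missing") ?? "(not found)");
    }
}
```

Returning null for a missing element keeps the caller in charge of deciding whether that is an error.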

edited May 23 '17 at 12:34 by Community

answered Feb 08 '11 at 16:34 by Dan Tao

score 32 · Accepted Answer · edited Nov 23 '17 at 14:48

The HtmlDocument class (System.Windows.Forms.HtmlDocument) is a wrapper around the native IHTMLDocument2 COM interface.
It has no public constructor, so you cannot easily create it from a string.

You should use the HTML Agility Pack.

edited Nov 23 '17 at 14:48 by carla

answered Feb 08 '11 at 16:18 by SLaks

Since @dhsto has given the accurate answer to this question, I cannot see how this answer can be correct. – ThunderGr Feb 22 '14 at 10:25

score 24 · Answer 3 · answered Aug 24 '13 at 22:35

To answer the original question:

// Requires a reference to Microsoft.mshtml and `using mshtml;`
HTMLDocument doc = new HTMLDocument();
IHTMLDocument2 doc2 = (IHTMLDocument2)doc;
doc2.write(fileText); // fileText holds the HTML string
// now use doc

Then to convert back to a string:

string html = doc.documentElement.outerHTML;

answered Aug 24 '13 at 22:35 by David Sherret

It seems like it is not possible to instantiate the `HTMLDocument` like that. – Steinfeld Nov 05 '14 at 09:08
@Steinfeld I just did another test and it works for me. Make sure you are `using mshtml;`. It's `Microsoft.mshtml` in the references dialogue. I'm using version `7.0.3300` – David Sherret Nov 05 '14 at 14:43
Thanks, I did it a few hours ago and it certainly worked. However, I tried Agility Pack and it seems much more "user friendly" =] – Steinfeld Nov 05 '14 at 14:46
@Steinfeld yeah, it definitely is! The mshtml library is a huge pain, but it can be good enough for doing simple things. – David Sherret Nov 05 '14 at 14:52
This works, but it will try to open an external `about:blank` page on my environment. – hillin Dec 12 '14 at 06:03
@hillin that's extremely strange. I've never had that happen. – David Sherret Dec 12 '14 at 19:29

score 20 · Answer 4 · answered Oct 30 '15 at 19:07

If you don't want to use HTML Agility Pack and want to get an HtmlDocument from a string using only native .NET code, here is a good article on how to convert a string to an HtmlDocument.

Here is the code block to use:

public System.Windows.Forms.HtmlDocument GetHtmlDocument(string html)
{
    WebBrowser browser = new WebBrowser();
    browser.ScriptErrorsSuppressed = true;
    browser.DocumentText = html;
    browser.Document.OpenNew(true);
    browser.Document.Write(html);
    browser.Refresh();
    return browser.Document;
}
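
One thing the snippet above leaves implicit is that WebBrowser is an ActiveX-based control: it expects a single-threaded apartment (STA) thread, and it loads `DocumentText` asynchronously. The following is only a sketch of one way to account for that outside a Windows Forms application, not a verified replacement for the answer's code. The class name `WebBrowserParsingSketch`, the helper `GetTitle`, and the five-second timeout are assumptions made for illustration.

```csharp
using System;
using System.Threading;
using System.Windows.Forms;

static class WebBrowserParsingSketch
{
    // Hypothetical helper: loads the HTML into a WebBrowser on a dedicated
    // STA thread and returns the document title (null if loading timed out).
    public static string GetTitle(string html)
    {
        string title = null;

        var worker = new Thread(() =>
        {
            using (var browser = new WebBrowser { ScriptErrorsSuppressed = true })
            {
                browser.DocumentText = html;

                // Pump window messages until the control reports that the
                // document is complete, or the (arbitrary) deadline passes.
                DateTime deadline = DateTime.UtcNow.AddSeconds(5);
                while (browser.ReadyState != WebBrowserReadyState.Complete
                       && DateTime.UtcNow < deadline)
                {
                    Application.DoEvents();
                    Thread.Sleep(10);
                }

                if (browser.ReadyState == WebBrowserReadyState.Complete)
                {
                    // Read everything needed while still on the STA thread.
                    title = browser.Document.Title;
                }
            }
        });

        worker.SetApartmentState(ApartmentState.STA); // must be set before Start
        worker.Start();
        worker.Join();
        return title;
    }
}
```

The sketch returns a plain string rather than the HtmlDocument itself, because the document object belongs to the control and the thread that created it.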

answered Oct 30 '15 at 19:07 by Nikhil Gaur

I no longer work in .NET environments so can't test to see if this works. However, I'll happily accept it as the answer if someone else in the community can verify this for me. Thanks for picking this up so many years on! XD – lappy Mar 01 '16 at 15:20
Actually, I was searching for this solution but couldn't find any way of doing this without third-party libraries. In the end this code worked for me and I'm using it in my app. I hope this will help some people like me :) – Nikhil Gaur Mar 01 '16 at 15:27

score 4 · Answer 5 · answered Jun 21 '18 at 17:14

I've adapted Nikhil's answer somewhat to simplify it. Admittedly, I have not run it through a .NET compiler, and there are likely very good reasons for the lines Nikhil put in that I have omitted. However, at least in my use case (a very simple page), they were unnecessary.

My use case was a quick PowerShell script:

# May be needed on some PowerShell versions before WebBrowser can be created
Add-Type -AssemblyName System.Windows.Forms
# Get the HTML document from a web server
$htmlText = (New-Object System.Net.WebClient).DownloadString("<URI HERE>")
$browser = New-Object System.Windows.Forms.WebBrowser
$browser.DocumentText = $htmlText
$browser.Document.Write($htmlText)
$response = $browser.Document

In my case, this returned an HtmlDocument object containing HtmlElement objects, instead of the __ComObject types (which are a challenge to use in PowerShell class code) returned by a call to Invoke-WebRequest in PS 5.1.14393.1944.

I believe the equivalent C# code is:

public System.Windows.Forms.HtmlDocument GetHtmlDocument(string html)
{
    WebBrowser browser = new WebBrowser();
    browser.DocumentText = html;
    browser.Document.Write(html);
    return browser.Document;
}

This is good, but you need to run: [void][reflection.assembly]::LoadWithPartialName("System.Windows.Forms") before you can create a System.Windows.Forms.WebBrowser object – Frank Lesniak Aug 08 '18 at 20:58
@Frank Lesniak Are you sure that's not version dependent? I haven't needed to use LoadWithPartialName unless I'm calling a DLL which isn't part of the .NET assembly cache. I certainly didn't use it with this code (the PowerShell version). Or are you saying that call is necessary for the C# code? I would have expected a project library and a using statement. – Takophiliac Aug 10 '18 at 18:20
Hey Takophiliac, you might be onto something. On PowerShell 5.1, I too can create a new System.Net.WebClient object. Unfortunately I do not remember what version I was testing on... but I do a lot of downlevel / backward-compatible work, so it very well could have been PowerShell v1 or v2. – Frank Lesniak Oct 10 '18 at 03:22

score 2 · Answer 6 · answered Jun 27 '12 at 23:45

You could get an HtmlDocument (here, HtmlAgilityPack.HtmlDocument) like this:

 System.Net.WebClient wc = new System.Net.WebClient();

 System.IO.Stream stream = wc.OpenRead(url);
 System.IO.StreamReader reader = new System.IO.StreamReader(stream);
 string s = reader.ReadToEnd();

 HtmlDocument doc = new HtmlDocument();
 doc.LoadHtml(s);

So you have lookups by id and by name, but for anything further you're better off with the HTML Agility Pack's XPath support, as suggested. For example, you can call doc.DocumentNode.SelectNodes(xpathselector), or use a regex to parse the document.

By the way: why not regex? It's very cool if you can use it right, but XPath is also very powerful, so "choose your poison".
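
To make the XPath option concrete, here is a small sketch that collects the `href` of every link in a document, assuming the HtmlAgilityPack NuGet package is referenced. The class name `XPathSketch` and the helper `LinkTargets` are hypothetical names chosen for the example.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

static class XPathSketch
{
    // Hypothetical helper: returns the href value of every <a> element
    // that has an href attribute, in document order.
    public static List<string> LinkTargets(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var hrefs = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return hrefs;
        }

        foreach (HtmlNode anchor in anchors)
        {
            hrefs.Add(anchor.GetAttributeValue("href", string.Empty));
        }
        return hrefs;
    }

    static void Main()
    {
        // Illustrative markup only.
        const string html = "<p><a href=\"/first\">one</a> <a href=\"/second\">two</a> <a>no target</a></p>";
        foreach (string href in LinkTargets(html))
        {
            Console.WriteLine(href);
        }
    }
}
```

The null check matters in practice: iterating the result of `SelectNodes` directly throws when the expression matches nothing.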

cu

answered Jun 27 '12 at 23:45 by womd

HtmlDocument doesn't seem to have .LoadHtml() for me – Photonic Oct 08 '15 at 09:10
@Photonic But for me it does. Working here. – C4d May 18 '16 at 20:52
@C4u what namespace is your `HtmlDocument` in? I'm using `System.Windows.Forms.HtmlDocument` and there's no `LoadHtml()`. – Scott Baker Jan 13 '17 at 18:40
There we got the difference. It is `HtmlAgilityPack.HtmlDocument`. – C4d Jan 16 '17 at 10:46
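
Since both libraries name their class `HtmlDocument`, a file that references both can use aliases to keep them apart. This is only a sketch, assuming the project references both the HtmlAgilityPack package and System.Windows.Forms; the alias names `HapDocument` and `FormsDocument` and the class `NamespaceSketch` are invented for illustration.

```csharp
using System;
// Aliases make it explicit which HtmlDocument each line refers to.
using HapDocument = HtmlAgilityPack.HtmlDocument;
using FormsDocument = System.Windows.Forms.HtmlDocument;

static class NamespaceSketch
{
    static void Main()
    {
        // Only the Agility Pack class has a public constructor and LoadHtml().
        var doc = new HapDocument();
        doc.LoadHtml("<p id=\"x\">hi</p>");

        Console.WriteLine(typeof(HapDocument).FullName);   // HtmlAgilityPack.HtmlDocument
        Console.WriteLine(typeof(FormsDocument).FullName); // System.Windows.Forms.HtmlDocument
    }
}
```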

score 0 · Answer 7 · answered Feb 08 '11 at 16:22

You could try calling OpenNew and then Write, but that's a somewhat unusual use of that class. More info on MSDN.

answered Feb 08 '11 at 16:22 by Beku

But you can't create an instance at all. That requires an existing instance. – SLaks Feb 08 '11 at 16:32
I put this in the form's Load handler: webBrowser1.DocumentText = Properties.Resources.HtmlContent; – radsdau Jul 14 '15 at 04:12
@SLaks wb = new webbrower(); wb.DocumentText(""); htmldoc = wb.Document().OpenNew(true); htmldoc.Write(""); <= this works for me just fine – MattyMatt Nov 20 '19 at 21:42
