Does .NET framework offer methods to parse an HTML string?

Question

Given that I can't use HTMLAgilityPack, only plain .NET: say I have a string that contains some HTML, which I need to parse and edit in the following ways:

find specific controls in the hierarchy by id or by tag
modify (and ideally create) attributes of those found elements

Are there methods available in .NET to do so?

I know... [use regex](http://stackoverflow.com/a/1732454/119477) — Conrad Frix, Feb 27 '12 at 22:42
I don't know... don't use regex http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1758162#1758162 — L.B, Feb 27 '12 at 22:44
If your HTML happens to be XHTML, then you could use the standard XML libraries for parsing, traversing, and modifying it. — Douglas, Feb 27 '12 at 22:46
[MSHTML](http://msdn.microsoft.com/en-us/library/ie/bb498651%28v=vs.85%29.aspx) Here is a SO link: http://stackoverflow.com/a/56228/284240 — Tim Schmelter, Feb 27 '12 at 22:46
The short answer is no. The Agility Pack is the closest thing there is to a sanctioned (.NET) HTML parser. — porges, Feb 27 '12 at 22:47
Why "I can't use HTMLAgilityPack" ? Seems silly to rule out a very good (and even free) tool. — H H, Feb 27 '12 at 22:48
@Henk, because I'm working on a mobile platform with very restricted bandwidth and using HTMLAgilityPack would require a dll download that we can't afford. I am only wondering if there's a default set of HTML string parsing methods as part of .NET that I'm not aware of. — Jelly Ama, Feb 27 '12 at 23:00
@jelly - then list that platform with all details and versions. — H H, Feb 27 '12 at 23:01

Onur · Accepted Answer · 2012-02-27T23:32:01.853

5

HtmlDocument

GetElementById

HtmlElement

You can create a dummy HTML document.

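    // Requires a reference to System.Windows.Forms (WebBrowser, HtmlDocument, HtmlElement).
    WebBrowser w = new WebBrowser();
    w.Navigate(String.Empty);   // navigate to an empty page to get a document to write into
    HtmlDocument doc = w.Document;
    // Write the HTML string into the empty document.
    doc.Write("<html><head></head><body><img id=\"myImage\" src=\"c:\"/><a id=\"myLink\" href=\"myUrl\"/></body></html>");
    // Count the direct children of <body>, then look elements up by id.
    Console.WriteLine(doc.Body.Children.Count);
    Console.WriteLine(doc.GetElementById("myImage").GetAttribute("src"));
    Console.WriteLine(doc.GetElementById("myLink").GetAttribute("href"));
    Console.ReadKey();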

Output:

    2
    file:///c:
    about:myUrl

Editing elements:

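    // Look the element up by id, then rewrite its markup with the new src value.
    HtmlElement imageElement = doc.GetElementById("myImage");
    string newSource = "d:";
    imageElement.OuterHtml = imageElement.OuterHtml.Replace(
            "src=\"c:\"",
            "src=\"" + newSource + "\"");
    // Fetch the element again, since assigning OuterHtml replaces it in the document.
    Console.WriteLine(doc.GetElementById("myImage").GetAttribute("src"));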

Output:

    file:///d:
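As a possible alternative to string replacement on `OuterHtml`, `HtmlElement` also exposes `SetAttribute`, which covers both modifying and creating attributes. The following is only a minimal sketch under the same assumptions as the code above (a WinForms `WebBrowser` is available and `Navigate(String.Empty)` yields a writable document). The class name and the `alt` value are made up for illustration.

```csharp
using System;
using System.Windows.Forms;

// Hypothetical sketch class; not part of the original answer.
static class SetAttributeSketch
{
    [STAThread] // the WebBrowser control needs a single-threaded apartment
    static void Main()
    {
        using (WebBrowser w = new WebBrowser())
        {
            // Same setup as in the answer: an empty document we can write into.
            w.Navigate(String.Empty);
            HtmlDocument doc = w.Document;
            doc.Write("<html><head></head><body><img id=\"myImage\" src=\"c:\"/></body></html>");

            HtmlElement image = doc.GetElementById("myImage");
            if (image != null)
            {
                image.SetAttribute("src", "d:");          // modify an existing attribute
                image.SetAttribute("alt", "placeholder"); // create a new attribute (illustrative value)

                Console.WriteLine(image.GetAttribute("src"));
                Console.WriteLine(image.GetAttribute("alt"));
            }
        }
    }
}
```

This avoids depending on the exact text of the serialized markup, which the `Replace` call does. It still needs the WinForms control, so it does not help on platforms where `System.Windows.Forms` is unavailable.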

edited Feb 27 '12 at 23:32

answered Feb 27 '12 at 22:44

Onur


3

This requires you to load up the document in a Winforms control. – porges Feb 27 '12 at 22:46
Correct me if I'm wrong but this requires a webBrowser control and doesn't allow for direct HTML string parsing. – Jelly Ama Feb 27 '12 at 22:47
@JellyAma, yes, but isn't it what you seem to want in "modify (and ideally create) attributes of those found elements"? – Alexei Levenkov Feb 27 '12 at 22:49
@Alexei, most importantly, I need to parse strings of HTML. – Jelly Ama Feb 27 '12 at 23:02

score 1 · Answer 2 · answered Feb 27 '12 at 22:47

1

Assuming you're dealing with well-formed HTML (that is, markup that is also valid XML), you could simply treat the text as an XML document. The framework is loaded with features to do exactly what you're asking.

http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.aspx
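A minimal sketch of that idea, using LINQ to XML (`System.Xml.Linq`) rather than `XmlDocument`, purely as a matter of taste. It assumes the input is well-formed XML with no namespace declaration and no HTML-only entities such as `&nbsp;`. The sample markup, the class name, and the attribute values are invented for illustration.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical sketch class; only works if the HTML string is well-formed XML.
static class XhtmlEditSketch
{
    static void Main()
    {
        string html =
            "<html><head></head><body>" +
            "<img id=\"myImage\" src=\"c:\" />" +
            "<a id=\"myLink\" href=\"myUrl\"></a>" +
            "</body></html>";

        // Throws XmlException if the markup is not well-formed XML.
        XDocument doc = XDocument.Parse(html);

        // Find an element by id.
        XElement image = doc.Descendants()
            .FirstOrDefault(e => (string)e.Attribute("id") == "myImage");

        if (image != null)
        {
            image.SetAttributeValue("src", "d:");      // modify an existing attribute
            image.SetAttributeValue("alt", "example"); // create a new attribute
        }

        // Find elements by tag name and add an attribute to each.
        foreach (XElement link in doc.Descendants("a"))
        {
            link.SetAttributeValue("target", "_blank");
        }

        // Serialize the edited tree back to a string.
        Console.WriteLine(doc.ToString(SaveOptions.DisableFormatting));
    }
}
```

Real-world HTML often contains unclosed tags such as `<br>`, unquoted attributes, or undeclared entities, and `XDocument.Parse` will reject all of those, so this only fits input you control or know to be XHTML. If the document declares the XHTML namespace, the element names passed to `Descendants` would need that namespace as well.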


Doug


5

Try to parse this *well formed* html. `line1 <br> line2` – L.B Feb 27 '12 at 23:16

score 1 · Answer 3 · edited May 23 '17 at 11:46

1

Aside from the HTML Agility Pack, and porting HtmlUnit over to C#, the options that sound like solid solutions are:

Most obviously - use regex. (System.Text.RegularExpressions)
Using an XML parser. (Since HTML is a system of tags, treat it like an XML document?)
LINQ?

One thing I do know is that parsing HTML like XML may cause you to run into a few problems. XML and HTML are not the same. Read about it: here

Also, here is a post about LINQ vs. regex.

edited May 23 '17 at 11:46

Community


answered Feb 27 '12 at 23:07

Spencer


1

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1758162#1758162 – L.B Feb 27 '12 at 23:21

score 0 · Answer 4 · answered Feb 27 '12 at 22:51

You can look at how HTML Agility Pack works; it is itself plain .NET. If you reflect over the assembly, you can see that it relies on the framework's own class libraries, so it could be reproduced if you wanted to. But you'd be doing nothing more than moving the assembly's code into your own, not making it any more ".NET".
