Given that I can't use HtmlAgilityPack, only plain .NET: say I have a string containing some HTML that I need to parse and edit in the following ways:

  • find specific controls in the hierarchy by id or by tag
  • modify (and ideally create) attributes of those found elements

Are there methods available in .NET to do this?

Jelly Ama
  • I know... [use regex](http://stackoverflow.com/a/1732454/119477) – Conrad Frix Feb 27 '12 at 22:42
  • I don't know... don't use regex http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1758162#1758162 – L.B Feb 27 '12 at 22:44
  • If your HTML happens to be XHTML, then you could use the standard XML libraries for parsing, traversing, and modifying it. – Douglas Feb 27 '12 at 22:46
  • [MSHTML](http://msdn.microsoft.com/en-us/library/ie/bb498651%28v=vs.85%29.aspx) Here is a SO link: http://stackoverflow.com/a/56228/284240 – Tim Schmelter Feb 27 '12 at 22:46
  • The short answer is no. The Agility Pack is the closest thing there is to a sanctioned (.NET) HTML parser. – porges Feb 27 '12 at 22:47
  • Why "I can't use HTMLAgilityPack" ? Seems silly to rule out a very good (and even free) tool. – H H Feb 27 '12 at 22:48
  • @Henk, because I'm working on a mobile platform with very restricted bandwidth and using HTMLAgilityPack would require a dll download that we can't afford. I am only wondering if there's a default set of HTML string parsing methods as part of .NET that I'm not aware of. – Jelly Ama Feb 27 '12 at 23:00
  • @jelly - then list that platform with all details and versions. – H H Feb 27 '12 at 23:01

4 Answers

The relevant Windows Forms types are HtmlDocument and HtmlElement, together with the HtmlDocument.GetElementById method.

You can create a dummy HTML document:

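using System;
using System.Windows.Forms;

// WebBrowser wraps an ActiveX control, so this code must run on an STA thread
// (for example, a Main method marked with [STAThread]).
WebBrowser w = new WebBrowser();
w.Navigate(String.Empty);          // start from an empty document
HtmlDocument doc = w.Document;
doc.Write("<html><head></head><body><img id=\"myImage\" src=\"c:\"/><a id=\"myLink\" href=\"myUrl\"/></body></html>");
Console.WriteLine(doc.Body.Children.Count);                             // number of child elements of <body>
Console.WriteLine(doc.GetElementById("myImage").GetAttribute("src"));  // look up by id, then read an attribute
Console.WriteLine(doc.GetElementById("myLink").GetAttribute("href"));
Console.ReadKey();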

Output:

2

file:///c:

about:myUrl

Editing elements:

HtmlElement imageElement = doc.GetElementById("myImage");
string newSource = "d:";
imageElement.OuterHtml = imageElement.OuterHtml.Replace(
        "src=\"c:\"",
        "src=\"" + newSource + "\"");
Console.WriteLine(doc.GetElementById("myImage").GetAttribute("src"));

Output:

file:///d:
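
A possibly less brittle alternative to replacing text inside OuterHtml is HtmlElement.SetAttribute, which, as far as I know, also defines the attribute when the element does not have it yet. The sketch below assumes a document built as in the example above; the class name, method names, and the "title" attribute are hypothetical and only for illustration.

```csharp
using System.Windows.Forms;

// Hypothetical helper; assumes "doc" was obtained from a WebBrowser as shown above.
static class HtmlEditSketch
{
    // Find one element by id and set (or create) an attribute on it.
    public static bool SetAttributeById(HtmlDocument doc, string id, string name, string value)
    {
        HtmlElement element = doc.GetElementById(id);
        if (element == null)
        {
            return false; // no element with that id
        }
        element.SetAttribute(name, value);
        return true;
    }

    // Find all elements with a given tag name and set the same attribute on each.
    public static int SetAttributeByTag(HtmlDocument doc, string tagName, string name, string value)
    {
        int changed = 0;
        foreach (HtmlElement element in doc.GetElementsByTagName(tagName))
        {
            element.SetAttribute(name, value);
            changed++;
        }
        return changed;
    }
}
```

For example, `HtmlEditSketch.SetAttributeById(doc, "myImage", "src", "d:")` would replace the OuterHtml edit above, and `HtmlEditSketch.SetAttributeByTag(doc, "a", "title", "edited")` touches every link in the document.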

Onur

Assuming you're dealing with well-formed HTML (in effect, XHTML), you could simply treat the text as an XML document. The framework's XML classes can find elements and edit attributes, which is exactly what you're asking for.

http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.aspx
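
As a rough sketch of that idea, the code below loads the markup into an XmlDocument, finds an element by id with XPath or by tag name, and sets an attribute. It assumes the input is well-formed XML and that XmlDocument is available on the target platform; the class and method names are hypothetical.

```csharp
using System;
using System.Xml;

// Hypothetical helper; only works when the markup is well-formed XML.
static class XhtmlEditSketch
{
    // Sets (or creates) an attribute on the first element with the given id
    // and returns the edited markup. Returns the input unchanged if no match.
    public static string SetAttributeById(string xhtml, string id, string name, string value)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the markup is not well-formed

        // Simple XPath lookup; assumes the id contains no single quote.
        var element = doc.SelectSingleNode("//*[@id='" + id + "']") as XmlElement;
        if (element == null)
        {
            return xhtml;
        }

        element.SetAttribute(name, value); // adds the attribute when it is missing
        return doc.OuterXml;
    }

    // Counts the elements with the given tag name.
    public static int CountByTag(string xhtml, string tagName)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml);
        return doc.GetElementsByTagName(tagName).Count;
    }
}
```

Calling `XhtmlEditSketch.SetAttributeById(html, "myImage", "src", "d:")` changes an existing attribute, and passing a name the element does not have yet, such as "alt", creates it.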

Doug

Aside from using the HTML Agility Pack or porting HtmlUnit over to C#, the options that sound workable are:

  • Most obviously, use regex (System.Text.RegularExpressions).
  • Use an XML parser (HTML is a system of tags, so perhaps treat it like an XML document?).
  • LINQ?

One thing I do know is that parsing HTML as XML may cause you to run into a few problems, because XML and HTML are not the same. Read about it: here

Also, here is a post about LINQ vs. regex.
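
The caveat about treating HTML as XML can be seen with a small check: an XML parser rejects markup that browsers accept without complaint, such as an unclosed `<br>` or an `&nbsp;` entity. This is only an illustrative sketch; the class and method names are made up.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper showing where an XML parser gives up on ordinary HTML.
static class HtmlIsNotXml
{
    // Returns true only if the markup is well-formed XML.
    public static bool ParsesAsXml(string markup)
    {
        try
        {
            XDocument.Parse(markup);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

With this helper, `ParsesAsXml("<p>one<br/>two</p>")` succeeds, while `ParsesAsXml("<p>one<br>two</p>")` and `ParsesAsXml("<p>a&nbsp;b</p>")` both fail, so the XML route only suits markup you know to be XHTML.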

Spencer
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1758162#1758162 – L.B Feb 27 '12 at 23:21

You can look at how the HTML Agility Pack works; however, it is itself .NET. You can reflect the assembly and see that it is using the MFC, and you could reproduce it if you wanted to, but you'd be doing nothing more than moving the assembly, not making it any more .NET.

John