9

Is there any way to get AngleSharp not to create a full HTML document when it parses a fragment? For example, if I parse:

<title>The Title</title>

I get a full HTML document in DocumentElement.OuterHtml:

<html><head><title>The Title</title></head><body></body></html>

If I parse:

<p>The Paragraph</p>

I get another full HTML document:

<html><head></head><body><p>The Paragraph</p></body></html>

Notice that AngleSharp is smart enough to know where my fragment should go. In one case, it puts it in the HEAD tag, and in the other case, it puts it in the BODY tag.

This is clever, but if I just want the fragment back out, I don't know where to get it. So, I can't just call Body.InnerHtml because depending on the HTML I parsed, my fragment might be in the Head.InnerHtml instead.

Is there a way to get AngleSharp to not create a full document, or is there some other way to get my isolated fragment back out after parsing?

Deane

2 Answers

6

It is possible now. Below is an example copied from https://github.com/AngleSharp/AngleSharp/issues/594

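// Namespace as of AngleSharp 0.9.x; newer releases moved HtmlParser.
using AngleSharp.Parser.Html;

var fragment = "<script>deane</script><div>deane</div>";
var p = new HtmlParser();
var dom = p.Parse("<html><body></body></html>"); // host document that supplies the context element
var nodes = p.ParseFragment(fragment, dom.Body); // returns only the fragment's nodes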

The second parameter of ParseFragment specifies the context element in which the fragment is parsed. In your case you will need to parse the <title> in the context of dom.Head and the <p> in the context of dom.Body.
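As a rough sketch of that idea applied to the two fragments from the question: the snippet below parses each one against a different context element and prints the markup of the resulting nodes. It assumes the AngleSharp 0.9.x API used above (`HtmlParser.Parse`, `HtmlParser.ParseFragment`); `RoundTrip` is a hypothetical helper name, not part of AngleSharp, and it only collects element nodes, so bare text at the top level of a fragment would be skipped.

```csharp
using System;
using System.Linq;
using AngleSharp.Dom;
using AngleSharp.Parser.Html; // assumption: AngleSharp 0.9.x namespace

static class FragmentDemo
{
    // Hypothetical helper: parse a fragment in the given context element
    // and join the outer HTML of the top-level elements it produced.
    static string RoundTrip(HtmlParser parser, string fragment, IElement context)
    {
        var nodes = parser.ParseFragment(fragment, context);
        return string.Concat(nodes.OfType<IElement>().Select(e => e.OuterHtml));
    }

    static void Main()
    {
        var parser = new HtmlParser();

        // An empty input still yields a document with head and body elements,
        // which is all that is needed to supply the context elements.
        var dom = parser.Parse(string.Empty);

        // Head-level content is parsed in the context of <head>.
        Console.WriteLine(RoundTrip(parser, "<title>The Title</title>", dom.Head));

        // Flow content is parsed in the context of <body>.
        Console.WriteLine(RoundTrip(parser, "<p>The Paragraph</p>", dom.Body));
    }
}
```

The caller still has to know which context fits a given fragment; this sketch does not try to detect that automatically.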

Oh wow, it turns out I have just copied the OP's own code.

jakubiszon
    FYI you can omit the markup from the first parse: `var dom = p.Parse(string.Empty)` and you still get a barebones HTML doc. – Ben Feb 12 '20 at 10:03
2

I have learned that this is not possible. AngleSharp is designed to generate a DOM exactly as the HTML spec requires. If you create an HTML document containing the markup above, open it in a browser, and inspect the DOM, you'll find exactly the same structure. AngleSharp is in compliance.

What you can do is parse it as XML with errors suppressed, which should cause the document to self-correct dirty HTML issues, and give you a "clean" document which can then be manipulated.

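// Namespace as of AngleSharp 0.9.x; newer releases moved the XML parser.
using AngleSharp.Parser.Xml;

var html = "<x><y><z>foo</y></z></x>"; // deliberately mis-nested markup
var options = new XmlParserOptions()
{
    IsSuppressingErrors = true // recover from errors instead of throwing
};
var dom = new XmlParser(options).Parse(html);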

There is one problem with this approach: it doesn't handle entities perfectly, meaning it still throws some errors on them even when errors are suppressed. It's on the list to be fixed.

Here's the GitHub issue that led to this answer:

https://github.com/AngleSharp/AngleSharp/issues/398

Deane