I am using HtmlAgilityPack 1.11.18 under .NET Core 2.2.

I want to remove all HTML attributes from <p> nodes in an HTML fragment (not a complete document).

var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(input);

var pNodes = htmlDoc.DocumentNode.SelectNodes("//p");

foreach (var node in pNodes)
{
    node.Attributes.Remove();
}

return htmlDoc.Text;

This is not doing the trick, am I missing something? The method returns a string which should be the fragment minus the attributes on all <p> elements.

I realize you are not supposed to use RegEx to parse HTML but these are small fragments and I would prefer a RegEx method so I can remove the dependency on HtmlAgilityPack, which I only brought in to handle this cleanly.

Patrick
  • Also relevant: https://stackoverflow.com/questions/20390901/c-sharp-htmlagilitypack-inner-html-dont-change-after-appending-node – Progman Feb 09 '20 at 21:10
  • @Progman No, `InnerHtml()` is a property of the HTML node, I need to wipe out all attributes and then return the resulting `HtmlDocument` minus the attributes. Not just on a single node. `Attributes.Remove()` is a `void`. – Patrick Feb 09 '20 at 21:23
  • The issue might be similar (cached values). Please add a [mcve] to your question which shows that the content is not replaced. – Progman Feb 09 '20 at 21:26

1 Answer

I would prefer a RegEx method so I can remove the dependency on HtmlAgilityPack, which I only brought in to handle this cleanly.

So why not use a regex for such a task? It sounds like you just want to replace <p[^>]*> with <p>.*
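A minimal sketch of that replacement in C#, using only System.Text.RegularExpressions. The class and method names (FragmentCleaner, StripPAttributes) are hypothetical, and the pattern is an assumption that tightens <p[^>]*> slightly so that tags such as <pre> are left alone. It is not a general HTML parser and will misbehave if an attribute value or comment inside the tag contains a > character.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of any library.
public static class FragmentCleaner
{
    // Matches "<p>" or "<p" followed by whitespace and anything up to the next ">".
    // Requiring whitespace after "p" keeps tags like <pre> or <param> from matching.
    private static readonly Regex OpeningP =
        new Regex(@"<p(\s[^>]*)?>", RegexOptions.IgnoreCase);

    // Returns the fragment with every opening <p ...> tag reduced to a bare <p>.
    public static string StripPAttributes(string html)
    {
        return OpeningP.Replace(html, "<p>");
    }
}
```

With this sketch, FragmentCleaner.StripPAttributes("<div><p class=\"ok\">text</p></div>") returns <div><p>text</p></div>, and a fragment such as <pre class="x">a</pre> is returned unchanged.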

This is not doing the trick, am I missing something?

Yes. The HtmlDocument class is more of a basic class that holds everything the HTML Agility Pack needs to know about the document in order to parse it. htmlDoc.Text is the source text that was loaded, so changes made to the DOM structure afterwards are not reflected there. I tend to use return htmlDoc.DocumentNode.WriteTo(); as the most proper way, instead of returning htmlDoc.Text.

Try the example below:

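private static string foo()
{
    var htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml("<div><p class=\"ok\">text</p></div>");

    // SelectNodes returns null (not an empty collection) when nothing matches.
    var pNodes = htmlDoc.DocumentNode.SelectNodes("//p");

    if (pNodes != null)
    {
        foreach (var node in pNodes)
        {
            // Remove() with no arguments clears every attribute on the node.
            node.Attributes.Remove();
        }
    }

    // Serialize the modified DOM; htmlDoc.Text would still hold the original input.
    return htmlDoc.DocumentNode.WriteTo();
}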

*As @Progman mentioned, this is a bad idea. Here is an example of why:

  • Input: <div><p class=\"ok\" <!-- comment-->>text</p></div> (you can put anything in the comment, and a regex would not handle that)
  • Output from HTML Agility Pack: <div><p></p><!-- comment-->>text</div>
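To make the regex side of that comparison concrete, here is a small sketch under stated assumptions: the class name RegexPitfallDemo is hypothetical, the pattern is one possible variant of <p[^>]*>, and the input is a variation of the one above with a > placed inside the comment. Because the pattern stops at the first > it sees, part of the comment leaks into the output.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; shows a failure mode, not a recommended approach.
public static class RegexPitfallDemo
{
    // Replaces each opening <p ...> tag with a bare <p>, stopping at the first ">".
    public static string StripPAttributes(string html)
    {
        return Regex.Replace(html, @"<p(\s[^>]*)?>", "<p>", RegexOptions.IgnoreCase);
    }

    // Input with a ">" inside the comment that sits inside the <p> tag.
    public static string Demo()
    {
        var input = "<div><p class=\"ok\" <!-- a > b -->>text</p></div>";
        // The match ends at the ">" inside the comment, so " b -->>" is left behind.
        return StripPAttributes(input);
    }
}
```

Calling RegexPitfallDemo.Demo() returns <div><p> b -->>text</p></div>: the tail of the comment ends up as visible text inside the paragraph.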
Eatos
  • Be careful when using regex for HTML, see https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Progman Feb 09 '20 at 21:48
  • Agree. I have tried that so many times; everyone should try it to learn the hard way. 1) XML/HTML should be parsed using streams, whereas a regex needs to scan the entire document, 2) it is far more readable than using magic strings like the one I presented, 3) you still have examples like `<div><p class="ok" <!-- comment-->>text</p></div>`, which HTML AP handles well while a regex will fail. I'm aware of that. – Eatos Feb 09 '20 at 22:03
  • That example worked, I was missing the `.WriteTo()` method, thank you. – Patrick Feb 09 '20 at 22:49