7

I have a C# application that receives an HTML file. I want to parse and validate it. As output, it should return either a list of errors or a confirmation that the HTML is valid.

Does anyone have any idea how I can do this?

Jeroen
Jeff Norman
  • possible duplicate of [What is the best way to parse html in C#?](http://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c) – Pranay Rana Oct 04 '10 at 09:11
  • 2
    The validation part of this question makes it quite distinct from questions about simply parsing HTML. – Quentin Oct 04 '10 at 09:15
  • That's right, I'm not interested in parsing HTML; I'm interested in validating it for possible errors. – Jeff Norman Oct 04 '10 at 09:32

3 Answers

11

I'd run a local instance of the W3C Markup Validation Service and communicate with it via its API.
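A rough sketch of what talking to such a local instance could look like from C#. The URL in `CheckUrl` is an assumption (it depends on how the validator is installed), and the class and method names are hypothetical. It posts the markup as the `fragment` field, asks for `output=soap12`, and reads only the summary headers (`X-W3C-Validator-Status`, `X-W3C-Validator-Errors`, `X-W3C-Validator-Warnings`); the individual error messages would have to be read from the SOAP response body, which is not parsed here.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

// Hypothetical container for the summary headers returned by the validator.
public sealed class ValidatorSummary
{
    public string Status { get; set; }   // e.g. "Valid", "Invalid" or "Abort"
    public int Errors { get; set; }
    public int Warnings { get; set; }
}

public static class LocalW3CValidator
{
    // Assumption: adjust to wherever the local instance is actually served.
    public const string CheckUrl = "http://localhost/w3c-validator/check";

    // Posts the markup to the validator and returns the header summary.
    public static async Task<ValidatorSummary> ValidateAsync(HttpClient client, string html)
    {
        using (var form = new MultipartFormDataContent())
        {
            form.Add(new StringContent(html), "fragment");
            form.Add(new StringContent("soap12"), "output");
            using (var response = await client.PostAsync(CheckUrl, form))
            {
                return ReadSummary(response.Headers);
            }
        }
    }

    // Pure helper: extracts status and counts from the response headers.
    public static ValidatorSummary ReadSummary(HttpResponseHeaders headers)
    {
        return new ValidatorSummary
        {
            Status = First(headers, "X-W3C-Validator-Status") ?? "Unknown",
            Errors = ParseCount(First(headers, "X-W3C-Validator-Errors")),
            Warnings = ParseCount(First(headers, "X-W3C-Validator-Warnings"))
        };
    }

    private static string First(HttpResponseHeaders headers, string name)
    {
        IEnumerable<string> values;
        return headers.TryGetValues(name, out values) ? values.FirstOrDefault() : null;
    }

    // Missing or non-numeric header values are treated as zero.
    private static int ParseCount(string value)
    {
        int n;
        return int.TryParse(value, out n) ? n : 0;
    }
}
```

Keeping `ReadSummary` separate from the HTTP call means the header handling can be exercised without a running validator.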

Quentin
3

You can use HTML Tidy. There is a wrapper for .NET called TidyManaged.

gcores
  • 1
    TidyManaged does not provide a functional DLL. – Jeff Norman Oct 08 '10 at 13:21
  • Some issues were filed about this, including one reporting that file output doesn't work at all (I confirmed it, even though it has apparently been patched already). The issues page links to a version by freethenation that works and requires libtidy32.dll and libtidy64.dll, so I followed gcores's link above and renamed the 32-bit and 64-bit versions accordingly. It took a while to figure out, so I thought I'd post it here. – person27 Mar 30 '17 at 02:01
1

There is an obscure DLL from framework version 1.0 (!), Microsoft.mshtml.dll, and it is the only way in the framework to work with the DOM. If the HTML is XHTML and valid XML, you can use the XML APIs; otherwise this is the only option.
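For the XHTML case, a minimal sketch using `XmlReader` might look like the following. The class and method names are hypothetical. Note that this only checks XML well-formedness, not validity against an (X)HTML DTD, and that `XmlReader` stops at the first error, so the list holds at most one entry. Because DTD processing is ignored here, named entities such as `&nbsp;` are also reported as errors.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class XhtmlCheck
{
    // Returns an empty list when the markup is well-formed XML,
    // otherwise a list containing the first parse error found.
    public static List<string> FindErrors(string markup)
    {
        var errors = new List<string>();
        var settings = new XmlReaderSettings
        {
            // Skip the DOCTYPE instead of fetching and applying the DTD.
            DtdProcessing = DtdProcessing.Ignore,
            XmlResolver = null
        };

        try
        {
            using (var reader = XmlReader.Create(new StringReader(markup), settings))
            {
                // Reading to the end forces the whole document to be parsed.
                while (reader.Read()) { }
            }
        }
        catch (XmlException ex)
        {
            errors.Add(string.Format("Line {0}, position {1}: {2}",
                ex.LineNumber, ex.LinePosition, ex.Message));
        }

        return errors;
    }
}
```

Calling `XhtmlCheck.FindErrors(File.ReadAllText(path))` and checking whether the result is empty gives a basic pass/fail answer; ordinary non-XHTML HTML (unclosed `<br>`, unquoted attributes) will fail this check even when it is valid HTML.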

Tim S. Van Haren
Aliostad
  • 1
    I'd be amazed if that were the *only* way to deal with the DOM. – Quentin Oct 04 '10 at 09:16
  • Hmm, explain to me how you can validate a very elaborate HTML file with XML. I thought about that too, and I don't think it's the best way. – Jeff Norman Oct 04 '10 at 09:37
  • In what framework? Nobody mentioned a framework. (Oh, and must we resort to name calling?) – Quentin Oct 04 '10 at 09:47
  • 3
    It's not so obscure; it's the PIA for Internet Explorer. It's not part of the framework, it's a COM interop library. Whether IE is a good validator for HTML is, ahem, debatable. – Hans Passant Oct 05 '10 at 02:18