I have an editing window that lets my authorized users enter HTML, which is stored in a database when they click submit. Unfortunately, what they enter sometimes looks like this:

<ul class="controls-buttons">
         <li class="sep"></li>
     <li id="home">
<a title="Home" <a="" data-href="x">xx</a></li>
      </ul>

Is there a way to check the HTML string before storing it in the database, to ensure it is valid HTML markup? For example, note the second <a opened inside the first <a tag above.

2 Answers

2

You can load the fragment into the HTML Agility Pack (an HTML parser). The source download contains many example projects showing usage.

After loading the fragment, check the document's ParseErrors property: if it contains no entries, the parser found no problems with the fragment.
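
As a rough illustration of the approach described above, here is a minimal sketch. It assumes the HtmlAgilityPack NuGet package is referenced; the `FragmentValidator` class and its method names are hypothetical helpers made up for this example and are not part of the library.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // third-party package, not part of the .NET Framework

// Hypothetical helper class for illustration only.
public static class FragmentValidator
{
    // Loads the fragment and returns one message per error the parser reported.
    public static List<string> GetParseErrors(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.ParseErrors
            .Select(e => string.Format(
                "Line {0}, column {1}: {2} ({3})",
                e.Line, e.LinePosition, e.Reason, e.Code))
            .ToList();
    }

    // Treats the fragment as acceptable only when the parser reported nothing.
    public static bool IsValid(string html)
    {
        return GetParseErrors(html).Count == 0;
    }
}
```

You could call `FragmentValidator.IsValid(userHtml)` before saving and show the messages from `GetParseErrors` to the user otherwise. The parser is fairly forgiving, so it is worth running it against the malformed sample from the question and other known-bad inputs to confirm it catches the cases you care about.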

Oded
  • Thanks, but I wonder if this is a bit more than I need. I am also concerned that it's not a Microsoft-supported library, and some of the comments I saw suggested people were having problems. Is there some other way, using a Microsoft library? All I really need to do is check that all the tags match up and that a user does not input something like my example. –  Dec 06 '12 at 11:24
  • @Anne - If you want to parse HTML, this is one of the best tools out there, and it is widely used. I'm not sure what the concern about it not being from Microsoft is; they are not the be-all and end-all of software. As for matching tags up, this is still a good solution for HTML. – Oded Dec 06 '12 at 12:01
  • I will look into this more. Thanks –  Dec 06 '12 at 12:02
-3

Your next best approach would be to use the Regex class in C#/.NET: write a regular expression that captures your requirements and validate the content against it.

c0D3l0g1c
  • Honestly that sounds almost impossible :-( –  Dec 06 '12 at 12:02
  • You really need to read [this](http://stackoverflow.com/a/1732454/1583) to see why regex is **not** suitable for validating unknown HTML. – Oded Dec 06 '12 at 12:02
  • While it is true that asking regexes to parse arbitrary HTML is like asking Paris Hilton to write an operating system, it's sometimes appropriate to parse a limited, known set of HTML. If you have a small set of HTML pages that you want to scrape data from and then stuff into a database, regexes might work fine. Given Anne's situation, this might be appropriate, as the HTML seems to be somewhat known. – c0D3l0g1c Dec 06 '12 at 12:56
  • My understanding of the situation is that users can type in arbitrary HTML. – Oded Dec 06 '12 at 15:07