
I need to validate a web page's markup programmatically, and I heard it's possible using regular expressions. If so, how? Is there any other way (other than using the W3C service)?

cdeszaq
Cassini
  • 4
    hehe reminded me of my favourite question: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags . – Orkun Feb 20 '12 at 17:57
  • It's definitely not a problem to be solved with regular expressions (see Zortkun's link), but obviously it _is_ possible in other ways or the w3c's service wouldn't exist... – nnnnnn Feb 21 '12 at 02:51
  • 1
    What's wrong with using the W3C service? It's authoritative and can be [queried programmatically](http://validator.w3.org/docs/users.html#Calling). – collapsar Mar 01 '12 at 13:37

1 Answer


Use HTML Tidy (http://tidy.sourceforge.net/). It reports on the validity of an HTML document and can also attempt to clean it up automatically. You can run it as a command-line application and script it. There are ports or wrappers for it in Java, Perl, and Python.
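As a rough illustration of scripting the command-line tool, here is a minimal Python sketch. It assumes a `tidy` executable is on the `PATH`; the helper name `tidy_report` is made up for this example. It pipes the markup to Tidy on stdin with `-q -e` (quiet, report errors and warnings only) and reads the diagnostics from stderr. Tidy's exit status is conventionally 0 for no problems, 1 for warnings, and 2 for errors.

```python
import shutil
import subprocess


def tidy_report(html, tidy_cmd="tidy"):
    """Run HTML Tidy in report-only mode on a string of markup.

    Returns (exit_code, messages), or None if the tidy executable
    cannot be found. `tidy_report` is a hypothetical helper, not part
    of Tidy itself.
    """
    if shutil.which(tidy_cmd) is None:
        return None  # Tidy is not installed or not on PATH

    # -q: suppress non-essential output; -e: show only errors and warnings.
    # With no file argument, Tidy reads the document from stdin.
    proc = subprocess.run(
        [tidy_cmd, "-q", "-e"],
        input=html,
        capture_output=True,
        text=True,
    )
    # Diagnostics are written to stderr, one message per line.
    messages = [line for line in proc.stderr.splitlines() if line.strip()]
    return proc.returncode, messages


if __name__ == "__main__":
    result = tidy_report("<p>unclosed <b>bold</p>")
    if result is None:
        print("tidy not found on PATH")
    else:
        code, messages = result
        print("exit status:", code)
        for message in messages:
            print(message)
```

Treating a non-zero exit status as "needs attention" is usually enough for a build or CI check; the message lines carry the line and column of each problem.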

I also use the TagSoup library for Java (http://ccil.org/~cowan/XML/tagsoup/). It does a great job of cleaning up badly formed HTML into well-formed XML.

Stephen Ostermiller