I need to validate a web page's markup programmatically, and I heard it's possible using regular expressions. If so, how? Is there any other way (other than using the W3C service)?
- Hehe, this reminded me of my favourite question: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags . – Orkun Feb 20 '12 at 17:57
- It's definitely not a problem to be solved with regular expressions (see Zortkun's link), but obviously it _is_ possible in other ways or the W3C's service wouldn't exist... – nnnnnn Feb 21 '12 at 02:51
- What's wrong with using the W3C service? It's authoritative and can be [queried programmatically](http://validator.w3.org/docs/users.html#Calling). – collapsar Mar 01 '12 at 13:37
1 Answer
Use HTML Tidy (http://tidy.sourceforge.net/). It both reports on the validity of an HTML document and can attempt to clean it up automatically. You can run it as a command-line application and script it. There are ports or wrappers for it in Java, Perl, and Python.
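As a rough illustration of the scripting approach, here is a minimal Python sketch that shells out to the `tidy` executable. It assumes `tidy` is installed and on your `PATH`; the helper names (`build_tidy_command`, `classify_exit`, `check_html`) are my own and not part of Tidy. It relies on Tidy's `-q` (quiet) and `-e` (report errors and warnings only) options and on its exit status of 0, 1, or 2.

```python
import subprocess
import tempfile
from pathlib import Path

# Exit status used by HTML Tidy: 0 = no problems, 1 = warnings, 2 = errors.
TIDY_STATUS = {0: "ok", 1: "warnings", 2: "errors"}


def build_tidy_command(path, tidy_cmd="tidy"):
    """Build the argv for a report-only run: -q (quiet), -e (errors/warnings only)."""
    return [tidy_cmd, "-q", "-e", str(path)]


def classify_exit(code):
    """Map Tidy's exit status to a short label; anything else is 'unknown'."""
    return TIDY_STATUS.get(code, "unknown")


def check_html(html, tidy_cmd="tidy"):
    """Write the markup to a temp file, run Tidy on it, and return
    (status_label, list_of_diagnostic_lines).

    Raises FileNotFoundError if the tidy executable cannot be found.
    """
    with tempfile.TemporaryDirectory() as tmp:
        page = Path(tmp) / "page.html"
        page.write_text(html, encoding="utf-8")
        proc = subprocess.run(
            build_tidy_command(page, tidy_cmd),
            capture_output=True,
            text=True,
        )
    # With -e, Tidy writes its diagnostics to stderr.
    return classify_exit(proc.returncode), proc.stderr.splitlines()
```

You would call it as `status, messages = check_html(markup)` and treat a status of `"errors"` as a failed validation. The exact diagnostics depend on the Tidy version you have installed.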
I also use the TagSoup library for Java (http://ccil.org/~cowan/XML/tagsoup/). It does a great job of cleaning up badly formatted HTML into well-formed XML.

Stephen Ostermiller