
I want to know whether an input HTML string is valid. I looked at various HTML parsers, but none of them offers a method for validating HTML. Jsoup comes closest to what I want, but it repairs the input and produces valid parsed HTML instead of reporting problems. Basically, I want to check that the HTML has a valid structure like the one below.

<html>
<head>~</head>
<body>~</body>
</html>

So, I wrote code in Java.

String html = "<html><head><title>asdf</title></Head><body>asfd</body></html>";
String compile = "(?i)<html.*>.*<head>.*?</head>.*<body>.*</body>.*</html>";
Pattern pattern = Pattern.compile(compile);
Matcher matcher = pattern.matcher(html);
if (matcher.matches()) {
    System.out.println("Valid html");
} else {
    System.out.println("Invalid html");
}

But if the HTML contains two <head> elements, it is still reported as valid. How can I check for a valid HTML structure efficiently?

gentlejo
  • Have you tried pushing it through an XML parser and making sure that an html, a head and a body element are present? – Steven Feb 07 '12 at 04:19
  • Naturally; http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454. If all you're looking for is a `<head>` and `<body>` inside an `<html>`, then just search for the `indexOf` each, starting from the end of the last thing found, and make sure they're all there in order. – Dave Newton Feb 07 '12 at 04:22
  • @Steven: Can you recommend an XML parser? – gentlejo Feb 07 '12 at 04:49
  • Yet another "How can I do X using Y?" Where Y is an entirely inappropriate tool to achieve X. – Andrew Thompson Feb 07 '12 at 05:03
  • possible duplicate of [Question about parsing HTML using Regex and Java](http://stackoverflow.com/questions/2394457/question-about-parsing-html-using-regex-and-java) – Brian Roach Feb 07 '12 at 05:12
  • Standard Java (not ancient versions) has its own XML parsers if you want to use them. See http://www.java-tips.org/java-se-tips/javax.xml.parsers/how-to-read-xml-file-in-java.html – Steven Feb 07 '12 at 05:18

1 Answer


How about using a library to do it? I recommend jsoup.
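If adding a dependency is not an option, the XML-parser idea from the comments can be sketched with the JDK alone. This is only a rough sketch, not a real HTML validator: it assumes the input is XHTML-like (well-formed XML), so ordinary HTML with unclosed tags, unquoted attributes, or mismatched tag case such as `<head>…</Head>` is rejected. The class name `HtmlStructureCheck` and the method `hasBasicStructure` are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, named for illustration only.
class HtmlStructureCheck {

    // Returns true only if the input is well-formed XML whose root element is
    // <html> and whose element children are exactly one <head> followed by
    // exactly one <body>. Any parse failure counts as "invalid".
    static boolean hasBasicStructure(String html) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            Element root = builder
                    .parse(new InputSource(new StringReader(html)))
                    .getDocumentElement();

            // Collect the names of the direct element children of the root,
            // ignoring text and comment nodes.
            List<String> names = new ArrayList<>();
            NodeList children = root.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    names.add(child.getNodeName().toLowerCase(Locale.ROOT));
                }
            }
            return root.getTagName().equalsIgnoreCase("html")
                    && names.equals(Arrays.asList("head", "body"));
        } catch (Exception e) {
            // Not well-formed (or otherwise unparseable): treat as invalid.
            return false;
        }
    }

    public static void main(String[] args) {
        String ok = "<html><head><title>asdf</title></head><body>asfd</body></html>";
        String twoHeads = "<html><head></head><head></head><body>asfd</body></html>";
        System.out.println(hasBasicStructure(ok));       // true
        System.out.println(hasBasicStructure(twoHeads)); // false
    }
}
```

Unlike the regex in the question, the duplicate `<head>` case fails here because the children of `<html>` are compared against the exact sequence `head`, `body`. The trade-off is strictness: anything that is not well-formed XML is reported as invalid, including the sample string from the question with its `</Head>` closing tag.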

George