0

I want to validate HTML tags and their contents using Java. The validation should make sure that all HTML tags are closed properly and that there are no mistakes inside the tags themselves. For example:

<div id="divIdvalue'></div>

or

<span id\="spanIdval" ,></span>

I need to catch this kind of error. While googling, I found this regular expression:

<(\"[^\"]*\"|'[^']*'|[^'\">])*>

But it does not check whether all the HTML tags are closed. How can I add that check to it?

My sample code is attached below. Please help me.

package com.test;

import java.util.regex.Pattern;

public class HtmlValidator {

    // Matches a single tag; quoted attribute values may contain '>'.
    private static final String HTML_TAG_PATTERN = "<(\"[^\"]*\"|'[^']*'|[^'\">])*>";

    // Compiled once at class load so the static validate() never sees a null pattern.
    private static final Pattern PATTERN = Pattern.compile(HTML_TAG_PATTERN);

    // Returns true only if the whole input matches the single-tag pattern.
    public static boolean validate(final String tag) {
        return PATTERN.matcher(tag).matches();
    }

    public static void main(String[] args) {
        String htmlStr = "<div> <p id=/'bb'>This is first paragraph. This is first paragraph. </p> <span id='spanId'>Yes this is spab</span></div>";

        System.out.println("htmlStr :- " + htmlStr);

        // Print the result instead of discarding it.
        System.out.println("valid :- " + validate(htmlStr));
    }

}
DEVOPS
  • Use a parser like JSoup. – Martijn Courteaux May 17 '14 at 18:01
  • Treat your HTML as XML; maybe you can use a parser like DOM to check its well-formedness. – Sajan Chandran May 17 '14 at 18:06
  • [Obligatory link](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). **NEVER DO THIS**. There are thousands of ways of doing this correctly. For example [this](https://www.owasp.org/index.php/OWASP_Java_HTML_Sanitizer_Project) or [this](http://jsoup.org/). – Boris the Spider May 17 '14 at 18:11
  • duplicate question http://stackoverflow.com/questions/4217801/a-html-validator-in-java – geddamsatish May 17 '14 at 19:25

2 Answers

1

If you really do need to parse HTML in pure Java, there are many open-source options available. However, I would instead recommend using the W3C validator to check your syntax, since the W3C will by definition be much more up to date on correct usage. Good luck with your project.
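
As a rough sketch of the "treat it as XML" idea from the comments, the XML parser that ships with the JDK can check well-formedness without any third-party library. The class and method names (`WellFormedCheck`, `isWellFormed`) are made up for illustration, and the sketch assumes XHTML-style input with a single root element: ordinary HTML such as an unclosed `<br>` or an `&nbsp;` entity would be rejected, so a real HTML parser is still the better fit for arbitrary pages.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: checks XML well-formedness only, not HTML validity.
class WellFormedCheck {

    // Returns true if the markup parses as well-formed XML.
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and keeps the parser
            // from printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            // Unclosed tags, mismatched quotes, stray characters, and so on.
            return false;
        }
    }

    public static void main(String[] args) {
        String good = "<div><p id='a'>text</p><span id='spanId'>more</span></div>";
        String badQuote = "<div id=\"divIdvalue'></div>";
        String unclosed = "<div><p>text</div>";

        System.out.println(good + " -> " + isWellFormed(good));
        System.out.println(badQuote + " -> " + isWellFormed(badQuote));
        System.out.println(unclosed + " -> " + isWellFormed(unclosed));
    }
}
```

Both broken examples from the question (the mismatched quote and the stray backslash and comma) fail this check, as does any tag that is never closed.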

nanogru
1
<(\"[^\"]*\"|'[^']*'|[^'\">])*>

matches a single tag, and

<(\"[^\"]*\"|'[^']*'|[^'\">])*>(.*<(\"[^\"]*\"|'[^']*'|[^'\">])*>)?

matches either a pair of tags or a single tag.

However, complex cases such as nested tags cannot be validated by a one-line regex.
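
To illustrate why nesting needs more than one regex, here is a minimal sketch that uses a variant of the tag pattern only to find tags and a stack to check that they close in the right order. Everything named here (`TagBalanceCheck`, `isBalanced`, the short `VOID_TAGS` list) is an assumption for the example, not a complete HTML rule set. It also checks nesting only: text that does not match the tag pattern is skipped, so attribute mistakes like the stray backslash in the question are not caught.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: checks tag nesting only, not attribute syntax.
class TagBalanceCheck {

    // Group 1: "/" for a closing tag. Group 2: tag name.
    // Group 3: attribute text, where quoted values may contain '>'.
    private static final Pattern TAG = Pattern.compile(
            "<(/?)([A-Za-z][A-Za-z0-9]*)((?:\"[^\"]*\"|'[^']*'|[^'\">])*)>");

    // A few elements that never take a closing tag (illustrative, incomplete).
    private static final Set<String> VOID_TAGS = new HashSet<>(
            Arrays.asList("br", "hr", "img", "input", "meta", "link"));

    // Returns true if every opened tag is closed in the reverse order.
    static boolean isBalanced(String html) {
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            boolean closing = !m.group(1).isEmpty();
            String name = m.group(2).toLowerCase(Locale.ROOT);
            boolean selfClosing = m.group(3).endsWith("/");
            if (closing) {
                // A closing tag must match the most recently opened tag.
                if (open.isEmpty() || !open.pop().equals(name)) {
                    return false;
                }
            } else if (!selfClosing && !VOID_TAGS.contains(name)) {
                open.push(name);
            }
        }
        // Anything left on the stack was never closed.
        return open.isEmpty();
    }

    public static void main(String[] args) {
        String[] samples = {
            "<div><p id='a'>text</p><span>more</span></div>",
            "<div><p>text</div>",
            "<div><br><img src='a>b.png'/></div>"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isBalanced(s));
        }
    }
}
```

The stack is what the one-line regex lacks: it remembers which tags are still open, so `<div><p>text</div>` is rejected even though each tag looks fine on its own.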

Alpha Huang