Possible Duplicate:
How to remove HTML tag in Java
RegEx match open tags except XHTML self-contained tags

I want to remove a specific HTML tag along with its content.

For example, if the HTML is:

<span style='font-family:Verdana;mso-bidi-font-family:
"Times New Roman";display:none;mso-hide:all'>contents</span>

If the tag contains "mso-*", the whole tag must be removed (opening tag, closing tag, and content).

Elyess Abouda
  • Personally, I'd use an HTML parser. – Dave Newton Jan 02 '13 at 15:01
  • possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) and [how-to-remove-html-tag-in-java](http://stackoverflow.com/questions/1699313/how-to-remove-html-tag-in-java) – CoolBeans Jan 02 '13 at 15:02
  • Haven't these types of questions been asked many times here? – Buhake Sindi Jan 02 '13 at 15:42

1 Answer

As Dave Newton pointed out in his comment, an HTML parser is the way to go here. If you really want to do it the hard way, here's a regex that works for the example in the question:

    String html = "FOO<span style='font-family:Verdana;mso-bidi-font-family:"
        + "\"Times New Roman\";display:none;mso-hide:all'>contents</span>BAR";
    // regex matches every opening tag that contains 'mso-' in an attribute name
    // or value, the contents and the corresponding closing tag
    String regex = "<(\\S+)[^>]+?mso-[^>]*>.*?</\\1>";
    String replacement = "";
    System.out.println(html.replaceAll(regex, replacement)); // prints FOOBAR
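
The regex above can miss cases the sample input does not exercise: `.` does not match line breaks by default, and `\S+` may capture more than the tag name. The following is a minimal sketch of a slightly more defensive variant, not a replacement for a real parser. The class name `MsoTagStripper` and the method `strip` are hypothetical names chosen for illustration, and the sketch assumes that tag names consist of word characters and that a matching element is never nested inside another element of the same name.

```java
import java.util.regex.Pattern;

final class MsoTagStripper {
    // Assumptions (not from the original answer): tag names are plain word
    // characters, and matching elements are not nested in same-named elements.
    // DOTALL lets ".*?" span line breaks; CASE_INSENSITIVE accepts <SPAN> too.
    private static final Pattern MSO_ELEMENT = Pattern.compile(
        "<(\\w+)\\b[^>]*?mso-[^>]*>.*?</\\1\\s*>",
        Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Removes every element whose opening tag contains "mso-",
    // including its content and its closing tag.
    static String strip(String html) {
        return MSO_ELEMENT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "FOO<SPAN style='mso-hide:all'>line one\nline two</SPAN>BAR"
            + "<span style='color:red'>kept</span>";
        // prints FOOBAR<span style='color:red'>kept</span>
        System.out.println(strip(html));
    }
}
```

Nested elements with the same tag name (for example a `span` inside the removed `span`) will still be cut at the first closing tag, which is one more reason to prefer an HTML parser for anything beyond simple input.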

jlordo
  • And if the style attribute doesn't contain any `mso-` directive... maybe a more generalized regexp would be in order. – pap Jan 02 '13 at 15:26
  • @pap let me quote the OP: _If the tag contains "mso-*", it must remove the whole tag (opening, closing and content)._ My post answers his question, and I don't understand your comment. – jlordo Jan 02 '13 at 15:29
  • Indeed you are correct. Shame on me for not reading the question properly :) And I think you underestimate yourself, you seem to have understood my comment just fine, just that I was incorrect ;) – pap Jan 02 '13 at 15:31
  • @pap it was my polite way of saying, I think your comment is wrong ;) – jlordo Jan 02 '13 at 15:34