What regular expression can be used with Java's replaceAll() method to remove every <p> tag, along with everything between its opening and closing tags, from an HTML string?
For example, after applying the method,
"<div><p>table <b>test</b> title</p><table><tbody><tr><td>this is table cell value</td></tr></tbody></table><p>miscellaneous contents</p><span>blah</span></div>"
becomes:
"<div><table><tbody><tr><td>this is table cell value</td></tr></tbody></table><span>blah</span></div>"
Note: This is an "academic" exercise. I am not seeking a solution that uses an XML/HTML parser.
UPDATE:
Getting closer to a solution on this (thanks, jlordo!)... Your pattern seems to work somewhat.
However, the suggested regex string ("<[pP]>.*?</[pP]>") does not appear to have any effect on a <p> tag that contains an attribute (in this case, a "style" attribute). See below:
public static void main(String[] args)
{
    String htmlstring = "<div><p style='text-align: center; font-style: italic'>[click the <b>submit</b> button to create the new company.]</p><table><tbody><tr><td>this is table cell value</td></tr></tbody></table><p>miscellaneous contents</p><span>blah</span></div>";
    htmlstring = htmlstring.replaceAll("<[pP]>.*?</[pP]>", "");
}
htmlstring (before scrubbing):
<div><p style='text-align: center; font-style: italic'>[click the <b>submit</b> button to create the new company.]</p><table><tbody><tr><td>this is table cell value</td></tr></tbody></table><p>miscellaneous contents</p><span>blah</span></div>
htmlstring (after scrubbing):
<div><p style='text-align: center; font-style: italic'>[click the <b>submit</b> button to create the new company.]</p><table><tbody><tr><td>this is table cell value</td></tr></tbody></table><span>blah</span></div>
Is there anything we can do to "tweak" it so that it handles this issue?
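One possible tweak, offered only as a sketch: allow optional attributes after the tag name with `[^>]*`, and use a word boundary so that tags such as <pre> are not caught. The class name `StripParagraphs` and the helper `stripParagraphs` are made up for illustration. The pattern assumes that attribute values never contain `>`, that every <p> has an explicit closing tag, that <p> tags are not nested, and that nothing resembling a <p> tag appears inside comments or script bodies.

```java
final class StripParagraphs {

    // (?i)    case-insensitive, so <P> and <p> both match
    // (?s)    let '.' match line terminators inside the paragraph body
    // <p\b    the tag name followed by a word boundary (skips <pre>, <param>)
    // [^>]*   optional attributes; ASSUMES no '>' inside attribute values
    // .*?     lazily consume the contents up to the nearest closing tag
    // </p\s*> the closing tag, tolerating whitespace before '>'
    static final String P_ELEMENT = "(?is)<p\\b[^>]*>.*?</p\\s*>";

    // Hypothetical helper: removes each <p>...</p> element from the string.
    static String stripParagraphs(String html) {
        return html.replaceAll(P_ELEMENT, "");
    }

    public static void main(String[] args) {
        String htmlstring = "<div><p style='text-align: center; font-style: italic'>"
                + "[click the <b>submit</b> button to create the new company.]</p>"
                + "<table><tbody><tr><td>this is table cell value</td></tr></tbody></table>"
                + "<p>miscellaneous contents</p><span>blah</span></div>";
        System.out.println(stripParagraphs(htmlstring));
    }
}
```

If any of the stated assumptions fails (for example, a `>` inside a quoted attribute value), this pattern will cut in the wrong place, which is the usual reason a parser is preferred outside of an exercise like this one.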
[...] `<p>` tag will contain no attributes? If you can't, will the attributes be devoid of `>` symbols? What about the closing tag? Will there always even be a closing tag? What if the `<p>` runs into a table? Or another opening `<p>`? Can `<p>` tags be nested? There's a very good reason people don't use regex on HTML. Madness lies that way. Regular expressions are incredibly specific. There is no "Everything that looks like `<p>`". There is only `/<\s*p(\s+\w+\s*=\s*("|')((?!\2).|\\\2)*\2)*\s*>/`
– FrankieTheKneeMan Apr 18 '13 at 22:37
[...] `<p>` tags but are not because they occur inside other tokens like comments, or in the body of special tags like [...]
– Mike Samuel Apr 18 '13 at 23:06
[...] "<p>" tag using java api?
– sairn Apr 18 '13 at 23:10
[...] "<p>" tags. The html we use will validate prior to attempting the "replaceAll". thx.
– sairn Apr 18 '13 at 23:17