0

What regular expression can be used with the Java replaceAll() method to remove every <p> HTML tag, together with the contents between its opening and closing tags, from an HTML string?

For example, after applying the method,

"<div><p>table <b>test</b> title</p><table><tbody><tr><td>this is table cell value</td></tr></tbody></table><p>miscellaneous contents</p><span>blah</span></div>"

becomes:

"<div><table><tbody><tr><td>this is table cell value</td></tr></tbody></table><span>blah</span></div>"

Note: This is an "academic" exercise. I am not seeking a solution that uses an XML/HTML parser.


UPDATE:

Getting closer to a solution on this (thanks, jlordo!)... Your pattern seems to work somewhat...

However, the suggested regex string ("<[pP]>.*?</[pP]>") does not appear to have an effect on a <p> tag that contains an attribute (in this case, a "style" attribute). See below:

    public static void main(String[] args)
    {
        String htmlstring = "<div><p style='text-align: center; font-style: italic'>[click the <b>submit</b> button to create the new company.]</p><table><tbody><tr><td>this is table cell value</td></tr></tbody></table><p>miscellaneous contents</p><span>blah</span></div>";
        htmlstring = htmlstring.replaceAll("<[pP]>.*?</[pP]>", "");
    }

htmlstring (before scrubbing):

<div><p style='text-align: center; font-style: italic'>[click the <b>submit</b> button to create the new company.]</p><table><tbody><tr><td>this is table cell value</td></tr></tbody></table><p>miscellaneous contents</p><span>blah</span></div>

htmlstring (after scrubbing):

<div><p style='text-align: center; font-style: italic'>[click the <b>submit</b> button to create the new company.]</p><table><tbody><tr><td>this is table cell value</td></tr></tbody></table><span>blah</span></div>

Is there anything we can do to "tweak" it so that it handles this issue?

sairn
  • 4
    What have you tried? Also, have you read this: [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) ? – jlordo Apr 18 '13 at 22:27
  • This is awfully vague for an academic exercise. Can you always guarantee that the `<p>` tag will contain no attributes? If you can't, will the attributes be devoid of `>` symbols? What about the closing tag? Will there always even be a closing tag? What if the `<p>` runs into a table? Or another opening `<p>`? Can `<p>` tags be nested? There's a very good reason people don't use regex on HTML. Madness lies that way. Regular expressions are incredibly specific. There is no "Everything that looks like `<p>`". There is only `/<\s*p(\s+\w+\s*=\s*("|')((?!\2).|\\\2)*\2)*\s*>/` – FrankieTheKneeMan Apr 18 '13 at 22:37
  • Can your input contain messy HTML? The kind that fails to validate? Do you care about things that look like `<p>` tags but are not because they occur inside other tokens like comments, or in the body of special tags like `

    – Mike Samuel Apr 18 '13 at 23:06
  • Hi Frankie - Could you post how to escape the string you provided, so that it will work in Java? Also, could you provide a short/succinct solution to accomplish simple removal of a "<p>" tag using the Java API? – sairn Apr 18 '13 at 23:10
  • Hi Mike - we're not concerned about unclosed "<p>" tags. The html we use will validate, prior to attempting the "replaceAll". thx. – sairn Apr 18 '13 at 23:17
  • As an "academic" exercise, you're going to have to specify a set of constraints on the HTML that you'll accept, because without those constraints the answer is "It can't be done". Given arbitrary input, even with the constraint that it is validated HTML, a true parser is required; a regex is insufficient to accomplish this task. A formal proof of this statement is left as an exercise for the reader. – Stephen P Apr 19 '13 at 00:58

2 Answers

1

Try:

    htmlstring = htmlstring.replaceAll("(?i)<p.*?>.*?</p>", "");

Note that (?i) turns on the case-insensitive flag.
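As a rough check, the one-liner can be run against the string from the question's update. This is only a sketch: the class name `SimpleParagraphStripper` and the method `strip` are made up for illustration, and it assumes short, valid HTML with no newlines inside paragraphs.

```java
// Hypothetical wrapper class; only the regex itself comes from the answer.
class SimpleParagraphStripper {

    // (?i) makes the match case-insensitive; "<p.*?>" lazily consumes the
    // opening tag including any attributes; ".*?</p>" lazily consumes the
    // body up to the first closing tag.
    static String strip(String html) {
        return html.replaceAll("(?i)<p.*?>.*?</p>", "");
    }

    public static void main(String[] args) {
        String html = "<div><p style='text-align: center; font-style: italic'>"
                + "[click the <b>submit</b> button to create the new company.]</p>"
                + "<table><tbody><tr><td>this is table cell value</td></tr></tbody></table>"
                + "<p>miscellaneous contents</p><span>blah</span></div>";
        // Both paragraphs, with and without attributes, are removed.
        System.out.println(strip(html));
    }
}
```

One caveat worth knowing: because `<p.*?>` accepts any characters after the `p`, it also starts a match at tags such as `<pre>`, so an input like `<pre>code</pre><p>x</p>` is removed entirely.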

Evgeniy Dorofeev
  • BINGO! ...Thanks, Evgeniy!!! Knowing in advance the nature of the html string I am working with (short and basic) makes the one-liner replaceAll() regex solution the optimal choice. --The idea of creating some beastly method using a "xml parser" solution to perform the same simple operation seemed silly. Anyway, thanks again! – sairn Apr 19 '13 at 12:59
1
    Pattern.compile(
      // A start p tag.
      "<p(?![a-z0-9:\\-])([^>\"']|\"[^\"]*\"|'[^']*')*>"
      + ".*?"   // Phrasing content that does not handle comment, RCDATA or raw text boundaries
      // An end p tag
      + "</p(?![a-z0-9:\\-])[^>]*>",
      Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

The Pattern.DOTALL flag causes .*? to match newlines. This is necessary because your original regex would not match any paragraph that contained a newline in its body.
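The effect of the flag can be seen with a small experiment. This is a sketch using a deliberately simplified `<p>.*?</p>` pattern rather than the full one; the class and constant names are invented for the example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class comparing the same pattern with and without DOTALL.
class DotallDemo {
    static final Pattern WITHOUT_DOTALL =
            Pattern.compile("<p>.*?</p>", Pattern.CASE_INSENSITIVE);
    static final Pattern WITH_DOTALL =
            Pattern.compile("<p>.*?</p>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    static String stripWithoutDotall(String html) {
        return WITHOUT_DOTALL.matcher(html).replaceAll("");
    }

    static String stripWithDotall(String html) {
        return WITH_DOTALL.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // The paragraph body spans two lines.
        String html = "<div><p>line one\nline two</p></div>";
        // Without DOTALL, '.' stops at the newline, so nothing is removed.
        System.out.println(stripWithoutDotall(html));
        // With DOTALL, '.' crosses the newline and the paragraph is removed.
        System.out.println(stripWithDotall(html));
    }
}
```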

The Pattern.CASE_INSENSITIVE flag is specified without Pattern.UNICODE_CASE because it's unnecessary and I'm not confident that Turkish case-folding wouldn't create a subtle maintenance hazard were this regex modified to deal with <i>.

The ([^>"']|"[^"]*"|'[^']*') part matches any tag body character or quoted attribute. It will misbehave on certain non-validating attribute names like <p ain't-this=confusing>. The attribute grammar is regular, but a full treatment of quote characters in attribute values vs. names would hugely expand the size of this regex. It would also not likely help: anything requiring a full treatment has to deal with the fact that backticks can quote attributes on a few browsers, which means that no single regular expression can find value boundaries for arbitrarily messy HTML.

The (?![a-z0-9:\\-]) makes sure the name of the tag is "p" and not "plaintext" or "p-" or "p:foo" or some other HTML identifier of which "p" is a prefix.
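To see the lookahead in isolation, here is a small sketch that tests only the start of a tag. The class name, constant, and helper method are hypothetical; the regex fragment is the one discussed above.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the tag-name lookahead.
class TagNameLookaheadDemo {
    // "<p" must NOT be followed by a character that could continue the tag name.
    static final Pattern P_TAG_START =
            Pattern.compile("<p(?![a-z0-9:\\-])", Pattern.CASE_INSENSITIVE);

    // True if the given text begins with the start of a <p> tag.
    static boolean startsParagraph(String tag) {
        return P_TAG_START.matcher(tag).lookingAt();
    }

    public static void main(String[] args) {
        String[] samples = {"<p>", "<P>", "<p style='x'>", "<pre>", "<plaintext>", "<p-x>", "<p:foo>"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + startsParagraph(sample));
        }
    }
}
```

Only the first three samples are accepted; the rest are rejected because the character after `p` continues the identifier.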

This may misbehave on some constructs like:

  • <p><!-- </p> -->Not an orphaned end tag</p>
  • <p><textarea>Not a paragraph</p></textarea></p>
  • <noscript><p>Not a paragraph contextually</p></noscript>
  • <p ain't-this=confusing>Foo</p> <p>Isn't recognized as separate</p>.
  • <p><script>alert("Not a real </p> tag");</script></p>
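Two of these cases can be reproduced with a short sketch. It uses the pattern from this answer, with the single-quoted alternative written as `'[^']*'` (closing quote included, which is an assumption about the intended form); the class and method names are made up for the example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class showing where the regex cuts at the wrong </p>.
class ParagraphPatternLimits {
    static final Pattern P_ELEMENT = Pattern.compile(
            // A start p tag (single-quoted alternative closed: an assumption).
            "<p(?![a-z0-9:\\-])([^>\"']|\"[^\"]*\"|'[^']*')*>"
            + ".*?"
            // An end p tag.
            + "</p(?![a-z0-9:\\-])[^>]*>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    static String strip(String html) {
        return P_ELEMENT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // The </p> inside the comment is taken as the end tag,
        // so the tail of the paragraph is left behind.
        System.out.println(strip("<p><!-- </p> -->Not an orphaned end tag</p>"));
        // The </p> inside the script string literal has the same effect.
        System.out.println(strip("<p><script>alert(\"Not a real </p> tag\");</script></p>"));
    }
}
```

In both cases the lazy `.*?` stops at the first text that looks like an end tag, regardless of the comment or script context it sits in.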
Mike Samuel
  • Thanks for your help, Mike. And, also for the nice documentation - i.e., which I expect to refer to as I learn more about regex. For now, I've gone with the shorter lengthed solution from Evgeniy, simply because it suited my very narrow application requirement. -thx, again! – sairn Apr 19 '13 at 13:06
  • @saim, Good luck with your research. For reference, https://code.google.com/p/google-caja/source/browse/trunk/src/com/google/caja/plugin/html-sanitizer.js#505 deals with messy (and untrusted) HTML by splitting on special characters instead and then uses a series of smaller regexps to decompose into tags, comments, attributes, and the like. – Mike Samuel Apr 19 '13 at 15:14