Exclude a pattern when it is inside another pattern

Question

I was trying to find a regex for my requirement, but I couldn't find one. If anyone has come across this, please help me.

For example, if an HTML comment is inside a JSP comment, don't touch it; otherwise, convert it to a JSP comment.

Condition 1: replace
<!-- normal HTML comment -->

with

<%-- normal HTML comment --%>

But do not match the HTML comments inside the JSP comments as below.

Condition 2:
<%-- normal JSP comment 

     <!-- inside html comment here -->
      other comment stuff
 <!-- another inside html comment here -->

--%>

A Java solution would be much appreciated.

"I was trying to find a regex for my requirement, but I couldn't find one." - that's an example of [why regular expressions aren't a good fit for non-regular problem domains like HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) etc. You might be able to create an expression that handles conditions 1 and 2 (and it might get quite complex) and then you encounter condition 3 etc. Better use a parser that understands the problem domain (JSP code in your case). — Thomas, Aug 02 '16 at 16:17
You're going to need more than that if this is a mix of html and JSP. Is it a mix? — , Aug 02 '16 at 16:42
@kakurala - If it is it would be a little tricky with regex. — , Aug 02 '16 at 17:29
You're never going to get a bulletproof regex solution, eg `String foo = "<%-- what now?";` — Bohemian, Aug 02 '16 at 17:30
@Bohemian Since we're dealing with comments -> comments, the regex would not change the behavior of the code whatsoever. A syntax error would be caught by the runtime like always. It's also an advantage that comment syntaxes can't be arbitrarily nested. — 4castle, Aug 02 '16 at 17:34
@4castle I don't understand the relevance of anything you said. What behaviour? What syntax error? The code in my comment is valid. My point is that dealing with all eventualities, like quoted delimiters (which are not to be considered), greatly complicate the solution. — Bohemian, Aug 02 '16 at 17:38
Ah, I see now. I'm adding string literals to my list of contexts to ignore. — 4castle, Aug 02 '16 at 17:41
@4castle What about `String foo = "\"<%-- what now?";`. Or `String foo = "\"Oh my\" \"<%-- what now?\"";`. It's an impossible task. For every regex you create, I can create an example that you can't handle. It's nothing against you - it's just not solvable by regex. You need a *parser* that understands the syntax of the input and particularly nesting. It is likely acceptable to ignore such edge cases IRL, but I said "bulletproof". — Bohemian, Aug 02 '16 at 18:29
@Bohemian I already accounted for escaped double quotes in my regex. It works fine for those inputs too. — 4castle, Aug 02 '16 at 18:33
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/118947/discussion-between-4castle-and-bohemian). — 4castle, Aug 02 '16 at 19:03

4castle · Accepted Answer · 2016-08-04T06:20:14.043


When trying to match something that isn't in context "X" or context "Y", I always use the formula from The Greatest Regex Trick Ever. The trick is to put what you want in a capture group on the rightmost side of an alternation, and to put every context you don't want on the left-hand side. The unwanted contexts still match, which consumes them, but only the wanted one fills the capture group.

In addition, the regex needs to ignore string literals. Your regex would look like:

".*?(?<!\\)"|(?s)<%--.*?--%>|<!--(.*?)-->

The code then replaces a match only if the first capture group took part in it.

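import java.util.regex.Matcher;
import java.util.regex.Pattern;

String input = getJSPString();  // placeholder for however you load the JSP source

final Pattern p = Pattern.compile(
    "\".*?(?<!\\\\)\"|" +   // ignore string literals
    "(?s)<%--.*?--%>|" +    // ignore JSP comments
    "<!--(.*?)-->");        // capture HTML comments in group #1
Matcher m = p.matcher(input);
StringBuffer sb = new StringBuffer();
while (m.find()) {
    // group 1 is null when a string literal or JSP comment matched;
    // those matches are skipped and copied unchanged by the next append
    if (m.group(1) != null) {
        m.appendReplacement(sb, "<%--$1--%>");
    }
}
m.appendTail(sb);  // copy everything after the last replacement
String output = sb.toString();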
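
For anyone who wants to try this on the two conditions from the question, here is a minimal self-contained sketch of the same approach. The class name `HtmlToJspComments`, the `convert` method and the sample input are made up for illustration; only the pattern and the `appendReplacement` loop come from the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical demo class; the name and the sample input are illustrative only. */
final class HtmlToJspComments {

    // Ignored contexts go on the left; the wanted context is captured in group 1.
    private static final Pattern COMMENTS = Pattern.compile(
        "\".*?(?<!\\\\)\"|" +   // string literals: matched, never captured
        "(?s)<%--.*?--%>|" +    // JSP comments: matched, never captured
        "<!--(.*?)-->");        // HTML comments: body captured in group 1

    static String convert(String input) {
        Matcher m = COMMENTS.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Only matches from the last alternative set group 1.
            if (m.group(1) != null) {
                m.appendReplacement(sb, "<%--$1--%>");
            }
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String input =
            "<!-- normal HTML comment -->\n" +
            "<%-- normal JSP comment\n" +
            "     <!-- inside html comment here -->\n" +
            "--%>\n" +
            "<% String s = \"<!-- not a comment -->\"; %>\n";
        System.out.print(convert(input));
    }
}
```

With this input only the first line should change; the JSP comment and the string literal are consumed by the left-hand alternatives and copied through untouched. One caveat of the string-literal branch: it treats any pair of double quotes on one line as a literal, so an HTML comment sitting between two quotes in ordinary markup would be skipped as well.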


edited Aug 04 '16 at 06:20

answered Aug 02 '16 at 16:19


Let's say we have n such patterns; then the combination of conditions/expressions would get complicated. Is there any shortcut for this? – JavaHopper Aug 02 '16 at 16:20

@JavaHopper To ignore multiple contexts, just keep adding on to the left-hand side. So `notThis|orThis|evenThis|(WeWantThis)` – 4castle Aug 02 '16 at 16:21
Thanks for the trick @4castle, but here it is not working. Instead it is matching all HTML comments regardless of where they are. – kakurala Aug 02 '16 at 16:30
@kakurala That's intended. You have to inspect the capture groups. A normal `String#replace` won't do the job. – 4castle Aug 02 '16 at 16:37
@kakurala I've updated my answer to show how to do it. – 4castle Aug 02 '16 at 17:06
@4castle I think it'll break if an html comment has a jsp comment in it, wouldn't it? – kakurala Aug 02 '16 at 18:41
@kakurala No, JSP is a preprocessor. It doesn't actually run the HTML at all. – 4castle Aug 02 '16 at 18:43
Yes, I agree, but I am about to run this regex on JSP sources to replace HTML comments with JSP ones. – kakurala Aug 02 '16 at 19:07
Let me know how it goes! Maybe make a backup first, but from my testing I don't see any issue with it. – 4castle Aug 02 '16 at 19:08
It seems my code was reinventing the wheel. Now using [`appendReplacement`](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Matcher.html#appendReplacement-java.lang.StringBuffer-java.lang.String-) and [`appendTail`](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Matcher.html#appendTail-java.lang.StringBuffer-) – 4castle Aug 04 '16 at 06:21
@4castle I addressed the condition, if JSP comments are inside a HTML comment then replace the original HTML comment and then invalidate inside comments. – kakurala Oct 04 '16 at 14:50

score 0 · Answer 2 · 2016-08-08T14:54:29.633

You mention your source is an HTML mix, so I'll offer this variation
that removes any complications HTML tags may introduce.

With the addition of the atomic group and the \G anchor
there is little risk of stack overflow.

Replace with $1<%--$2--%>

Raw Regex:

\G((?><(?:script(?:\s+(?:"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])*?)+)?\s*>[\S\s]*?</script\s*|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:(?:(?:"[\S\s]*?")|(?:'[\S\s]*?'))|(?:[^>]*?))+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?)))|%--[\S\s]*?--%)>|(?!<!--[\S\s]*?-->)[\S\s])*)<!--([\S\s]*?)-->

Stringed Regex:

"\\G((?><(?:script(?:\\s+(?:\"[\\S\\s]*?\"|'[\\S\\s]*?'|(?:(?!/>)[^>])*?)+)?\\s*>[\\S\\s]*?</script\\s*|(?:/?[\\w:]+\\s*/?)|(?:[\\w:]+\\s+(?:(?:(?:\"[\\S\\s]*?\")|(?:'[\\S\\s]*?'))|(?:[^>]*?))+\\s*/?)|\\?[\\S\\s]*?\\?|(?:!(?:(?:DOCTYPE[\\S\\s]*?)|(?:\\[CDATA\\[[\\S\\s]*?\\]\\])|(?:ATTLIST[\\S\\s]*?)|(?:ENTITY[\\S\\s]*?)|(?:ELEMENT[\\S\\s]*?)))|%--[\\S\\s]*?--%)>|(?!<!--[\\S\\s]*?-->)[\\S\\s])*)<!--([\\S\\s]*?)-->"

Expanded/Formatted:

 \G                            # G anchor
 (                             # (1 start)
      (?>                           # Atomic group start

           <                             # Begin a Tag <, but not an html comment
           (?:
                script                        # Script
                (?:
                     \s+ 
                     (?:
                          " [\S\s]*? "
                       |  ' [\S\s]*? '
                       |  (?:
                               (?! /> )
                               [^>] 
                          )*?
                     )+
                )?
                \s* >
                [\S\s]*? </script \s* 
             |                              # or,
                (?:                           # Non-attribute
                     /? 
                     [\w:]+ 
                     \s* 
                     /? 
                )
             |                              # or,
                (?:                           # Attribute
                     [\w:]+ 
                     \s+ 
                     (?:
                          (?:
                               (?: " [\S\s]*? " )
                            |  (?: ' [\S\s]*? ' )
                          )
                       |  (?: [^>]*? )
                     )+
                     \s* 
                     /? 
                )
             |                              # or,
                \?                            # <? ?> form
                [\S\s]*? 
                \?
             |                              # or,
                (?:                           # Misc <! > forms
                     !
                     (?:
                          (?:
                               DOCTYPE
                               [\S\s]*? 
                          )
                       |  (?:
                               \[CDATA\[
                               [\S\s]*? 
                               \]\]
                          )
                       |  (?:
                               ATTLIST
                               [\S\s]*? 
                          )
                       |  (?:
                               ENTITY
                               [\S\s]*? 
                          )
                       |  (?:
                               ELEMENT
                               [\S\s]*? 
                          )
                     )
                )
             |                              # or,
                %-- [\S\s]*? --%              # JSP comment
           )
           >                             # End a Tag >

        |                              # or,
                                         # A character that does 
                                         # not begin a html comment
           (?! <!-- [\S\s]*? --> )
           [\S\s] 
      )*                            # Atomic group end, 0 to many times
 )                             # (1 end)

 <!--
 ( [\S\s]*? )                  # (2), Finally, the Html comment
 -->
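
A rough sketch of how this pattern might be applied from Java, assuming the stringed regex above is used as given (it is only split across several string pieces here for readability). The class and method names are hypothetical. Because `\G` pins each match to the end of the previous one, group 1 has to account for everything between two HTML comments, which is why the replacement puts `$1` back.

```java
import java.util.regex.Pattern;

/** Hypothetical wrapper; only the pattern and the replacement come from the answer. */
final class AnchoredCommentConverter {

    // The "Stringed Regex" from the answer, split at its top-level alternatives.
    private static final Pattern P = Pattern.compile(
        "\\G((?><(?:script(?:\\s+(?:\"[\\S\\s]*?\"|'[\\S\\s]*?'|(?:(?!/>)[^>])*?)+)?\\s*>[\\S\\s]*?</script\\s*"
        + "|(?:/?[\\w:]+\\s*/?)"
        + "|(?:[\\w:]+\\s+(?:(?:(?:\"[\\S\\s]*?\")|(?:'[\\S\\s]*?'))|(?:[^>]*?))+\\s*/?)"
        + "|\\?[\\S\\s]*?\\?"
        + "|(?:!(?:(?:DOCTYPE[\\S\\s]*?)|(?:\\[CDATA\\[[\\S\\s]*?\\]\\])|(?:ATTLIST[\\S\\s]*?)|(?:ENTITY[\\S\\s]*?)|(?:ELEMENT[\\S\\s]*?)))"
        + "|%--[\\S\\s]*?--%)>"
        + "|(?!<!--[\\S\\s]*?-->)[\\S\\s])*)"
        + "<!--([\\S\\s]*?)-->");

    static String convert(String input) {
        // $1 = everything skipped before the HTML comment, $2 = the comment body
        return P.matcher(input).replaceAll("$1<%--$2--%>");
    }

    public static void main(String[] args) {
        String input =
            "<%-- jsp <!-- inner --> --%>\n" +
            "<p>text</p>\n" +
            "<!-- outer -->\n";
        System.out.print(convert(input));
    }
}
```

Text after the last HTML comment never matches the pattern, but `replaceAll` copies the unmatched tail through, so nothing is lost there.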

What would cause the stack to overflow? I don't see any recursion. — 4castle, Aug 02 '16 at 19:02
@4castle - I don't know the facts too much other than Java regex implementation is of a recursive nature. I guess this includes backtracking. Google it, it's all over the place. — , Aug 03 '16 at 16:54
@sln, thanks for your efforts. I tried 4castle's version of code with little modifications and it is working for my requirement. — kakurala, Aug 08 '16 at 07:18
@kakurala - You're welcome. I'm going to leave this posted for someone who needs a solution. — , Aug 08 '16 at 14:56

score 0 · Answer 3 · answered Aug 02 '16 at 18:04


You could use this pattern

(<!(--(?:[^-]|-(?!->))*?--)>)(?!((?!<%--)[\s\S])*?--%>)

and replace it with <%$2%>
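
In Java this could look roughly like the sketch below. The class and method names are hypothetical; the pattern and the `<%$2%>` replacement are the ones given above. Group 2 holds the comment body together with both `--` delimiters, so wrapping it in `<%` and `%>` yields a JSP comment. The trailing negative lookahead rejects an HTML comment when a `--%>` follows before the next `<%--`, which is how a comment inside a JSP comment is recognised.

```java
import java.util.regex.Pattern;

/** Hypothetical wrapper around the lookahead-based pattern. */
final class LookaheadCommentConverter {

    private static final Pattern P = Pattern.compile(
        "(<!(--(?:[^-]|-(?!->))*?--)>)" +   // group 2: "--body--" of an HTML comment
        "(?!((?!<%--)[\\s\\S])*?--%>)");     // fail if "--%>" comes before the next "<%--"

    static String convert(String input) {
        return P.matcher(input).replaceAll("<%$2%>");
    }

    public static void main(String[] args) {
        String input =
            "<!-- outer -->\n" +
            "<%-- jsp <!-- inner --> --%>\n";
        System.out.print(convert(input));
    }
}
```

Note that this variant makes no attempt to skip string literals, so a quoted `<!-- ... -->` would be rewritten too.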


alpha bravo

