4

I need a regular expression that replaces ' with '' when it appears inside an <xsl: tag; anywhere else, ' should remain as it is.
Code snippet:

public static void main(String[] args) {
        String replaceSingleQuoteInsideXsltCondition = "(<\\s*?xsl\\s*?:.*?=.*?)(')(.*?)(')(.*?>)";
        String dummyXSLT = "<p>Thank you for sending us <xsl:for-each select=\"catalog/cd[artist='Bob Dylan']\"> " +
                "paper's to prove your <span class=\"highlight\"><xsl:if test=\"D01 ='Y'\">Income</xsl:if></span> <span class=\"highlight\"><xsl:if test=\"D02 ='Y'\">&#160;and&#160;" +
                "</xsl:if></span><span class=\"highlight\"><xsl:if test=\"D03 ='Y'\">Citizenship and/or Identity</xsl:if></span>. " +
                "We need a little more information to finish your application. Addition of few words like 7 o'clock, employees' or employ's and child's and 'xyz and 'hello'</p>" +
                "contact number for inquiry = '478965152' and email id = 'pqr@xyz'" +
                "<xsl:template match=\"num[ . = 3 or . = 5]\"/></xsl:stylesheet><xsl:if test=\"contains($search, 'Web Developer') and (contains($expSearch, 'Computer') or contains($expSearch, 'Information') or contains($expSearch, 'Web' ))\">" +
                "<xsl:if test=\"((node/ABC!='') and (normalize-space(node/DEF)='') and (normalize-space(node/GHI)=''))\"> just a dummy sample.</xsl:if>";
        System.out.println(dummyXSLT.replaceAll(replaceSingleQuoteInsideXsltCondition,  "$1''$3''$5"));
    }

Actual Result by Above Code:

<p>Thank you for sending us <xsl:for-each select="catalog/cd[artist=''Bob Dylan'']"> paper's to prove your <span class="highlight"><xsl:if test="D01 =''Y''">Income</xsl:if></span> <span class="highlight"><xsl:if test="D02 =''Y''">&#160;and&#160;</xsl:if></span><span class="highlight"><xsl:if test="D03 =''Y''">Citizenship and/or Identity</xsl:if></span>. We need a little more information to finish your application. Addition of few words like 7 o'clock, employees' or employ's and child's and 'xyz and 'hello'</p>contact number for inquiry = '478965152' and email id = 'pqr@xyz'<xsl:template match="num[ . = 3 or . = 5]"/></xsl:stylesheet><xsl:if test="contains($search, ''Web Developer'') and (contains($expSearch, 'Computer') or contains($expSearch, 'Information') or contains($expSearch, 'Web' ))"><xsl:if test="((node/ABC!='''') and (normalize-space(node/DEF)='') and (normalize-space(node/GHI)=''))"> just a dummy sample.</xsl:if>

Expected Result:

<p>Thank you for sending us <xsl:for-each select="catalog/cd[artist=''Bob Dylan'']"> paper's to prove your <span class="highlight"><xsl:if test="D01 =''Y''">Income</xsl:if></span> <span class="highlight"><xsl:if test="D02 =''Y''">&#160;and&#160;</xsl:if></span><span class="highlight"><xsl:if test="D03 =''Y''">Citizenship and/or Identity</xsl:if></span>. We need a little more information to finish your application. Addition of few words like 7 o'clock, employees' or employ's and child's and 'xyz and 'hello'</p>contact number for inquiry = '478965152' and email id = 'pqr@xyz'<xsl:template match="num[ . = 3 or . = 5]"/></xsl:stylesheet><xsl:if test="contains($search, ''Web Developer'') and (contains($expSearch, ''Computer'') or contains($expSearch, ''Information'') or contains($expSearch, ''Web'' ))"><xsl:if test="((node/ABC!='''') and (normalize-space(node/DEF)='''') and (normalize-space(node/GHI)=''''))"> just a dummy sample.</xsl:if>
Sanjay Madnani
  • For something regex-related that is this simple, you don't need to post a thesis as an example. Show the desired before and after strings, the regex used, and the strings that give you problems. It's always better to first write a regex that does everything you need, because you have to use that as a reference when you try to break the regex up into pieces, which changes the entire scope. –  Mar 21 '17 at 19:31
  • @Sanjay could you add a one-word example maybe (e.g. `computer` vs `''computer''` or something)? – Darshan Mehta Mar 21 '17 at 19:57
  • I have added a section to explain the problem in simple terms. – Sanjay Madnani Mar 22 '17 at 12:55
  • @Sanjay Madnani Out of curiosity (mostly), could you make use of one of the answers? – Yunnosch Apr 07 '17 at 11:30

3 Answers

1

I assume that it is OK to use two different regex replacements, one of them in a loop.
(The "g" modifier alone does not help.)

Here is the concept for a Java implementation of your use case:

  • first replace all '' by '''',
    once but globally
  • replace (<xsl([^>']|'')+)'(([^>']|[^>']+'')+)'(([^'>])+) by \1''\3''\5, not globally but in a loop until it does not replace anything anymore
  • if that works, the next step is to make it accept xsl and also XSL and also allow the desired optional whitespace
    (<\\s*(xsl|XSL)([^>']|'')+)'(([^>']|[^>']+'')+)'(([^'>])+)

I am no Javaman (respectful pun intended), so I cannot offer a demonstrator in Java.
Here is a demonstrator in sed (you do not need it; it just shows what I tested).
It implements the above concept and produces the desired output for the given sample input.

bash-3.1$ sed -En "1{s/''/''''/g;:a;s/(<xsl([^>']|'')+)'(([^>']|[^>']+'')+)'(([^'>])+)/\1''\3''\5/;ta;p};" input.txt > output.txt

The main trick is to look for something which does NOT occur in an already successfully replaced part and then replace while successful.
The secondary trick is to first replace everything which needs to be replaced, but already looks replaced (''-> '''').

Note:
While Java and sed potentially have different regex flavors, I don't see anything that obviously conflicts when comparing your regex with mine. Mine does not even contain any \s, \d, \w or similar.
In Java you might have to use your $1''$3''$5 instead of my \1''\3''\5.
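
The concept above could be sketched in Java roughly as follows. This is only an illustration under assumptions, not something the sed demonstrator was checked against: the class name XslQuoteLoop and the method name doubleQuotesInXslTags are made up for this example, and the pattern is the lowercase-only variant from the list above. Like the sed version, the first step doubles every existing '' in the whole input, including any that sit outside xsl tags.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the "double first, then loop" concept.
final class XslQuoteLoop {

    // Finds one not-yet-doubled '...' pair inside a tag starting with <xsl.
    // Groups 1, 3 and 5 are the text before, between and after the two quotes.
    private static final Pattern LONE_PAIR = Pattern.compile(
            "(<xsl([^>']|'')+)'(([^>']|[^>']+'')+)'(([^'>])+)");

    static String doubleQuotesInXslTags(String input) {
        // Step 1: once, globally, turn every existing '' into ''''.
        String current = input.replace("''", "''''");

        // Step 2: replace one pair at a time until nothing matches anymore.
        while (true) {
            Matcher m = LONE_PAIR.matcher(current);
            if (!m.find()) {
                return current;
            }
            current = m.replaceFirst("$1''$3''$5");
        }
    }

    public static void main(String[] args) {
        String sample = "<xsl:if test=\"contains($a, 'X') and contains($b, 'Y') and c=''\">it's</xsl:if>";
        System.out.println(doubleQuotesInXslTags(sample));
    }
}
```

The loop uses replaceFirst rather than replaceAll on purpose: each pass handles the leftmost remaining pair, and the already doubled quotes are then skipped by the ([^>']|'') part on the next pass.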

Yunnosch
0

This is impossible if you allow arbitrary nesting of elements within the <xsl> </> tags. See RegEx match open tags except XHTML self-contained tags.

You could design a regex for this particular case, but not for every possible case.

Community
whaleberg
  • I don't have to do any parsing. I need the regex just for creating a DBCR for XSLT. It has only one simple rule: I have to replace `'` with `''` in all the condition checks, and in all other places I will replace `'` with `'||'''`. That's it. If you look at my code, you will see that I am almost getting the proper result. Only where a condition check contains an `and` or `or` operator is my result not as expected. – Sanjay Madnani Mar 29 '17 at 19:15
0

If you are just parsing the TAGS this works.
If you are trying to interpret HTML closure, it can't be done with Java
regex.

The basic idea is that you can't just parse xsl tags. All tags must be parsed
to advance the match position and go past tags that may hide html.

So, all tags must be parsed.
In the regex below, Capture Group 2 contains the xsl tags you want to find.

All tags will be matched. You can ignore those and just look for when
capture group 2 has length. That is the one you want to manipulate.

What we do is a Replace All with a Callback.

Inside the callback:

  • If capture group 2 did not match (i.e. has no length)
    just return the contents of capture group 0 (the match).
    This just replaces with what matched. These are the other tags.

  • If capture group 2 did match, copy group 2 to a string
    and run another regex replace on that string (its contents).
    That would be a global Find (?<!')'(?!') Replace ''.
    Return that string as the replacement in the callback.

That's all there is to it.
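
As a rough illustration of the callback idea in Java, here is a minimal sketch. It makes several assumptions: the class and method names are invented for this example, and the tag pattern is a deliberately simplified stand-in for the full regex (it does not handle script blocks, comments, CDATA, or a > inside an attribute value), so the xsl tag content is in capture group 1 here rather than group 2. The inner replacement is the (?<!')'(?!') find described above, which by design leaves an already adjacent '' untouched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating "replace all with a callback".
final class XslQuoteCallback {

    // Simplified stand-in for the full tag regex: group 1 is set only for xsl: tags.
    private static final Pattern TAG = Pattern.compile("<(xsl:[^>]*)>|<[^>]*>");

    // A single quote that has no other single quote directly before or after it.
    private static final Pattern LONE_QUOTE = Pattern.compile("(?<!')'(?!')");

    static String doubleQuotesInXslTags(String input) {
        Matcher m = TAG.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement;
            if (m.group(1) == null) {
                // Some other tag: put back exactly what matched.
                replacement = m.group();
            } else {
                // An xsl: tag: double the lone quotes inside it.
                replacement = "<" + LONE_QUOTE.matcher(m.group(1)).replaceAll("''") + ">";
            }
            // quoteReplacement keeps $ and \ in the tag text from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "<p>it's</p><xsl:if test=\"contains($s, 'Y')\">x</xsl:if>";
        System.out.println(doubleQuotesInXslTags(sample));
    }
}
```

The appendReplacement/appendTail loop is used here because it works on older Java versions; on Java 9 and later the same callback could be passed to Matcher.replaceAll as a function.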

Hold on to yourself now.
This is the regex.

(Feel free to make this case insensitive if you want)

"<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\\s+(?>\"[\\S\\s]*?\"|'[\\S\\s]*?'|(?:(?!/>)[^>])?)+)?\\s*>)[\\S\\s]*?</\\1\\s*(?=>))|(?:/?[\\w:]+\\s*/?)|(xsl:[\\w:-]*\\s+(?:\"[\\S\\s]*?\"|'[\\S\\s]*?'|[^>]?)+\\s*/?)|(?:[\\w:]+\\s+(?:\"[\\S\\s]*?\"|'[\\S\\s]*?'|[^>]?)+\\s*/?)|\\?[\\S\\s]*?\\?|(?:!(?:(?:DOCTYPE[\\S\\s]*?)|(?:\\[CDATA\\[[\\S\\s]*?\\]\\])|(?:--[\\S\\s]*?--)|(?:ATTLIST[\\S\\s]*?)|(?:ENTITY[\\S\\s]*?)|(?:ELEMENT[\\S\\s]*?))))>"

Expanded

 <
 (?:
      (?:
           (?:
                # Invisible content; end tag req'd
                (                             # (1 start)
                     script
                  |  style
                     #|  head
                  |  object
                  |  embed
                  |  applet
                  |  noframes
                  |  noscript
                  |  noembed 
                )                             # (1 end)
                (?:
                     \s+ 
                     (?>
                          " [\S\s]*? "
                       |  ' [\S\s]*? '
                       |  (?:
                               (?! /> )
                               [^>] 
                          )?
                     )+
                )?
                \s* >
           )

           [\S\s]*? </ \1 \s* 
           (?= > )
      )

   |  (?: /? [\w:]+ \s* /? )

   |  (                             # (2 start), The xsl: we want to find
           xsl: [\w:-]* 
           \s+ 
           (?:
                " [\S\s]*? " 
             |  ' [\S\s]*? ' 
             |  [^>]? 
           )+
           \s* /?
      )                             # (2 end)
   |  (?:
           [\w:]+ 
           \s+ 
           (?:
                " [\S\s]*? " 
             |  ' [\S\s]*? ' 
             |  [^>]? 
           )+
           \s* /?
      )
   |  \? [\S\s]*? \?
   |  (?:
           !
           (?:
                (?: DOCTYPE [\S\s]*? )
             |  (?: \[CDATA\[ [\S\s]*? \]\] )
             |  (?: -- [\S\s]*? -- )
             |  (?: ATTLIST [\S\s]*? )
             |  (?: ENTITY [\S\s]*? )
             |  (?: ELEMENT [\S\s]*? )
           )
      )
 )
 >

Final note: to see how effective and quick this regex is,
take a large HTML source file and run a global find and replace with '' (an empty string).
You will then see all the content, totally stripped of HTML.