Regex to replace single quote with single quote twice if it is inside

Question

I need a regular expression that replaces ' with '' when it appears inside an <xsl: tag; anywhere else, ' should remain as it is.
Code snippet:

public static void main(String[] args) {
        String replaceSingleQuoteInsideXsltCondition = "(<\\s*?xsl\\s*?:.*?=.*?)(')(.*?)(')(.*?>)";
        String dummyXSLT = "<p>Thank you for sending us <xsl:for-each select=\"catalog/cd[artist='Bob Dylan']\"> " +
                "paper's to prove your <span class=\"highlight\"><xsl:if test=\"D01 ='Y'\">Income</xsl:if></span> <span class=\"highlight\"><xsl:if test=\"D02 ='Y'\">&#160;and&#160;" +
                "</xsl:if></span><span class=\"highlight\"><xsl:if test=\"D03 ='Y'\">Citizenship and/or Identity</xsl:if></span>. " +
                "We need a little more information to finish your application. Addition of few words like 7 o'clock, employees' or employ's and child's and 'xyz and 'hello'</p>" +
                "contact number for inquiry = '478965152' and email id = 'pqr@xyz'" +
                "<xsl:template match=\"num[ . = 3 or . = 5]\"/></xsl:stylesheet><xsl:if test=\"contains($search, 'Web Developer') and (contains($expSearch, 'Computer') or contains($expSearch, 'Information') or contains($expSearch, 'Web' ))\">" +
                "<xsl:if test=\"((node/ABC!='') and (normalize-space(node/DEF)='') and (normalize-space(node/GHI)=''))\"> just a dummy sample.</xsl:if>";
        System.out.println(dummyXSLT.replaceAll(replaceSingleQuoteInsideXsltCondition, "$1''$3''$5"));
}

Actual Result by Above Code:

<p>Thank you for sending us <xsl:for-each select="catalog/cd[artist=''Bob Dylan'']"> paper's to prove your <span class="highlight"><xsl:if test="D01 =''Y''">Income</xsl:if></span> <span class="highlight"><xsl:if test="D02 =''Y''">&#160;and&#160;</xsl:if></span><span class="highlight"><xsl:if test="D03 =''Y''">Citizenship and/or Identity</xsl:if></span>. We need a little more information to finish your application. Addition of few words like 7 o'clock, employees' or employ's and child's and 'xyz and 'hello'</p>contact number for inquiry = '478965152' and email id = 'pqr@xyz'<xsl:template match="num[ . = 3 or . = 5]"/></xsl:stylesheet><xsl:if test="contains($search, ''Web Developer'') and (contains($expSearch, 'Computer') or contains($expSearch, 'Information') or contains($expSearch, 'Web' ))"><xsl:if test="((node/ABC!='''') and (normalize-space(node/DEF)='') and (normalize-space(node/GHI)=''))"> just a dummy sample.</xsl:if>

Expected Result:

<p>Thank you for sending us <xsl:for-each select="catalog/cd[artist=''Bob Dylan'']"> paper's to prove your <span class="highlight"><xsl:if test="D01 =''Y''">Income</xsl:if></span> <span class="highlight"><xsl:if test="D02 =''Y''">&#160;and&#160;</xsl:if></span><span class="highlight"><xsl:if test="D03 =''Y''">Citizenship and/or Identity</xsl:if></span>. We need a little more information to finish your application. Addition of few words like 7 o'clock, employees' or employ's and child's and 'xyz and 'hello'</p>contact number for inquiry = '478965152' and email id = 'pqr@xyz'<xsl:template match="num[ . = 3 or . = 5]"/></xsl:stylesheet><xsl:if test="contains($search, ''Web Developer'') and (contains($expSearch, ''Computer'') or contains($expSearch, ''Information'') or contains($expSearch, ''Web'' ))"><xsl:if test="((node/ABC!='''') and (normalize-space(node/DEF)='''') and (normalize-space(node/GHI)=''''))"> just a dummy sample.</xsl:if>

For something regex-related that is this simple, you don't need to post a thesis as an example. Show the before and after strings you want, the regex used, and the strings that give you problems. It's always better to start with a regex that does everything you need, because you have to use it as a reference when you try to break the regex up into pieces, which changes the entire scope. – , Mar 21 '17 at 19:31
@Sanjay could you add a one-word example, maybe (e.g. `computer` vs `''computer''` or something)? – Darshan Mehta, Mar 21 '17 at 19:57
I have added a section to explain the problem in simple terms. – Sanjay Madnani, Mar 22 '17 at 12:55
@Sanjay Madnani Out of curiosity (mostly), could you make use of one of the answers? – Yunnosch, Apr 07 '17 at 11:30

Yunnosch · Accepted Answer · 2017-04-02T21:33:43.753

I assume that it is OK to use two different regex replacements, one of them in a loop.
(The "g" modifier does not help.)

Here is the concept for a Java implementation of your use case:

- First replace all '' by '''', once but globally.
- Then replace (<xsl([^>']|'')+)'(([^>']|[^>']+'')+)'(([^'>])+) by \1''\3''\5, not globally but in a loop until it no longer replaces anything.
- If that works, the next step is to make it accept xsl and also XSL, and to allow the desired optional whitespace:
  (<\\s*(xsl|XSL)([^>']|'')+)'(([^>']|[^>']+'')+)'(([^'>])+)
  Note that the extra (xsl|XSL) group shifts the group numbers, so the replacement for this variant becomes \1''\4''\6.

I am no javaman (respectful pun intended), so I cannot offer a demonstrator in java.
Here is a demonstrator in sed (you do not need it; it just shows what I tested).
It implements the above concept and produces the desired output for the given sample input.

bash-3.1$ sed -En "1{s/''/''''/g;:a;s/(<xsl([^>']|'')+)'(([^>']|[^>']+'')+)'(([^'>])+)/\1''\3''\5/;ta;p};" input.txt > output.txt

The main trick is to look for something which does NOT occur in an already successfully replaced part and then replace while successful.
The secondary trick is to first replace everything which needs to be replaced, but already looks replaced (''-> '''').

Note:
While java and sed have potentially different regex flavors, I don't see anything which obviously conflicts, when comparing your regex with mine. Mine does not even contain any \s \d \w or similar.
You might have to use your $1''$3''$5 instead of my \1''\3''\5.
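
Since the answer stops short of a Java demonstrator, here is a rough sketch of how the two-step concept might look in Java. It is not part of the original answer: the class name `XslQuoteLoopDemo` and the method name `doubleQuotesInXslTags` are made up for illustration, and it uses the case-sensitive step-2 pattern exactly as given above, with `$1''$3''$5` as the replacement. It assumes, as the sample input does, that the quotes in a condition are balanced and that each closing quote is followed by at least one more character before the `>`.

```java
import java.util.regex.Pattern;

class XslQuoteLoopDemo {
    // Step-2 pattern from the answer. Groups 1, 3 and 5 hold the text
    // before, between and after one not-yet-doubled pair of quotes.
    private static final Pattern QUOTED_IN_XSL = Pattern.compile(
            "(<xsl([^>']|'')+)'(([^>']|[^>']+'')+)'(([^'>])+)");

    // Hypothetical helper name; not from the original answer.
    static String doubleQuotesInXslTags(String input) {
        // Step 1: double every existing '' once, globally
        // (this also affects '' outside xsl tags, as in the answer).
        String current = input.replace("''", "''''");

        // Step 2: double one remaining quote pair per pass,
        // and stop as soon as a pass changes nothing.
        String previous;
        do {
            previous = current;
            current = QUOTED_IN_XSL.matcher(previous).replaceFirst("$1''$3''$5");
        } while (!current.equals(previous));
        return current;
    }

    public static void main(String[] args) {
        String sample = "<xsl:if test=\"a='X' and b='Y'\">it's</xsl:if>";
        // prints: <xsl:if test="a=''X'' and b=''Y''">it's</xsl:if>
        System.out.println(doubleQuotesInXslTags(sample));
    }
}
```

The loop relies on the point made above: an already doubled `''` can only be consumed as a unit by the `''` alternatives, so a finished pair never matches again and the loop terminates. The nested quantifiers can backtrack heavily on long or unbalanced input, so this is better suited to short tags like the ones in the sample.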

score 0 · Answer 2 · edited May 23 '17 at 12:34

This is impossible if you allow arbitrary nesting of elements within the <xsl> </> tags. See RegEx match open tags except XHTML self-contained tags.

You could design a regex for this particular case, but not for every possible case.

whaleberg · answered Mar 27 '17 at 18:41 · edited May 23 '17 at 12:34 (Community)

I don't have to do any parsing. I just need a regex to create a DBCR for XSLT. There is only one simple rule: I have to replace `'` with `''` in all the condition checks, and everywhere else I will replace `'` with `'||'''`. That's it. If you look at my code, you will see that I am almost getting the proper result. Only where a condition check has an `and` or `OR` operator is my result not as expected. – Sanjay Madnani Mar 29 '17 at 19:15

score 0 · Answer 3 · 2017-03-27T23:04:04.790

If you are just parsing the TAGS this works.
If you are trying to interpret HTML closure, it can't be done with Java
regex.

The basic idea is that you can't just parse xsl tags. All tags must be parsed
to advance the match position and go past tags that may hide html.

So, all tags must be parsed.
In the regex below, Capture Group 2 contains the xsl tags you want to find.

All tags will be matched. You can ignore those and just look for when
capture group 2 has length. That is the one you want to manipulate.

What we do is a Replace All with a Callback.

Inside the callback:

- If capture group 2 did not match (i.e. has no length), just return the contents of capture group 0 (the match). This simply replaces the match with itself. These are the other tags.
- If capture group 2 did match, copy group 2 to a string and run another regex replace on that string (its contents). That would be a global Find (?<!')'(?!') Replace ''. Return that string as the replacement in the callback.

That's all there is to it.
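
To make the callback idea concrete, here is a minimal Java sketch. It is an illustration under stated assumptions, not the answer's implementation: `ANY_TAG` is a much simpler stand-in for the full regex (it only knows about quoted attribute values, and the xsl: capture is group 1 here rather than group 2), and the class and method names are hypothetical. The inner find pattern `(?<!')'(?!')` is taken from the steps above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class XslTagCallbackDemo {
    // Simplified stand-in (an assumption, not the answer's regex): matches
    // any <...> tag and captures the inside of xsl: tags in group 1.
    private static final Pattern ANY_TAG = Pattern.compile(
            "<(?:(xsl:(?:\"[^\"]*\"|'[^']*'|[^>\"'])*)"
            + "|(?:\"[^\"]*\"|'[^']*'|[^>\"'])*)>");

    // Inner pattern from the answer: a quote with no quote on either side.
    private static final Pattern LONE_QUOTE = Pattern.compile("(?<!')'(?!')");

    // Hypothetical helper name; plays the role of "replace all with a callback".
    static String doubleQuotesInXslTags(String input) {
        Matcher tag = ANY_TAG.matcher(input);
        StringBuffer out = new StringBuffer();
        while (tag.find()) {
            String xslBody = tag.group(1);
            String replacement = (xslBody == null)
                    ? tag.group()   // some other tag: put it back unchanged
                    : "<" + LONE_QUOTE.matcher(xslBody).replaceAll("''") + ">";
            // quoteReplacement keeps any $ or \ in the tag text literal.
            tag.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        tag.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "<p>it's</p><xsl:if test=\"a='X' and b=''\">ok</xsl:if>";
        // prints: <p>it's</p><xsl:if test="a=''X'' and b=''">ok</xsl:if>
        System.out.println(doubleQuotesInXslTags(sample));
    }
}
```

Because of the lookarounds, an empty-string literal `''` is left as it is, whereas the expected result in the question shows it as `''''`. Replacing the inner step with a plain `xslBody.replace("'", "''")` would double those as well.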

Hold on to yourself now.
This is the regex.

(Feel free to make this case insensitive if you want)

Expanded

 <
 (?:
      (?:
           (?:
                # Invisible content; end tag req'd
                (                             # (1 start)
                     script
                  |  style
                     #|  head
                  |  object
                  |  embed
                  |  applet
                  |  noframes
                  |  noscript
                  |  noembed 
                )                             # (1 end)
                (?:
                     \s+ 
                     (?>
                          " [\S\s]*? "
                       |  ' [\S\s]*? '
                       |  (?:
                               (?! /> )
                               [^>] 
                          )?
                     )+
                )?
                \s* >
           )

           [\S\s]*? </ \1 \s* 
           (?= > )
      )

   |  (?: /? [\w:]+ \s* /? )

   |  (                             # (2 start), The xsl: we want to find
           xsl: [\w:-]* 
           \s+ 
           (?:
                " [\S\s]*? " 
             |  ' [\S\s]*? ' 
             |  [^>]? 
           )+
           \s* /?
      )                             # (2 end)
   |  (?:
           [\w:]+ 
           \s+ 
           (?:
                " [\S\s]*? " 
             |  ' [\S\s]*? ' 
             |  [^>]? 
           )+
           \s* /?
      )
   |  \? [\S\s]*? \?
   |  (?:
           !
           (?:
                (?: DOCTYPE [\S\s]*? )
             |  (?: \[CDATA\[ [\S\s]*? \]\] )
             |  (?: -- [\S\s]*? -- )
             |  (?: ATTLIST [\S\s]*? )
             |  (?: ENTITY [\S\s]*? )
             |  (?: ELEMENT [\S\s]*? )
           )
      )
 )
 >

Final note - To see how effective and quick this regex is,
get a large HTML source file and run a global find and replace with the empty string.
You will then see all the content, totally stripped of HTML.
