I can match the string I'm interested in with a regex, but how do I replace it with a copy of the match in which one character has been substituted?

I want to remove the > character from inside any HTML attribute value, or replace it with &gt;.

Sample original string

<html> 
<head></head> 
<body> 
<div  sometag="abc>def" onclick="myfn()" class='xyz'>
Dear {@CustomerName},
blah blah blah
</div></body> 
</html>

Desired result

<html> 
<head></head> 
<body> 
<div  sometag="abc&gt;def" onclick="myfn()" class='xyz'>
Dear {@CustomerName},
blah blah blah
</div></body> 
</html>

I'm using the following regex pattern and replacement

Pattern: \s\w+\s*=\s*(['"])[^\1]+?\1

Replacement: -- don't know! what should I use? --

This is my VB.NET code (in case it helps)

Dim reAttr As New Regex("\s\w+\s*=\s*(['""])[^\1]+?\1", RegexOptions.Singleline)
result = reAttr.Replace(text, Replace("$&", ">", ""))
Pradeep Kumar

1 Answer

You can use

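Dim reAttr As New Regex("\s\w+\s*=\s*(['""])(?:(?!\1).)*?\1", RegexOptions.Singleline)
' Rewrite ">" only inside each matched attribute; text outside the matches is left untouched.
' Use "" as the second argument instead of "&gt;" to remove the character altogether.
Dim result = reAttr.Replace(text, New MatchEvaluator(Function(m As Match)
                                                         Return m.Value.Replace(">", "&gt;")
                                                     End Function))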

Note that [^\1] does not do what you expect: inside a character class, \1 is read as a character escape rather than a backreference, so the class matches any char except the SOH char (\x01). The tempered greedy token (?:(?!\1).)*? does what you wanted: it matches any char that is not the value captured in Group 1, zero or more times, as few times as possible.
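To see where the two patterns differ, here is a minimal sketch that runs both against an input with an empty attribute value. The module name, the constants, the FirstMatch helper and the sample string are made up for illustration and are not part of the original post. Because [^\1]+? needs at least one character and accepts a quote, it swallows the closing quote of the empty value and runs on into the next attribute.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module TemperedTokenDemo
    ' Pattern from the question: inside [...] the \1 is the char \x01, not a backreference
    Public Const ClassPattern As String = "\s\w+\s*=\s*(['""])[^\1]+?\1"
    ' Pattern from the answer: the lookahead (?!\1) really does refer to Group 1
    Public Const TemperedPattern As String = "\s\w+\s*=\s*(['""])(?:(?!\1).)*?\1"

    ' Hypothetical helper: returns the text of the first match, or "" if there is none
    Public Function FirstMatch(pattern As String, input As String) As String
        Return Regex.Match(input, pattern, RegexOptions.Singleline).Value
    End Function

    Sub Main()
        ' Illustrative input: an empty attribute followed by one that contains ">"
        Dim sample As String = "<div a="""" b=""x>y"">"
        Console.WriteLine("[" & FirstMatch(ClassPattern, sample) & "]")    ' prints [ a="" b="]
        Console.WriteLine("[" & FirstMatch(TemperedPattern, sample) & "]") ' prints [ a=""]
    End Sub
End Module
```

When every attribute value is non-empty, both patterns select the same text, which is likely why the original pattern looked correct.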

The MatchEvaluator is passed as the replacement argument: it is called once for each match, and inside it you can access the whole match value with m.Value.
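As a self-contained sketch of the same idea, the console program below applies the pattern to the div line from the question and substitutes &gt; as in the desired result. EscapeGtInAttributes and the module name are hypothetical names chosen for this example. A single-line lambda is used here instead of New MatchEvaluator(...), on the assumption that the compiler converts it to the delegate type.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module EscapeAttributesDemo
    ' Same pattern as in the answer, compiled once and reused
    Private ReadOnly AttrRegex As New Regex("\s\w+\s*=\s*(['""])(?:(?!\1).)*?\1", RegexOptions.Singleline)

    ' Hypothetical helper: escapes ">" inside quoted attribute values only
    Public Function EscapeGtInAttributes(text As String) As String
        ' The lambda runs once per match; m.Value is the whole attribute, e.g. ' sometag="abc>def"'
        Return AttrRegex.Replace(text, Function(m As Match) m.Value.Replace(">", "&gt;"))
    End Function

    Sub Main()
        Dim html As String = "<div  sometag=""abc>def"" onclick=""myfn()"" class='xyz'>"
        ' prints <div  sometag="abc&gt;def" onclick="myfn()" class='xyz'>
        Console.WriteLine(EscapeGtInAttributes(html))
    End Sub
End Module
```

The closing > of the tag survives because it lies outside every attribute match.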

Wiktor Stribiżew
    works perfectly! thank you so much. I wasted one full day on this. Also, the `[\1]` was selecting the part of string that I was expecting, so got confused with it. thanks for correcting that too. – Pradeep Kumar May 09 '20 at 13:44