Condition filtering within starting and ending point

Question

I have 2 strings as below:

test1 = "<div>/*abc*/</div>";
test2 = "<div>/*abc*/Contents/*efg*/</div>";

I need to remove every /*...*/ comment, and remove the div itself if it contains nothing but /*...*/. This is the regex I wrote:

Regex rx1 = new Regex(@"<div>/\*[^>]+\*/(</div>|<br/></div>|<br></div>)");
TemplateEditorFormatted = rx1.Replace(TemplateEditorFormatted, match => { return String.Empty; });

For test1 it returns the correct result: everything is removed.

But for test2 it also removes everything. The expected result is that nothing is removed.

UPDATED (for learning)

For test2, if I want to remove the /*...*/ comments but not the whole div, what would the regex look like?

Can anybody help? Thanks.

You shouldn't use regular expressions on HTML. Regular Expressions only work on regular languages, and HTML is a context-free language. It may work for very small specific examples, but it shouldn't be used because it will not work in general practice. — Tom Heard, Oct 23 '13 at 07:32
[Ask this guy about using regular expressions on HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) — Tom Heard, Oct 23 '13 at 09:25

score 1 · Answer 1 · answered Oct 23 '13 at 07:22

You're better off using negative lookahead assertions:

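@"<div>/\*(?:(?!\*/).)*\*/(</div>|<br/></div>|<br></div>)"
          ^^^^^^^^^^^^^

The part of interest is (?:(?!\*/).)*.

The (?:foo) is simply a non-capturing group; for now you can pretend it's just (foo).
The (?!bar) is known as a negative lookahead assertion: it matches if bar does not follow at the current position, and is a zero-width expression, i.e. it doesn't consume any characters while matching.
The . is a wildcard and matches any single character (other than a newline, by default).

So, the idea is to match a run of characters, ., none of which begins a */, and only then the closing */ and </div>. The lookahead has to come before the ., because with .(?!\*/) the last character of the comment (the one directly followed by */) could never be consumed, so a comment such as /*abc*/ would not match.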
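To see the lookahead idea applied to the two strings from the question, here is a minimal sketch. It places the lookahead before the wildcard, (?:(?!\*/).)*, so that the last character of the comment can still be consumed. The class and method names are made up for illustration and are not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommentOnlyDivDemo
{
    // Matches a <div> whose entire content is one /*...*/ comment,
    // optionally followed by <br/> or <br> (as in the question's pattern).
    private static readonly Regex CommentOnlyDiv = new Regex(
        @"<div>/\*(?:(?!\*/).)*\*/(</div>|<br/></div>|<br></div>)");

    // Hypothetical helper: removes every div that holds only a comment.
    public static string RemoveCommentOnlyDivs(string html)
    {
        return CommentOnlyDiv.Replace(html, String.Empty);
    }

    public static void Main()
    {
        // test1: the whole div is removed, so an empty line is printed.
        Console.WriteLine(RemoveCommentOnlyDivs("<div>/*abc*/</div>"));

        // test2: the first comment is followed by "Contents", not </div>,
        // so nothing matches and the string is printed unchanged.
        Console.WriteLine(RemoveCommentOnlyDivs("<div>/*abc*/Contents/*efg*/</div>"));
    }
}
```

Because each character of the comment body is only consumed when it does not start a */, the match can never run past the first comment terminator, which is what kept the original [^>]+ version from stopping early enough.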

answered Oct 23 '13 at 07:22

Andrew Cheong


Your code works for me, thanks! But what if I want to remove only the /*..*/ in test2? What would the regex look like? That is, remove the whole div if it contains only /*..*/, and if the div contains other characters, remove only the /*..*/. Is that possible? – user2909214 Oct 23 '13 at 07:43
Unfortunately that's probably not possible to do with a single regex. I don't think even C# supports variable-width lookbehind assertions, which is what you'd need. But you probably shouldn't be doing this with regex anyway. Instead, build a loop to find `<div>...</div>`s first, then replace `@"/\*.*?\*/"` within each div. (That `?` makes the `*` non-greedy.) – Andrew Cheong Oct 23 '13 at 07:58
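The per-div approach described in this comment could be sketched roughly as follows. This is only an illustration under some loud assumptions: the divs are flat, carry no attributes, and are not nested; a div left with only <br/> or <br> is treated as empty, mirroring the question's pattern. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DivCommentStripper
{
    // Assumption: simple, non-nested <div>...</div> blocks without attributes.
    private static readonly Regex DivBlock =
        new Regex(@"<div>(.*?)</div>", RegexOptions.Singleline);

    // Non-greedy match of a single /*...*/ comment.
    private static readonly Regex Comment =
        new Regex(@"/\*.*?\*/", RegexOptions.Singleline);

    public static string Clean(string html)
    {
        // Visit each div, strip the comments inside it, then decide
        // whether anything meaningful is left.
        return DivBlock.Replace(html, m =>
        {
            string inner = Comment.Replace(m.Groups[1].Value, String.Empty);

            // Assumption: a leftover <br/> or <br> counts as "empty".
            bool empty = inner == "" || inner == "<br/>" || inner == "<br>";
            return empty ? String.Empty : "<div>" + inner + "</div>";
        });
    }

    public static void Main()
    {
        Console.WriteLine(Clean("<div>/*abc*/</div>"));                // prints an empty line
        Console.WriteLine(Clean("<div>/*abc*/Contents/*efg*/</div>")); // prints <div>Contents</div>
    }
}
```

Using the callback overload of Regex.Replace keeps the decision local to each div, so text outside any div is left untouched.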

score 1 · Answer 2 · answered Oct 23 '13 at 07:33

Why do it in one step? IMHO it is much more readable in two steps:

string s1 = "<div>/*abc*/</div>";
string s2 = "<div>/*abc*/Contents/*efg*/</div>";

Regex findComments = new Regex(@"/\*.*?\*/");
Regex findEmptyDivs = new Regex(@"<div></div>");

s1 = findComments.Replace(s1, "");
s1 = findEmptyDivs.Replace(s1, "");

s2 = findComments.Replace(s2, "");
s2 = findEmptyDivs.Replace(s2, "");
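For reuse, the two passes could be wrapped in one helper, sketched below. The class and method names are invented for this example, and the empty-div pattern is widened to tolerate a trailing <br/> or <br>, which is an assumption borrowed from the question's regex rather than something this answer states.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TwoPassCleaner
{
    // Pass 1: any /*...*/ comment, matched non-greedily.
    private static readonly Regex FindComments =
        new Regex(@"/\*.*?\*/", RegexOptions.Singleline);

    // Pass 2: a div that is empty, or (assumption) holds only a <br> variant.
    private static readonly Regex FindEmptyDivs =
        new Regex(@"<div>(?:<br\s*/?>)?</div>");

    public static string Clean(string html)
    {
        string withoutComments = FindComments.Replace(html, "");
        return FindEmptyDivs.Replace(withoutComments, "");
    }

    public static void Main()
    {
        Console.WriteLine(Clean("<div>/*abc*/</div>"));                // prints an empty line
        Console.WriteLine(Clean("<div>/*abc*/Contents/*efg*/</div>")); // prints <div>Contents</div>
    }
}
```

Note that this variant also strips comments that sit outside any div and removes divs that were already empty in the input, which may or may not be what is wanted.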
