0

I have the following string:

<TD><!-- 1.91 -->6949<!-- 9.11 --></TD>

I want to end up with:

<TD>6949</TD>

but instead I end up with just the tags and no information:

<TD></TD>

This is the regular expression I am using:

RegEx.Replace("<TD><!-- 1.91 -->6949<!-- 9.11 --></TD>","<!--.*-->","")

Can someone explain how to keep the numbers and remove just the comments? Also, if possible, can someone explain why this is happening?

Xaisoft

3 Answers

3

.* is a greedy quantifier, which matches as much as possible.
Here it matches everything from the first <!-- up to the last -->, including the number in between.

Change it to .*?, which is a lazy quantifier and stops at the first -->.
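
To make the difference concrete, here is a minimal sketch, assuming C# (the question only shows the .NET call, so the language is a guess). The class and method names are made up for illustration; only Regex.Replace and the input string come from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazy
{
    // The input string from the question.
    public const string Input = "<TD><!-- 1.91 -->6949<!-- 9.11 --></TD>";

    // Greedy: .* runs on to the LAST "-->", so a single match
    // swallows both comments and the number between them.
    public static string StripGreedy(string html) =>
        Regex.Replace(html, "<!--.*-->", "");

    // Lazy: .*? stops at the FIRST "-->", so each comment
    // is matched and removed on its own.
    public static string StripLazy(string html) =>
        Regex.Replace(html, "<!--.*?-->", "");

    public static void Main()
    {
        Console.WriteLine(StripGreedy(Input)); // <TD></TD>
        Console.WriteLine(StripLazy(Input));   // <TD>6949</TD>
    }
}
```

One thing to watch: by default . does not match a newline, so if a comment can span several lines you would also need to pass RegexOptions.Singleline.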

SLaks
  • Great thanks. So when I use the .*, it doesn't care if anything is in the middle, it keeps on going until it finds the last --> and removes every character in between including the – Xaisoft Jun 30 '11 at 14:42
2

.* is greedy, so it matches as many characters as possible: in this case, from the opening of the first comment to the end of the second. Changing it to .*? fixes this because the ? makes the match lazy, which is to say it matches as few characters as possible. [^>]* also works here, because it cannot match past the > that closes the first comment.
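
The two alternatives are not quite interchangeable, which a small sketch can show. This assumes C# again; the class name, method names and the second test string are invented for illustration and are not from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NegatedClassDemo
{
    // [^>]* can never cross a '>', so it cannot run into the second comment.
    public static string StripNegated(string html) =>
        Regex.Replace(html, "<!--[^>]*-->", "");

    // Lazy version for comparison.
    public static string StripLazy(string html) =>
        Regex.Replace(html, "<!--.*?-->", "");

    public static void Main()
    {
        // The string from the question: both patterns give the same result.
        string simple = "<TD><!-- 1.91 -->6949<!-- 9.11 --></TD>";
        Console.WriteLine(StripNegated(simple)); // <TD>6949</TD>
        Console.WriteLine(StripLazy(simple));    // <TD>6949</TD>

        // Hypothetical input with a '>' inside the comment:
        // the negated class no longer matches, the lazy pattern still does.
        string tricky = "<TD><!-- a > b -->6949</TD>";
        Console.WriteLine(StripNegated(tricky)); // <TD><!-- a > b -->6949</TD>
        Console.WriteLine(StripLazy(tricky));    // <TD>6949</TD>
    }
}
```

So [^>]* is fine for comments like the ones in the question, but .*? is probably the safer default if the comment text is not under your control.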

zellio
2

Parsing HTML with regular expressions is always going to be tricky. Instead, use something like HTML Agility Pack, which lets you query and parse HTML in a structured manner.
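
As a rough sketch of what that could look like, assuming C# and the HtmlAgilityPack NuGet package is installed (the CommentStripper class and StripComments method are hypothetical names, not part of the library):

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package, not part of .NET itself

public static class CommentStripper
{
    // Hypothetical helper: parses the markup, removes every comment node,
    // and returns the remaining markup as a string.
    public static string StripComments(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // XPath for all comment nodes. SelectNodes returns null,
        // not an empty collection, when nothing matches.
        var comments = doc.DocumentNode.SelectNodes("//comment()");
        if (comments != null)
        {
            foreach (HtmlNode comment in comments)
            {
                comment.Remove();
            }
        }

        return doc.DocumentNode.OuterHtml;
    }

    public static void Main()
    {
        Console.WriteLine(StripComments("<TD><!-- 1.91 -->6949<!-- 9.11 --></TD>"));
    }
}
```

For a one-off string like the one in the question this is heavier than the regex fix, but it does not depend on what the comments contain.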

Mrchief