I have this HTML code:

<td class="Class 1">Example</td><td class="Class2">Other Example</td>

and I am trying to use regular expressions in VB.NET to extract "Example" and "Other Example":

Dim parsedtext As MatchCollection = Regex.Matches(htmlcode, ">(.+)<")

(the htmlcode variable contains the html code mentioned above as a string.)

However, parsedtext(0).Groups(0) returns ">Example</td><td class="Class2">Other Example<". I do not understand why this is happening; I have tried many other pattern strings and cannot figure it out. How would one extract all text between two specific characters, such as > and < in the example above?

1 Answer

I agree with @ColeJohnson (no one on SO is allowed to believe otherwise, at this point), but it's a good example for teaching the concept of greedy versus non-greedy matching.

By default, regular expression quantifiers (+, *, ?) "eat up" as much as possible, and only give characters back when a later part of the pattern fails to match. That's called greedy matching. To make a quantifier non-greedy, use its non-greedy form: +?, *?, ??.

That is,

">(.+?)<"

In other words, your .+ kept matching as many characters as possible before the engine looked for a <, so your output was to be expected. The .+ consumed the rest of the line, found no < left to match, and then backtracked to the last place it "saw" a <, which is the one in the final </td>.
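To see the difference on the exact string from the question, here is a minimal sketch. The module name and the CapturedTexts helper are made up for illustration; only Regex.Matches and Match.Groups come from .NET. Note that Groups(0) is the whole match, including the > and <, while Groups(1) is just the part in parentheses. The sketch also tries a third pattern, >([^<]+)<, which is my own addition. Because matches cannot overlap and the two cells sit directly next to each other, the non-greedy pattern picks up the opening tag of the second cell, and the negated character class avoids that.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module GreedyVersusLazy
    ' Hypothetical helper: collects the text captured by group 1 of every match.
    Function CapturedTexts(input As String, pattern As String) As List(Of String)
        Dim results As New List(Of String)
        For Each m As Match In Regex.Matches(input, pattern)
            ' Groups(0) would be the whole match; Groups(1) is the parenthesised part.
            results.Add(m.Groups(1).Value)
        Next
        Return results
    End Function

    Sub Main()
        ' The HTML from the question; "" is an escaped quote inside a VB string literal.
        Dim htmlcode As String =
            "<td class=""Class 1"">Example</td><td class=""Class2"">Other Example</td>"

        ' Greedy: .+ runs to the end of the line, then backs up to the last "<".
        ' Prints: Example</td><td class="Class2">Other Example
        Console.WriteLine(String.Join(" | ", CapturedTexts(htmlcode, ">(.+)<")))

        ' Non-greedy: .+? stops at the first "<" it can, but the second match starts
        ' at the ">" of "</td>" and must take at least one character, so it swallows
        ' the next opening tag.
        ' Prints: Example | <td class="Class2">Other Example
        Console.WriteLine(String.Join(" | ", CapturedTexts(htmlcode, ">(.+?)<")))

        ' Negated class: [^<]+ can never cross a "<", so only the cell text is captured.
        ' Prints: Example | Other Example
        Console.WriteLine(String.Join(" | ", CapturedTexts(htmlcode, ">([^<]+)<")))
    End Sub
End Module
```

If a cell could be empty or contain nested tags, none of these patterns would be reliable, which is the usual argument for an HTML parser over a regex.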

Andrew Cheong