In C#, I have the following Regex pattern (on an HTML string):

Regex TR = new Regex(@"<tr class=""(\w+)""  rel=""(\w+)"">(.+)</tr>");

The problem is that when I run it, the match extends all the way to the last </tr> in the HTML code. There are many <tr> tags in the code, so the (.+) group swallows them and stops only at the last occurrence of </tr>.

I've tried using (\w+) instead, but it doesn't match certain characters inside the tags.

So how can I make this pattern stop at the first </tr>, and not go until the last one in the code?

Dipu
BlueRay101
  • Reading on the subject: [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/a/1732454/335858) – Sergey Kalinichenko Aug 23 '15 at 10:53
  • `.+?` ......... BTW: Use https://htmlagilitypack.codeplex.com/ instead of regex – Eser Aug 23 '15 at 10:55
  • Try `.*?` instead of `.+`. – Sebastian Simon Aug 23 '15 at 10:56
  • @Eser - It works! Thank you very much! Can you please explain to me how `?` works in Regex? I saw it mentioned on a website but didn't understand how exactly it makes the Regex go only until the first occurrence of `</tr>` in this specific case. – BlueRay101 Aug 23 '15 at 10:57
  • @BlueRay101 search for *non-greedy* match. – Eser Aug 23 '15 at 11:02
  • @Eser, I read about it and understood it. Thanks for your help! I'll also check out the HTML agility pack. – BlueRay101 Aug 23 '15 at 11:09
  • Playing with [balancing feature](http://www.regular-expressions.info/balancing.html) of .NET [tried this](http://regexhero.net/tester/?id=e1e26b38-eba5-4778-b615-2a5a2bb55dbc) as [explained here](http://www.rassoc.com/gregr/weblog/2003/05/15/nested-constructs-in-regular-expressions/), but sure better to use a parser. – Jonny 5 Aug 23 '15 at 11:49
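
To make the `.+?` suggestion from the comments concrete, here is a minimal sketch comparing the greedy and lazy quantifiers. The sample HTML string, the class name and the helper method are made up for illustration; only the two patterns come from the thread.

```csharp
using System;
using System.Text.RegularExpressions;

static class LazyQuantifierDemo
{
    // Hypothetical sample input: two rows on a single line.
    public const string Html =
        "<tr class=\"odd\"  rel=\"r1\">first</tr><tr class=\"even\"  rel=\"r2\">second</tr>";

    // The pattern from the question: (.+) is greedy.
    public const string GreedyPattern = @"<tr class=""(\w+)""  rel=""(\w+)"">(.+)</tr>";

    // The same pattern with the lazy quantifier suggested in the comments.
    public const string LazyPattern = @"<tr class=""(\w+)""  rel=""(\w+)"">(.+?)</tr>";

    // Returns the third capture group of the first match, or "" if nothing matched.
    public static string FirstRowContent(string pattern)
    {
        Match m = Regex.Match(Html, pattern);
        return m.Success ? m.Groups[3].Value : string.Empty;
    }

    static void Main()
    {
        // Greedy: runs to the LAST </tr>, so the second row's markup is included.
        Console.WriteLine(FirstRowContent(GreedyPattern));

        // Lazy: expands one character at a time and stops at the FIRST </tr>.
        Console.WriteLine(FirstRowContent(LazyPattern)); // prints "first"
    }
}
```

Note that `.` does not match a newline by default, so rows that span several lines would probably also need `RegexOptions.Singleline`.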

2 Answers

The following Regex pattern will stop at the first </tr> tag:

<tr(\s+)class(\s*)=(\s*)"[^"]*"(\s+)rel(\s*)=(\s*)"[^"]*"(\s*)>(.(?!<\/tr>))*[\s\S]<\/tr>

You can change your code to the following to get what you want:

Regex TR = new Regex(@"<tr class=""(\w+)""  rel=""(\w+)"">(.(?!<\/tr>))*[\s\S]</tr>");

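(?!ABC) is called a negative lookahead. It asserts that ABC does not match at the current position, without consuming any characters; if ABC would match there, that attempt fails. Here, .(?!<\/tr>) matches any character that is not immediately followed by </tr>, and the trailing [\s\S] consumes the single character that sits directly before </tr>.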
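
As a rough illustration of how the lookahead behaves, the sketch below runs the pattern from this answer against a made-up two-row string. One thing worth checking in your own code: because the group `(.(?!<\/tr>))` is repeated, group 3 holds only its last repetition, not the whole row content. The second pattern is a variation that is not part of the answer; it puts the lookahead before the dot and wraps the repetition in one capturing group. The class name, method names and sample input are assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

static class LookaheadDemo
{
    // Hypothetical sample input: two rows on a single line.
    public const string Html =
        "<tr class=\"odd\"  rel=\"r1\">first</tr><tr class=\"even\"  rel=\"r2\">second</tr>";

    // The pattern from the answer, unchanged.
    public const string AnswerPattern =
        @"<tr class=""(\w+)""  rel=""(\w+)"">(.(?!<\/tr>))*[\s\S]</tr>";

    // Variation (an assumption, not from the answer): one group around the whole repetition.
    public const string CapturingPattern =
        @"<tr class=""(\w+)""  rel=""(\w+)"">((?:(?!</tr>).)*)</tr>";

    // Whole text of the first match.
    public static string WholeMatch(string pattern) => Regex.Match(Html, pattern).Value;

    // Third capture group of the first match.
    public static string Group3(string pattern) => Regex.Match(Html, pattern).Groups[3].Value;

    static void Main()
    {
        // The match stops at the first </tr>.
        Console.WriteLine(WholeMatch(AnswerPattern));

        // A repeated group keeps only its last repetition.
        Console.WriteLine(Group3(AnswerPattern));    // prints "s"

        // The variation captures the full row content.
        Console.WriteLine(Group3(CapturingPattern)); // prints "first"
    }
}
```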

For future reference: Try using RegExr to create and test your regex patterns.

Dipu
> So how can I make this pattern stop at the first </tr>

The most effective approach to capturing is not to consume blindly, but to consume only what is known.

Since the text to grab sits between the delimiters > and <, why not use the ending delimiter, the <, to give the regex engine a hint?

By placing the ^ character (which negates a set) at the start of a set [ ], we tell the engine to consume characters until one of the listed characters is hit.

In your case change

>(.+)</tr>

to use [^<]+, which says to consume one or more characters that are not <, stopping as soon as a < is hit:

>([^<]+)</tr>

The [^ ] set is a powerful construct which I use in 90% of my regex patterns instead of blindly consuming with .+ or the even more side-effect-prone .*.
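
Here is a small sketch of the negated set in action, assuming made-up sample rows. It also shows a consequence of the technique: since `[^<]+` refuses to cross any `<`, a row that contains nested tags such as `<td>` does not match at all, so this form fits rows whose content is plain text. The class name, method name and inputs are illustrative assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

static class NegatedSetDemo
{
    // The question's pattern with (.+) swapped for ([^<]+).
    public const string Pattern = @"<tr class=""(\w+)""  rel=""(\w+)"">([^<]+)</tr>";

    // Hypothetical inputs: plain-text rows, and a row with a nested cell.
    public const string PlainRows =
        "<tr class=\"odd\"  rel=\"r1\">first</tr><tr class=\"even\"  rel=\"r2\">second</tr>";
    public const string NestedRow =
        "<tr class=\"odd\"  rel=\"r1\"><td>first</td></tr>";

    // Returns the third capture group of the first match, or "" if nothing matched.
    public static string FirstRowContent(string html)
    {
        Match m = Regex.Match(html, Pattern);
        return m.Success ? m.Groups[3].Value : string.Empty;
    }

    static void Main()
    {
        // Stops at the first '<', which here is the start of the first </tr>.
        Console.WriteLine(FirstRowContent(PlainRows)); // prints "first"

        // The '<' of <td> ends [^<]+ before any character is consumed: no match.
        Console.WriteLine(Regex.IsMatch(NestedRow, Pattern)); // prints "False"
    }
}
```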


Also, to make your pattern easier to handle, use \x22 in lieu of " so that you are not fighting with the C# parser before the regex parser.
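
For comparison, a minimal sketch of the same pattern written both ways: with doubled quotes inside a verbatim string, and with the `\x22` hex escape, which the regex engine reads as a double quote. The sample input, class name and method name are assumptions made for the example.

```csharp
using System;
using System.Text.RegularExpressions;

static class HexEscapeDemo
{
    // Hypothetical sample input: a single row.
    public const string Html = "<tr class=\"odd\"  rel=\"r1\">first</tr>";

    // Quotes doubled for the C# verbatim string.
    public const string DoubledQuotes = @"<tr class=""(\w+)""  rel=""(\w+)"">([^<]+)</tr>";

    // Same pattern using \x22, so the C# literal contains no quote characters.
    public const string HexEscaped = @"<tr class=\x22(\w+)\x22  rel=\x22(\w+)\x22>([^<]+)</tr>";

    // Joins the three capture groups of the first match with '|'.
    public static string Groups(string pattern)
    {
        Match m = Regex.Match(Html, pattern);
        return string.Join("|", m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value);
    }

    static void Main()
    {
        // Both spellings describe the same regex and give the same groups.
        Console.WriteLine(Groups(DoubledQuotes)); // prints "odd|r1|first"
        Console.WriteLine(Groups(HexEscaped));    // prints "odd|r1|first"
    }
}
```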

ΩmegaMan