Regex - Pattern finds parts of itself with (.+)

Question

In C#, I have the following Regex pattern (on an HTML string):

Regex TR = new Regex(@"<tr class=""(\w+)""  rel=""(\w+)"">(.+)</tr>");

The problem is that when I run it, the match extends all the way to the last </tr> in the HTML. There are many <tr> tags in the code, so the (.+) group swallows them and stops only at the last occurrence of </tr>.

I've tried using (\w+) instead, but it doesn't get certain characters inside the tags.

So how can I make this pattern stop at the first </tr>, and not go until the last one in the code?

Reading on the subject: [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/a/1732454/335858) — Sergey Kalinichenko, Aug 23 '15 at 10:53
`.+?` ......... BTW: Use https://htmlagilitypack.codeplex.com/ instead of regex — Eser, Aug 23 '15 at 10:55
@Eser - It works! Thank you very much! Can you please explain to me how `?` works in Regex? I saw it mentioned on a website but didn't understand how exactly it makes the Regex go only until the first occurrence of `</tr>` in this specific case. — BlueRay101, Aug 23 '15 at 10:57
@Eser, I read about it and understood it. Thanks for your help! I'll also check out the HTML agility pack. — BlueRay101, Aug 23 '15 at 11:09
Playing with [balancing feature](http://www.regular-expressions.info/balancing.html) of .NET [tried this](http://regexhero.net/tester/?id=e1e26b38-eba5-4778-b615-2a5a2bb55dbc) as [explained here](http://www.rassoc.com/gregr/weblog/2003/05/15/nested-constructs-in-regular-expressions/), but sure better to use a parser. — Jonny 5, Aug 23 '15 at 11:49
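
The `.+?` suggestion from the comments can be sketched as below. This is only an illustration: the sample HTML string, the class name, and the single space between the attributes are assumptions made for the demo, not taken from the asker's page.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyQuantifierDemo
{
    // Hypothetical sample input: two rows on a single line.
    public const string Html =
        "<tr class=\"odd\" rel=\"r1\"><td>A</td></tr>" +
        "<tr class=\"even\" rel=\"r2\"><td>B</td></tr>";

    // Greedy: (.+) takes as much as it can, then backtracks to the LAST </tr>.
    public static readonly Regex Greedy =
        new Regex(@"<tr class=""(\w+)"" rel=""(\w+)"">(.+)</tr>");

    // Lazy: (.+?) takes as little as it can, so it stops at the FIRST </tr>.
    public static readonly Regex Lazy =
        new Regex(@"<tr class=""(\w+)"" rel=""(\w+)"">(.+?)</tr>");

    public static void Main()
    {
        // The greedy pattern yields one match that spans both rows.
        Console.WriteLine(Greedy.Matches(Html).Count);

        // The lazy pattern yields one match per row.
        foreach (Match m in Lazy.Matches(Html))
        {
            Console.WriteLine(m.Groups[3].Value);
        }
    }
}
```

By default `.` does not match a newline, so rows that span several lines would likely also need `RegexOptions.Singleline`.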

score 0 · Answer 1 · answered Aug 23 '15 at 12:14

The following Regex pattern will stop at the first </tr> tag:

<tr(\s+)class(\s*)=(\s*)"[^"]*"(\s+)rel(\s*)=(\s*)"[^"]*"(\s*)>(.(?!<\/tr>))*[\s\S]<\/tr>

You can change your code to the following to get what you want:

Regex TR = new Regex(@"<tr class=""(\w+)""  rel=""(\w+)"">(.(?!<\/tr>))*[\s\S]</tr>");

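(?!ABC) is called a negative lookahead. It asserts that ABC does not match at the current position, without consuming any characters. Here, .(?!<\/tr>) matches a character only if it is not immediately followed by </tr>, and the final [\s\S] consumes the last character before the closing tag.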
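
As a rough sketch of the lookahead idea, the variant below checks `(?!</tr>)` before consuming each character and wraps the repetition in an outer group so that `Groups[3]` holds the whole row content. It is a different arrangement from the pattern above, where the repeated group `(.(?!<\/tr>))*` would expose only its last iteration through `Group.Value`. The sample HTML, the class and method names, and the single space between the attributes are assumptions for the demo.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    // Hypothetical sample input: two rows on a single line.
    public const string Html =
        "<tr class=\"odd\" rel=\"r1\"><td>A</td></tr>" +
        "<tr class=\"even\" rel=\"r2\"><td>B</td></tr>";

    // (?:(?!</tr>).)+ consumes one character at a time, but only while
    // the text at the current position is not the start of "</tr>".
    public static readonly Regex Tempered =
        new Regex(@"<tr class=""(\w+)"" rel=""(\w+)"">((?:(?!</tr>).)+)</tr>");

    // Returns the captured content of every matched row.
    public static string[] RowContents(string html)
    {
        MatchCollection matches = Tempered.Matches(html);
        var result = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            result[i] = matches[i].Groups[3].Value;
        }
        return result;
    }

    public static void Main()
    {
        foreach (string content in RowContents(Html))
        {
            Console.WriteLine(content);
        }
    }
}
```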

For future reference: Try using RegExr to create and test your regex patterns.

ΩmegaMan · Answer 2 · 2015-08-23T18:20:53.060

> So how can I make this pattern stop at the first </tr>

The most effective approach to capturing is not to consume blindly, but to consume only what is known.

Since the text to grab falls between the delimiters > and <, why not use the ending delimiter, the <, to give the regex parser a hint?

By placing the ^ character (which negates the set) at the start of a set [ ], we tell the parser to consume until one of the listed characters is hit.

In your case change

>(.+)</tr>

to use [^<]+, which says: consume one or more characters, stopping as soon as a < is hit:

>([^<]+)</tr>

The [^ ] set is a powerful tool which I use in 90% of my regex patterns instead of blindly consuming with .+ or the even more side-effect-prone .*.

Also, to make your pattern easier to handle, use \x22 in lieu of " so you are not fighting the C# string escaping before the pattern ever reaches the regex parser.
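
Both tips can be sketched together as follows. The sample strings, the class name, and the single space between the attributes are assumptions for the demo. Note that the sample rows contain plain text only: `[^<]+` stops at the first `<`, so a row that contains nested tags such as `<td>` would not match this pattern.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NegatedSetDemo
{
    // Hypothetical sample input: rows with plain text content only.
    public const string PlainRows =
        "<tr class=\"odd\" rel=\"r1\">first row</tr>" +
        "<tr class=\"even\" rel=\"r2\">second row</tr>";

    // Hypothetical sample input: a row with a nested tag.
    public const string NestedRow =
        "<tr class=\"odd\" rel=\"r1\"><td>A</td></tr>";

    // \x22 is the regex escape for a double quote, so the verbatim
    // C# string needs no doubled "" sequences.
    public static readonly Regex Row =
        new Regex(@"<tr class=\x22(\w+)\x22 rel=\x22(\w+)\x22>([^<]+)</tr>");

    public static void Main()
    {
        foreach (Match m in Row.Matches(PlainRows))
        {
            Console.WriteLine(m.Groups[3].Value);
        }

        // The nested row does not match, because [^<]+ cannot
        // consume the "<" that opens the <td> tag.
        Console.WriteLine(Row.IsMatch(NestedRow));
    }
}
```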
