C# Regex Problem

Question

I want to extract all table rows from an HTML page, but the pattern @"<tr>([\w\W]*)</tr>" is not working. It gives a single result that runs from the first occurrence of <tr> to the last occurrence of </tr>, whereas I want every <tr>...</tr> occurrence. Can anyone please tell me how I can do this?

score 5 · Answer 1 · edited Nov 27 '17 at 00:04

[\w\W]* matches greedily so it will match from the first <tr> to the last </tr>.

A regex approach won't work well because HTML is not a regular language. If you really want to try, you could use a lazy quantifier, such as "<tr>(.*?)</tr>" with the RegexOptions.Singleline flag; however, this isn't guaranteed to work in all cases.
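To make the greedy versus lazy difference concrete, here is a minimal sketch. The class name, method names, and the sample HTML are made up for illustration; only the two patterns and the RegexOptions.Singleline flag come from the thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazy
{
    // Hypothetical sample input; a real page would be far messier.
    public const string Html =
        "<table>\n<tr><td>a</td></tr>\n<tr><td>b</td></tr>\n</table>";

    // Greedy: [\w\W]* consumes as much as possible, so the single match
    // runs from the first <tr> to the last </tr>.
    public static int CountGreedy(string html) =>
        Regex.Matches(html, @"<tr>([\w\W]*)</tr>").Count;

    // Lazy: .*? stops at the nearest </tr>. Singleline lets '.' match
    // newlines, so rows spanning several lines are still found.
    public static int CountLazy(string html) =>
        Regex.Matches(html, "<tr>(.*?)</tr>", RegexOptions.Singleline).Count;

    public static void Main()
    {
        Console.WriteLine(CountGreedy(Html)); // prints 1
        Console.WriteLine(CountLazy(Html));   // prints 2
    }
}
```

On this input the greedy pattern yields one match and the lazy one yields two, which is the behaviour described in the question.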

For parsing HTML you need an HTML parser. Try HTML Agility Pack.
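A rough sketch of what the parser route might look like, assuming the HtmlAgilityPack NuGet package is referenced in the project. The RowExtractor class and GetRows method are hypothetical names for this example, not part of the library.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class RowExtractor
{
    // Returns the inner HTML of every <tr> element in the given markup.
    public static List<string> GetRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var rows = new List<string>();

        // SelectNodes takes an XPath expression and returns null
        // (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//tr");
        if (nodes != null)
        {
            foreach (HtmlNode node in nodes)
            {
                rows.Add(node.InnerHtml);
            }
        }
        return rows;
    }
}
```

Because the XPath query selects elements rather than text, rows written as <tr class="..."> or in upper case should be picked up as well, which a literal <tr> pattern would miss.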

edited Nov 27 '17 at 00:04

carla

1,970
1
31
44

answered Feb 04 '11 at 22:55

Mark Byers

811,555
193
1,581
1,452

And we all know what happens when you try to parse html with a regex... http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Zach Johnson Feb 04 '11 at 22:58
Another question: is there any way I can do it using regex? – Barun Feb 04 '11 at 22:58
This page shows a quick example of how the HTML Agility Pack library can be used: http://htmlagilitypack.codeplex.com/wikipage?title=Examples – Mark Byers Feb 04 '11 at 23:09

score 2 · Accepted Answer · answered Feb 04 '11 at 23:00

I agree with Mark: you should use the HTML Agility Pack library.

As for your regex, you should go with something like:

@"<tr>([\s\S]*?)</tr>"

That's a non-greedy pattern, and you should get one match for every TR.
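As an illustration, the pattern above could be used like this to pull out the contents of each row. The TrMatcher class, the ExtractRows method, and the sample input are assumptions made for this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TrMatcher
{
    // Collects the text captured between each <tr> and its nearest </tr>.
    public static List<string> ExtractRows(string html)
    {
        var rows = new List<string>();
        foreach (Match m in Regex.Matches(html, @"<tr>([\s\S]*?)</tr>"))
        {
            rows.Add(m.Groups[1].Value); // group 1 is the row's inner content
        }
        return rows;
    }

    public static void Main()
    {
        // Hypothetical input: the second row carries an attribute.
        string html =
            "<tr>\n<td>1</td>\n</tr>" +
            "<tr class=\"odd\"><td>2</td></tr>" +
            "<tr><td>3</td></tr>";

        foreach (string row in ExtractRows(html))
        {
            Console.WriteLine(row.Trim());
        }
    }
}
```

This prints the first and third rows only. The row written as <tr class="odd"> is skipped because the pattern looks for a literal <tr>, which is one of the cases where the regex route falls short of a real parser.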

answered Feb 04 '11 at 23:00

Rubens Farias

57,174
8
131
162

Another question: can you point me to a link or book where I can properly learn regex in C#? – Barun Feb 04 '11 at 23:06