I want to extract all table rows from an HTML page.
But the pattern @"<tr>([\w\W]*)</tr>" is not working.
It gives a single result, spanning from the first occurrence of <tr> to the last occurrence of </tr>.
But I want every occurrence of <tr>...</tr> as a separate match.
Can anyone please tell me how I can do this?
2 Answers
5
[\w\W]* matches greedily, so it will match from the first <tr> to the last </tr>.
A regex approach won't work well because HTML is not a regular language. If you really want to, you could try a lazy quantifier such as "<tr>(.*?)</tr>" with the RegexOptions.Singleline flag; however, this isn't guaranteed to work in all cases. For example, it will not match a row written as <tr class="..."> and it will give wrong results for nested tables.
For parsing HTML you need an HTML parser. Try HTML Agility Pack.
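As a rough illustration of the parser route, here is a minimal sketch that assumes the HtmlAgilityPack NuGet package is installed. The sample HTML string is made up for the example; a real page would be loaded from a file or an HTTP response instead.

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package, assumed to be installed

class Program
{
    static void Main()
    {
        // Hypothetical sample input; in practice this would be the downloaded page.
        string html = "<table><tr><td>a</td></tr><tr class=\"odd\"><td>b</td></tr></table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // The XPath "//tr" selects every <tr> element, with or without attributes.
        // SelectNodes returns null (not an empty collection) when nothing matches.
        var rows = doc.DocumentNode.SelectNodes("//tr");
        if (rows == null)
        {
            Console.WriteLine("No rows found.");
            return;
        }

        foreach (HtmlNode row in rows)
        {
            // OuterHtml includes the <tr> tags; InnerHtml would give only the cells.
            Console.WriteLine(row.OuterHtml);
        }
    }
}
```

Unlike the regex, this also picks up rows that carry attributes, such as the second row in the sample.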

carla

Mark Byers
- And we all know what happens when you try to parse HTML with a regex... http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Zach Johnson Feb 04 '11 at 22:58
- Another question: is there any way I can do it using a regex? – Barun Feb 04 '11 at 22:58
- This page shows a quick example of how the HTML Agility Pack library can be used: http://htmlagilitypack.codeplex.com/wikipage?title=Examples – Mark Byers Feb 04 '11 at 23:09
2
I agree with Mark: you should use the HTML Agility Pack library.
As for your regex, you should go with something like:
@"<tr>([\s\S]*?)</tr>"
That's a non-greedy pattern, so you should get one match for every TR.
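To make the difference concrete, here is a small sketch comparing the greedy and the non-greedy pattern on a made-up two-row table. The input string is hypothetical, and the caveats about attributes and nested tables still apply.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // Hypothetical sample input with two rows, split across lines.
        string html = "<table>\n<tr><td>a</td></tr>\n<tr><td>b</td></tr>\n</table>";

        // Greedy: a single match from the first <tr> to the last </tr>.
        MatchCollection greedy = Regex.Matches(html, @"<tr>([\s\S]*)</tr>");

        // Lazy: each match stops at the nearest </tr>.
        // [\s\S] already matches newlines, so RegexOptions.Singleline is not needed here.
        MatchCollection lazy = Regex.Matches(html, @"<tr>([\s\S]*?)</tr>");

        Console.WriteLine(greedy.Count); // 1
        Console.WriteLine(lazy.Count);   // 2

        foreach (Match m in lazy)
        {
            // Group 1 holds the content between <tr> and </tr>.
            Console.WriteLine(m.Groups[1].Value);
        }
    }
}
```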

Rubens Farias
- Another question: can you point me to a link or a book where I can learn C# regular expressions properly? – Barun Feb 04 '11 at 23:06