0

I want to extract all table rows from an HTML page, but the pattern @"<tr>([\w\W]*)</tr>" is not working. It gives a single result that spans from the first occurrence of <tr> to the last occurrence of </tr>, whereas I want every <tr>...</tr> occurrence as a separate match. Can anyone please tell me how I can do this?

dtb
Barun

2 Answers

5

[\w\W]* matches greedily, so it will match from the first <tr> to the last </tr>.

A regex approach won't work well because HTML is not a regular language. If you really want to use a regex, you could try a lazy quantifier such as "<tr>(.*?)</tr>" with the RegexOptions.Singleline flag; however, this isn't guaranteed to work in all cases (for example, rows whose <tr> tag carries attributes, or nested tables).
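
To make the greedy/lazy difference concrete, here is a minimal sketch that counts matches for three variants of the pattern. The sample HTML string and the names GreedyVsLazyDemo and CountMatches are made up for illustration; only Regex.Matches and RegexOptions come from .NET.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazyDemo
{
    // Hypothetical sample input (not from the question): two rows, the second spans several lines.
    public const string SampleHtml =
        "<table>\n<tr><td>1</td></tr>\n<tr>\n<td>2</td>\n</tr>\n</table>";

    // Counts how many times the pattern matches the input with the given options.
    public static int CountMatches(string html, string pattern, RegexOptions options)
    {
        return Regex.Matches(html, pattern, options).Count;
    }

    public static void Main()
    {
        // Greedy: a single match running from the first <tr> to the last </tr>.
        Console.WriteLine(CountMatches(SampleHtml, @"<tr>([\w\W]*)</tr>", RegexOptions.None)); // 1

        // Lazy with Singleline: '.' also matches newlines, so each row is its own match.
        Console.WriteLine(CountMatches(SampleHtml, @"<tr>(.*?)</tr>", RegexOptions.Singleline)); // 2

        // Lazy without Singleline: '.' stops at newlines, so the multi-line row is missed.
        Console.WriteLine(CountMatches(SampleHtml, @"<tr>(.*?)</tr>", RegexOptions.None)); // 1
    }
}
```

The third call shows why the Singleline flag matters here: without it, only rows written on a single line are found.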

For parsing HTML you need an HTML parser. Try HTML Agility Pack.
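
As a rough sketch of the parser route, the following assumes the third-party HtmlAgilityPack NuGet package is installed. The class name AgilityPackRows, the method ExtractRows, and the sample input are hypothetical; HtmlDocument.LoadHtml and DocumentNode.SelectNodes are the library calls being relied on.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party package, assumed to be referenced by the project

public static class AgilityPackRows
{
    // Returns the inner HTML of every <tr> element in the document.
    public static List<string> ExtractRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var rows = new List<string>();
        // SelectNodes returns null (not an empty collection) when nothing matches the XPath.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//tr");
        if (nodes != null)
        {
            foreach (HtmlNode tr in nodes)
            {
                rows.Add(tr.InnerHtml);
            }
        }
        return rows;
    }

    public static void Main()
    {
        // Hypothetical input: one row has an attribute, the other uses upper-case tags.
        string html = "<table><tr class=\"x\"><td>a</td></tr><TR><td>b</td></TR></table>";
        foreach (string row in ExtractRows(html))
        {
            Console.WriteLine(row);
        }
    }
}
```

Unlike the literal "<tr>" regex, the XPath query selects rows regardless of attributes or tag casing, which is the main reason to prefer a parser.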

carla
Mark Byers
  • 2
    And we all know what happens when you try to parse html with a regex... http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Zach Johnson Feb 04 '11 at 22:58
  • Another question: is there any way I can do it using regex? – Barun Feb 04 '11 at 22:58
  • 1
    This page shows a quick example of how the HTML Agility Pack library can be used: http://htmlagilitypack.codeplex.com/wikipage?title=Examples – Mark Byers Feb 04 '11 at 23:09
2

I agree with Mark: you should use the HTML Agility Pack library.

As for your regex, you should go with something like:

@"<tr>([\s\S]*?)</tr>"

That's a non-greedy pattern, so you should get one match for every TR.
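
A minimal sketch of how that pattern might be used to collect the contents of each row. The class name RowExtractor, the method ExtractRows, and the sample input are hypothetical; the capture group is read through Match.Groups[1].

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RowExtractor
{
    // Returns the text between each <tr> and the nearest following </tr>.
    public static List<string> ExtractRows(string html)
    {
        var rows = new List<string>();
        // [\s\S] matches any character including newlines, so no RegexOptions are needed.
        foreach (Match m in Regex.Matches(html, @"<tr>([\s\S]*?)</tr>"))
        {
            rows.Add(m.Groups[1].Value);
        }
        return rows;
    }

    public static void Main()
    {
        // Hypothetical sample input with one single-line row and one multi-line row.
        string html = "<table><tr><td>a</td></tr>\n<tr>\n<td>b</td></tr></table>";
        foreach (string row in ExtractRows(html))
        {
            Console.WriteLine("[" + row + "]");
        }
    }
}
```

Note that the pattern only matches the literal text <tr>, so a row written as <TR> or with attributes would be skipped.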

Rubens Farias
  • Another question... Can you provide a link or book name where I can learn all of this C# regex material properly? – Barun Feb 04 '11 at 23:06