0

I have HTML that contains text like this:

.......
<a class="product_name" href="index.php?productID=29785">Funny</a>
........
<a class="product_name" href="index.php?productID=29787">Very Funny</a>
......

I'd like to extract the href attribute value and the link text, so the result would be:

"index.php?productID=29785", "Funny"
"index.php?productID=29787", "Very Funny"

For this I use:

MatchCollection mc = Regex.Matches(pageData, 
   "<a class=\"product_name\" href=\"(.+)\">(.+)</a>");

But when I debug the code, I see that mc.Count = 0.

I think I didn't escape the quotes properly, but I'm not sure.

Sklivvz
takayoshi
  • Parsing HTML with regex is infamously not a good idea. – Marc Gravell Nov 05 '11 at 21:03
  • I get count=2 here, btw, with capture groups that work as expected. The regex shown works for the HTML shown. If it isn't working, then either (a) you aren't presenting the scenario identically, or (b) the HTML is more complicated, making it insanely hard for all the many reasons that you shouldn't parse HTML with regex. – Marc Gravell Nov 05 '11 at 21:05
  • Agreed. It works here as well (http://regexhero.net/tester/). – Sklivvz Nov 05 '11 at 21:09
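
As a quick check of the comments above, here is a minimal sketch that runs the question's pattern against the two sample lines. The names `RegexCheck` and `FindLinks` are made up for illustration, and the input assumes one anchor per line, exactly as posted. The pattern is literal about quotes, attribute order and spacing, so if the real page uses single quotes, puts href before class, or has extra whitespace inside the tag, it finds nothing, which could explain a count of 0.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class RegexCheck
{
    // Hypothetical helper: returns (href, text) pairs found by the question's pattern.
    public static List<Tuple<string, string>> FindLinks(string pageData)
    {
        var result = new List<Tuple<string, string>>();

        // The pattern from the question, unchanged. The escaping is fine as written.
        MatchCollection mc = Regex.Matches(pageData,
            "<a class=\"product_name\" href=\"(.+)\">(.+)</a>");

        foreach (Match m in mc)
        {
            result.Add(Tuple.Create(m.Groups[1].Value, m.Groups[2].Value));
        }
        return result;
    }

    static void Main()
    {
        // Sample input copied from the question, one anchor per line.
        string pageData =
            "<a class=\"product_name\" href=\"index.php?productID=29785\">Funny</a>\n" +
            "<a class=\"product_name\" href=\"index.php?productID=29787\">Very Funny</a>\n";

        foreach (var link in FindLinks(pageData))
        {
            Console.WriteLine("\"{0}\", \"{1}\"", link.Item1, link.Item2);
        }
    }
}
```

With the sample as posted this yields two pairs, so the next thing to compare is the actual content of pageData against the snippet in the question.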

2 Answers

5

Don't parse HTML with regex. See here for a compelling reason why.

Use the HTML Agility Pack instead.
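
A minimal sketch of what that could look like, assuming the HtmlAgilityPack NuGet package is referenced and the links carry exactly class="product_name" as in the question. `LinkExtractor` and `ExtractLinks` are illustrative names, not part of the library.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package, not part of .NET itself

static class LinkExtractor
{
    // Hypothetical helper: returns (href, text) pairs for every product link.
    public static List<Tuple<string, string>> ExtractLinks(string pageData)
    {
        var result = new List<Tuple<string, string>>();

        var doc = new HtmlDocument();
        doc.LoadHtml(pageData);

        // SelectNodes can return null rather than an empty collection
        // when nothing matches, so check before iterating.
        var nodes = doc.DocumentNode.SelectNodes("//a[@class='product_name']");
        if (nodes == null)
        {
            return result;
        }

        foreach (HtmlNode node in nodes)
        {
            string href = node.GetAttributeValue("href", "");
            result.Add(Tuple.Create(href, node.InnerText));
        }
        return result;
    }

    static void Main()
    {
        string pageData =
            "<a class=\"product_name\" href=\"index.php?productID=29785\">Funny</a>" +
            "<a class=\"product_name\" href=\"index.php?productID=29787\">Very Funny</a>";

        foreach (var link in ExtractLinks(pageData))
        {
            Console.WriteLine("\"{0}\", \"{1}\"", link.Item1, link.Item2);
        }
    }
}
```

Unlike the literal regex, the XPath query does not care about attribute order, quote style or whitespace inside the tag.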

carla
Oded
  • Regex is not good for parsing HTML, but I wouldn't add a new dependency to my project for such a simple task. – L.B Nov 05 '11 at 21:14
  • @L.B - What would you suggest then? Writing your own parser/tokenizer? – Oded Nov 05 '11 at 21:15
  • @L.B - You seem to be contradicting yourself. _"Regex is not good for parsing html"_ ... _"I would use Regex"_. – Oded Nov 05 '11 at 21:18
-1

Review the following threads to find possible solution(s):

http://www.dotnetperls.com/scraping-html

Regex to Parse Hyperlinks and Descriptions

Parse HTML links using C#

Community
Mikhail
  • -1 - http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Oded Nov 05 '11 at 21:05