
I have some HTML that I parse with the following regex:

MatchCollection matches = Regex.Matches(go, @"photoWrapper""><div><a href=""(?<id>[^""]+?)\?");

I receive:

matches[0].Groups["id"].Value = "/group/47502002094086";
matches[1].Groups["id"].Value = "/dk";
matches[2].Groups["id"].Value = "/prostooglavnom";

How can I change my regex, or add something to it, so that the matches contain only:

matches[0].Groups["id"].Value = "47502002094086";
matches[1].Groups["id"].Value = "prostooglavnom";

Any help? Full HTML: http://pastebin.com/xEJNiD4G

1 Answer


You have just discovered for yourself why Regex is a poor choice for parsing HTML.

I suggest you use the HTML Agility Pack to parse and query your HTML.

The source download comes with many example projects.

What is exactly the Html Agility Pack (HAP)?

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
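As a rough illustration of that approach, here is a minimal sketch for the links in the question. It makes several assumptions that are not confirmed by the posted HTML: that the wrapper element carries `photoWrapper` in its `class` attribute, that the ID is the last path segment of the `href` before any `?`, and that `dk` is a non-group page to be skipped (inferred only from the desired output in the question). `GroupLinks`, `ExtractId`, and `FromHtml` are hypothetical names made up for this example.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

public static class GroupLinks
{
    // Assumption: "dk" is not a group page; inferred from the asker's desired output.
    private static readonly HashSet<string> Skip = new HashSet<string> { "dk" };

    // Returns the last path segment of href (query string dropped),
    // or null when the segment is empty or listed in Skip.
    public static string ExtractId(string href)
    {
        int q = href.IndexOf('?');
        string path = q >= 0 ? href.Substring(0, q) : href;
        string id = path.Substring(path.LastIndexOf('/') + 1);
        return id.Length == 0 || Skip.Contains(id) ? null : id;
    }

    // Collects IDs from <a href> elements nested as photoWrapper > div > a.
    public static List<string> FromHtml(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var ids = new List<string>();
        // Assumption: the wrapper is identified by its class attribute.
        var links = doc.DocumentNode.SelectNodes(
            "//*[contains(@class,'photoWrapper')]/div/a[@href]");
        if (links == null) return ids; // SelectNodes returns null when nothing matches

        foreach (HtmlNode a in links)
        {
            string id = ExtractId(a.GetAttributeValue("href", ""));
            if (id != null) ids.Add(id);
        }
        return ids;
    }
}
```

Calling `GroupLinks.FromHtml(go)` would then replace the `Regex.Matches` call. Unlike the original pattern, this sketch does not require the `href` to contain a `?`, so adjust the filter if that distinction matters for your pages.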

Oded
  • Saw the question title in the list and thought "even money says someone has posted a link to THAT answer" – Kevin Dec 14 '12 at 14:15
  • I don't want to use this lib. I want to use regexp=\ – user1895750 Dec 14 '12 at 14:40
  • @user1895750 - But regex is not a great option. Can you explain why the HAP is such a bad choice for you? Why does it have to be a regex? – Oded Dec 14 '12 at 14:42