
I want to extract URLs from the href attributes of a webpage. For that I'm using the regex pattern "(?(http:[/][/]|www.)([a-z]|[A-Z]|[0-9]|[/.]|[~])*)"

To extract the href from the HTML I used this pattern: @"href=\""(?[^\""#]?(?=[\""#]))(?(?#{2}[^#]?#{2})*)(?#[^""]+)?"""

The problem is that it does not extract URLs containing hyphens correctly. For an href like "www.seo-sem.com", the result I get is only "www.seo" — it gets truncated at the hyphen. Could you suggest a better regex pattern to extract URLs from href attributes? Thanks.

jaskirat
    Don't use regex to parse HTML. Find a simple library like HTMLAgilityPack and use that. – Stephan May 10 '10 at 17:55
  • No one posted the link yet? :) – Davor Lucic May 10 '10 at 17:56
  • Even for basic URI matching the regular expression needed is *Ugly* (yes, capital U). – Joey May 10 '10 at 17:57
  • @rebus, well, it's not so much HTML parsing, actually. It doesn't try to do anything with the actual *structure* of the document. For simply grabbing anything that looks like `href='url'` regex may just be appropriate enough. – Joey May 10 '10 at 17:58
  • (http://|https://)?([\w.-]+)?([\w-]+\.[\w-]+) with `\2` and `\3` backrefs referencing subdomains and domain respectively would help probably, but by no means would it catch all possible domain names out there. – Davor Lucic May 10 '10 at 18:25
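As the truncation at the hyphen suggests, the immediate bug is that the question's character class `([a-z]|[A-Z]|[0-9]|[/.]|[~])` never includes `-`, so matching stops at the first hyphen. A minimal sketch of the fix (in Python here rather than C#, since the character-class change is the same in both regex dialects; the pattern and helper name are illustrative, not from the original post):

```python
import re

# The original class ([a-z]|[A-Z]|[0-9]|[/.]|[~]) omits '-', so a match
# against "www.seo-sem.com" stops at "www.seo". Adding '-' to the class
# (placed last so it is literal) lets the match continue past hyphens.
URL_PATTERN = re.compile(r"(?:http://|www\.)[A-Za-z0-9/.~-]*")

def extract_urls(text):
    """Return every http:// or www.-prefixed URL-like token in the text."""
    return URL_PATTERN.findall(text)

print(extract_urls('see www.seo-sem.com now'))
```

This only patches the reported symptom; as the comments note, a character class like this still misses many legal URL characters (query strings, percent-escapes, underscores), which is one reason a real HTML parser is the better tool.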

1 Answer


Use the HTML Agility Pack to parse your HTML. You can query it using XPath, as it parses the HTML into an XmlDocument-like object.

See this for reasons not to parse HTML with regular expressions.
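HTML Agility Pack is a .NET library, but the parse-then-query approach it embodies can be sketched with Python's standard-library `HTMLParser` as a rough analogue (the class and sample markup below are illustrative, not from the answer). Because the parser hands you the decoded attribute value directly, hyphens, fragments, and quoting inside the href need no special regex handling:

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, already unescaped.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

collector = HrefCollector()
collector.feed('<p><a href="http://www.seo-sem.com/">SEO</a></p>')
print(collector.hrefs)
```

The HTML Agility Pack equivalent would load the document and select nodes with an XPath query such as `//a[@href]`, reading each node's `href` attribute value.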

Oded