
I want to extract URLs from the href attributes of a webpage. For that I'm using the regex pattern "(?(http:[/][/]|www.)([a-z]|[A-Z]|[0-9]|[/.]|[~])*)"

To extract the href from the HTML I used this pattern: @"href=\""(?[^\""#]?(?=[\""#]))(?(?#{2}[^#]?#{2})*)(?#[^""]+)?"""

The problem is that it does not extract URLs containing hyphens correctly. For an href like "www.seo-sem.com", the result I get is only "www.seo" — it gets truncated at the hyphen. Could you suggest a better regex pattern to extract URLs from href attributes? Thanks.

jaskirat
    Don't use regex to parse HTML. Find a simple library like HTMLAgilityPack and use that. – Stephan May 10 '10 at 17:55
  • No one posted the link yet? :) – Davor Lucic May 10 '10 at 17:56
  • Even for basic URI matching the regular expression needed is *Ugly* (yes, capital U). – Joey May 10 '10 at 17:57
  • @rebus, well, it's not so much HTML parsing, actually. It doesn't try to do anything with the actual *structure* of the document. For simply grabbing anything that looks like `href='url'` regex may just be appropriate enough. – Joey May 10 '10 at 17:58
  • (http://|https://)?([\w.-]+)?([\w-]+\.[\w-]+) with `\2` and `\3` backrefs referencing subdomains and domain respectively would help probably, but by no means would it catch all possible domain names out there. – Davor Lucic May 10 '10 at 18:25
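As the truncation at the hyphen suggests, the immediate bug is that the question's character class `([a-z]|[A-Z]|[0-9]|[/.]|[~])` never includes `-`, so matching stops at the first hyphen. A minimal sketch of the fix (in Python here rather than C#, since the character-class change is the same in both regex dialects; the pattern and helper name are illustrative, not from the original post):

```python
import re

# The original class ([a-z]|[A-Z]|[0-9]|[/.]|[~]) omits '-', so a match
# against "www.seo-sem.com" stops at "www.seo". Adding '-' to the class
# (placed last so it is literal) lets the match continue past hyphens.
URL_PATTERN = re.compile(r"(?:http://|www\.)[A-Za-z0-9/.~-]*")

def extract_urls(text):
    """Return every http:// or www.-prefixed URL-like token in the text."""
    return URL_PATTERN.findall(text)

print(extract_urls('see www.seo-sem.com now'))
```

This only patches the reported symptom; as the comments note, a character class like this still misses many legal URL characters (query strings, percent-escapes, underscores), which is one reason a real HTML parser is the better tool.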

1 Answer


Use the HTML Agility Pack to parse your HTML. You can query it using XPath, as it parses the HTML into an XmlDocument-like object.

See this for reasons not to parse HTML with regular expressions.
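HTML Agility Pack is a .NET library, but the parse-then-query approach it embodies can be sketched with Python's standard-library `HTMLParser` as a rough analogue (the class and sample markup below are illustrative, not from the answer). Because the parser hands you the decoded attribute value directly, hyphens, fragments, and quoting inside the href need no special regex handling:

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, already unescaped.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

collector = HrefCollector()
collector.feed('<p><a href="http://www.seo-sem.com/">SEO</a></p>')
print(collector.hrefs)
```

The HTML Agility Pack equivalent would load the document and select nodes with an XPath query such as `//a[@href]`, reading each node's `href` attribute value.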

Oded