2

Hi,

I want a regex that finds website links like these:

www.yahoo.com
yahoo.com
http://www.yahoo.com
http://yahoo.com
yahoo.jp (or any other domain)
http://yahoo.fr

Is there any way to match them all with a regex?

kennytm
pedram

2 Answers

1

I'm going to throw out an alternative here that isn't regex at all. Take a look at the HTML Agility Pack; your case would look something like this:

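// Requires the HtmlAgilityPack package (using HtmlAgilityPack;)
var doc = new HtmlDocument();
doc.Load("file.htm");
// SelectNodes returns null when nothing matches, so check before iterating
var links = doc.DocumentNode.SelectNodes("//a[contains(@href, 'yahoo')]");
if (links != null)
{
  foreach (HtmlNode link in links)
  {
    var href = link.GetAttributeValue("href", "");
    // href is a URL that contains the word `yahoo`; do something with it
  }
}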

It's not really answering the question as you've written it, just something to keep your options open, as regex can run into many other problems when applied to HTML.

Community
Nick Craver
0

This regex from daringfireball.net should be able to do most of what you want. As written, it does not match a bare domain.tld such as yahoo.com (no scheme, no www., no path), since that form is very ambiguous.

(?xi)
\b
(                           # Capture 1: entire matched URL
  (?:
    [a-z][\w-]+:                # URL protocol and colon
    (?:
      /{1,3}                        # 1-3 slashes
      |                             #   or
      [a-z0-9%]                     # Single letter or digit or '%'
                                    # (Trying not to match e.g. "URI::Escape")
    )
    |                           #   or
    www\d{0,3}[.]               # "www.", "www1.", "www2." … "www999."
    |                           #   or
    [a-z0-9.\-]+[.][a-z]{2,4}/  # looks like domain name followed by a slash
  )
  (?:                           # One or more:
    [^\s()<>]+                      # Run of non-space, non-()<>
    |                               #   or
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
  )+
  (?:                           # End with:
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
    |                                   #   or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]        # not a space or one of these punct chars
  )
)

For more specifics about what it does check out http://daringfireball.net/2010/07/improved_regex_for_matching_urls
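
To read the matched text back out in C#, one option is to iterate the MatchCollection returned by Regex.Matches and take each match's first capture group. The following is only a sketch under stated assumptions: the UrlFinder class and the FindUrls method are hypothetical names introduced for illustration, and the pattern is the one above collapsed onto a single line with the comments removed.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the linked article.
public static class UrlFinder
{
    // The pattern above on one line. In a verbatim string, "" is an escaped double quote.
    private const string Pattern =
        @"(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)" +
        @"(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+" +
        @"(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'"".,<>?«»“”‘’]))";

    private static readonly Regex UrlRegex = new Regex(Pattern);

    // Returns the text of every URL-like match, in order of appearance.
    public static List<string> FindUrls(string text)
    {
        var urls = new List<string>();
        foreach (Match m in UrlRegex.Matches(text))
        {
            // Group 1 is the "entire matched URL" capture from the pattern.
            urls.Add(m.Groups[1].Value);
        }
        return urls;
    }
}
```

Because the last character class in the pattern excludes closing punctuation such as `}`, `,` and `.`, calling UrlFinder.FindUrls on "See http://www.yahoo.com, www.yahoo.com or {www.yahooo.com}." yields http://www.yahoo.com, www.yahoo.com and www.yahooo.com, without the surrounding braces or trailing punctuation.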

ase
  • I used that, and it works fine. But one small problem: how can I get the matched text? I used MatchCollection mc18 = Regex.Matches(text, regexOption, RegexOptions.IgnoreCase); What should I do now to get the matched text? Regards – pedram Aug 01 '10 at 11:39
  • Are you looking to replace these occurrences or do you simply wish to find them? – ase Aug 01 '10 at 11:42
  • Also, a question: how can I detect whether the link is between {}, like { www.yahooo.com } or {www.yahooo.com}? Regards – pedram Aug 01 '10 at 11:52