Hi,
I want a regex that finds website links like these:
www.yahoo.com
yahoo.com
http://www.yahoo.com
http://yahoo.com
yahoo.jp (or any other domain)
http://yahoo.fr
Is there any way to match them all with a regex?
I'm going to throw out an alternative here, not RegEx at all. Take a look at the HTML Agility Pack, your case would look like this:
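using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.Load("file.htm");
// SelectNodes returns null (not an empty collection) when nothing matches.
var links = doc.DocumentNode.SelectNodes("//a[contains(@href, 'yahoo')]");
if (links != null)
{
    foreach (HtmlNode link in links)
    {
        var href = link.GetAttributeValue("href", string.Empty);
        // href is a URL that contains the word `yahoo`; do something with it
    }
}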
It's not really answering the question as you've written it, just something to keep your options open, as RegEx can have many other problems when applied against HTML.
This regex from daringfireball.net should do most of what you want. I'm unsure about the bare domain.tld form (no protocol, no www. prefix, no trailing slash), since that is very ambiguous.
(?xi)
\b
(                           # Capture 1: entire matched URL
  (?:
    [a-z][\w-]+:                # URL protocol and colon
    (?:
      /{1,3}                        # 1-3 slashes
      |                             #   or
      [a-z0-9%]                     # Single letter or digit or '%'
                                    # (Trying not to match e.g. "URI::Escape")
    )
    |                           #   or
    www\d{0,3}[.]               # "www.", "www1.", "www2." … "www999."
    |                           #   or
    [a-z0-9.\-]+[.][a-z]{2,4}/  # looks like domain name followed by a slash
  )
  (?:                           # One or more:
    [^\s()<>]+                      # Run of non-space, non-()<>
    |                               #   or
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
  )+
  (?:                           # End with:
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
    |                                   #   or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]        # not a space or one of these punct chars
  )
)
For more specifics about what it does, check out http://daringfireball.net/2010/07/improved_regex_for_matching_urls
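
To see how the pattern treats the sample links from the question, here is a minimal sketch in Python. The choice of Python is an assumption, since the thread does not say which regex engine is in use. The names `URL_PATTERN` and `find_urls` are made up for illustration, and the flags are passed explicitly rather than through the inline `(?xi)` prefix.

```python
import re

# The daringfireball.net pattern quoted above, with comments trimmed.
# re.VERBOSE and re.IGNORECASE stand in for the inline (?xi) prefix.
URL_PATTERN = re.compile(r"""
\b
(                                   # group 1: the whole URL
  (?:
    [a-z][\w-]+:                    # protocol and colon
    (?: /{1,3} | [a-z0-9%] )        # 1-3 slashes, or a letter/digit/'%'
    |
    www\d{0,3}[.]                   # "www.", "www1." ... "www999."
    |
    [a-z0-9.\-]+[.][a-z]{2,4}/      # domain name followed by a slash
  )
  (?:
    [^\s()<>]+                      # run of non-space, non-()<>
    |
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)   # balanced parens, up to 2 levels
  )+
  (?:
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)   # balanced parens, up to 2 levels
    |
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]       # not a space or trailing punctuation
  )
)
""", re.VERBOSE | re.IGNORECASE)


def find_urls(text: str) -> list[str]:
    """Return every substring of `text` that the pattern accepts as a URL."""
    # finditer + group(1) is used because findall would return tuples
    # for a pattern with several capture groups.
    return [m.group(1) for m in URL_PATTERN.finditer(text)]


if __name__ == "__main__":
    # Sample inputs taken from the question, plus one with a path.
    samples = [
        "www.yahoo.com",
        "yahoo.com",
        "http://www.yahoo.com",
        "http://yahoo.com",
        "yahoo.jp",
        "http://yahoo.fr",
        "yahoo.com/news",
    ]
    for sample in samples:
        print(sample, "->", find_urls(sample))
```

Under these assumptions, the samples that start with a protocol or with www. are matched in full, and yahoo.com/news is matched because of the slash. Bare yahoo.com and yahoo.jp are not matched, which is the domain.tld ambiguity mentioned above. Catching those as well would need an extra, looser rule (for example a list of known TLDs), at the cost of false positives on ordinary text such as file names.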