I know there are an infinite number of threads asking this question, but I have not been able to find one that can help me with this.

I am trying to parse a list of around 10,000,000 URLs, check that each one is valid per the criteria below, and then extract its root domain. The list contains just about everything you can imagine, including entries like these (each shown with the expected result):

bit.ly/test [VALID] [return - bit.ly]
example.com/apples?test=1&id=4 [VALID] [return - example.com]
host101.wow404.apples.test.com/cert/blah [VALID] [return - test.com]
101.121.44.xxx [**inVALID**] [return false]
localhost/noway [**inVALID**] [return false]
www.awesome.com [VALID] [return - awesome.com]
i am so awesome [**inVALID**] [return false]
http://404.mynewsite.com/visits/page/view/1/ [VALID] [return - mynewsite.com]
www1.151.com/searchresults [VALID] [return - 151.com]

Does anyone have any suggestions for this?

Rohit Chopra
  • You're not really validating anything with the criteria given. Do you also want to do a WHOIS lookup to see if the domain actually exists? – Brad May 03 '12 at 16:26
  • See [here][1] [1]: http://stackoverflow.com/questions/206059/php-validation-regex-for-url – yAnTar May 03 '12 at 16:27
  • What exactly are you going for? `localhost` **is** a valid URL. `someverylongdomainnamethatprobablydoesntexist.com` also is, but probably doesn't exist. – Dennis May 03 '12 at 16:27
  • @yAnTar: Syntax for links in comments is `[link text](URL)`. – Dennis May 03 '12 at 16:28
  • *"I have not been able to find one that can help me with this."* - You have not looked hard enough. – Tomalak May 03 '12 at 16:48

4 Answers

15
^(?:https?://)?(?:[a-z0-9-]+\.)*((?:[a-z0-9-]+\.)[a-z]+)

Explanation

^                # start-of-line
(?:              # begin non-capturing group
  https?         #   "http" or "https"
  ://            #   "://"
)?               # end non-capturing group, make optional
(?:              # start non-capturing group
  [a-z0-9-]+\.   #   a name part (numbers, ASCII letters, dashes) & a dot
)*               # end non-capturing group, match as often as possible
(                # begin group 1 (this will be the domain name)
  (?:            #   start non-capturing group
    [a-z0-9-]+\. #     a name part, same as above
  )              #   end non-capturing group
  [a-z]+         #   the TLD
)                # end group 1 

http://rubular.com/r/g6s9bQpNnC
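
For readers who want to try the pattern from PHP, here is a minimal sketch. The function name `rootDomain` is hypothetical, and the `~` delimiters, the `i` flag, and the lowercasing of the result are my own assumptions rather than part of the pattern above.

```php
<?php
/**
 * Hypothetical helper: returns the captured "root domain" or false.
 * Uses the pattern from this answer, wrapped in ~ delimiters so the
 * slashes in "://" need no escaping. The "i" flag is an added assumption.
 */
function rootDomain(string $url)
{
    $pattern = '~^(?:https?://)?(?:[a-z0-9-]+\.)*((?:[a-z0-9-]+\.)[a-z]+)~i';

    if (preg_match($pattern, trim($url), $m) === 1) {
        return strtolower($m[1]); // group 1 holds "name.tld"
    }
    return false;
}

$samples = [
    'bit.ly/test',
    'host101.wow404.apples.test.com/cert/blah',
    'localhost/noway',
    'i am so awesome',
    'http://404.mynewsite.com/visits/page/view/1/',
];

foreach ($samples as $sample) {
    var_dump(rootDomain($sample));
}
```

Two limits are worth knowing. The pattern still matches `101.121.44.xxx` and returns `44.xxx`, so IP-like input needs a separate check. It also always keeps exactly two labels, so a host such as `www.example.co.uk` comes back as `co.uk`.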

Tomalak
  • Thank you for this. Love the explanation. – Rohit Chopra May 03 '12 at 18:01
  • For readers, keep in mind that urls can have non-ascii characters. This regex won't match `http://myurl.com/?utf8=✓` see (http://rubular.com/r/I4fvV3VHVT). Adding the utf8 parameter is a trick used for forcing utf8 encoding in older browsers, see (http://programmers.stackexchange.com/questions/168751/is-the-use-of-utf8-preferable-to-utf8-true) – Dana the Sane Oct 30 '12 at 14:54
  • @DanatheSane You are absolutely right. In fact, something more well thought-out like [Daring Fireball: A Liberal, Accurate Regex Pattern for Matching URLs](http://daringfireball.net/2009/11/liberal_regex_for_matching_urls) should be used. – Tomalak Oct 30 '12 at 15:09
  • Thanks for the link, comprehensive solutions to this problem seem hard to come by. – Dana the Sane Oct 30 '12 at 15:14
2

I would start with the default:

filter_var($inputUrl, FILTER_VALIDATE_URL);

Then add further checks for your special cases that should not be accepted. This should simplify things a bit.

As for getting the host:

parse_url($inputUrl, PHP_URL_HOST);
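
One caveat when combining the two calls: `FILTER_VALIDATE_URL` requires a scheme, and `parse_url()` treats a scheme-less string such as `example.com/apples` as a path, so most of the sample inputs in the question would fail as they stand. The sketch below works around this by prepending `http://` when no scheme is present. That default and the function name `hostOf` are assumptions of mine, not part of the answer.

```php
<?php
/**
 * Hypothetical helper: validates the input with FILTER_VALIDATE_URL and
 * returns its host, or false when validation fails.
 */
function hostOf(string $input)
{
    $url = trim($input);

    // Assumption: input without a scheme is meant as an HTTP URL.
    if (!preg_match('~^[a-z][a-z0-9+.-]*://~i', $url)) {
        $url = 'http://' . $url;
    }

    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }

    // parse_url() returns null when the component is missing.
    $host = parse_url($url, PHP_URL_HOST);
    return is_string($host) ? strtolower($host) : false;
}

var_dump(hostOf('example.com/apples?test=1&id=4'));
var_dump(hostOf('http://404.mynewsite.com/visits/page/view/1/'));
var_dump(hostOf('i am so awesome'));
```

This yields the full host (`404.mynewsite.com`), not the root domain, and names such as `localhost` should still be expected to pass, so the special-case checks mentioned above remain necessary.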
dqhendricks
  • @RohitChopra that is absolutely not true. FILTER_VALIDATE_URL validates based on the RFC 2396 specification for valid URLs. http://www.faqs.org/rfcs/rfc2396.html – dqhendricks May 03 '12 at 17:48
  • There are also two optional flags you can use with this validator, FILTER_FLAG_PATH_REQUIRED and FILTER_FLAG_QUERY_REQUIRED. – dqhendricks May 03 '12 at 17:49
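
A short sketch of how those two flags change the result, for anyone who wants to try them. The wrapper name `urlPasses` is hypothetical; the flags are passed as the third argument to `filter_var()`.

```php
<?php
/**
 * Hypothetical wrapper: true when $url passes FILTER_VALIDATE_URL
 * together with any extra flags.
 */
function urlPasses(string $url, int $flags = 0): bool
{
    return filter_var($url, FILTER_VALIDATE_URL, $flags) !== false;
}

// No path component, so PATH_REQUIRED rejects it.
var_dump(urlPasses('http://example.com', FILTER_FLAG_PATH_REQUIRED));

// A trailing slash counts as a path.
var_dump(urlPasses('http://example.com/', FILTER_FLAG_PATH_REQUIRED));

// QUERY_REQUIRED needs a query string after "?".
var_dump(urlPasses('http://example.com/', FILTER_FLAG_QUERY_REQUIRED));
var_dump(urlPasses('http://example.com/?a=1', FILTER_FLAG_QUERY_REQUIRED));
```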
0

^(([a-zA-Z]+(\.[a-zA-Z]+)+)|([0-9]{1,3}(\.[0-9]{1,3}){3}))/.*$

edit

In PHP that would be preg_match('#^(([a-zA-Z]+(\.[a-zA-Z]+)+)|([0-9]{1,3}(\.[0-9]{1,3}){3}))/.*$#', $myUrl, $matches), where $myUrl is a single URL string from the list.

What you need would be in $matches[1]

JNF
  • Domain names may contain other characters than just latin symbols. This regexp fails even with `www1.151.com` mentioned in the question – galymzhan May 03 '12 at 16:38
0
// test_input() is a sanitizing helper that is not defined in this answer.
$website = test_input($_POST["website"]);
// Flag the value when it contains no substring starting with
// http://, https://, ftp:// or www.
if (!preg_match("/\b(?:(?:https?|ftp):\/\/|www\.)[-a-z0-9+&@#\/%?=~_|!:,.;]*[-a-z0-9+&@#\/%=~_|]/i",$website))
  {
  $websiteErr = "Invalid URL";
  }