6

I'm trying to get the domain of a given URL. For example, http://www.facebook.com/someuser/ should return facebook.com. The given URL can be in any of these formats:

  1. https://www.facebook.com/someuser (www. is optional, but should be ignored)
  2. www.facebook.com/someuser (http:// is not required)
  3. facebook.com/someuser
  4. http://someuser.tumblr.com -> this has to return tumblr.com only

I wrote this regex:

/(?: \.|\/{2})(?: www\.)?([^\/]*)/i

But it does not work as I expect.

I can do this in parts:

  1. Remove http:// and https://, if present on string, with string.delete "/https?:\/\//i".
  2. Remove www. with string.delete "/www\./i".
  3. Get the domain with match and /(\w+\.\w+)+/i
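The three steps above can be sketched as follows. This is only an illustration: `naive_domain` is a hypothetical name, and `String#sub` stands in for `String#delete`, because `delete` removes sets of characters rather than regex matches.

```ruby
# Hypothetical helper that follows the three steps literally.
def naive_domain(url)
  stripped = url.sub(/\Ahttps?:\/\//i, '') # step 1: drop the scheme
  stripped = stripped.sub(/\Awww\./i, '')  # step 2: drop a leading "www."
  stripped[/(\w+\.\w+)+/i]                 # step 3: first "word.word" run
end

naive_domain('https://www.facebook.com/username') # => "facebook.com"
naive_domain('http://last.fm/user/username')      # => "last.fm"
naive_domain('sub.tumblr.com')                    # => "sub.tumblr"
```

The last call shows the problem: the pattern keeps the first two labels, not the last two.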

But this won't work with subdomains. Strings for testing:

https://www.facebook.com/username
http://last.fm/user/username
www.google.com
facebook.com/username
http://sub.tumblr.com/
sub.tumblr.com

I need this to work with as little memory and processing cost as possible.

Any ideas?

JosephRuby
Fábio Perez
  • possible duplicate of http://stackoverflow.com/questions/569137/how-to-get-domain-name-from-url a quick googling returns this link http://www.ruby-forum.com/topic/160877 – Fredrik Pihl Jul 25 '11 at 22:32

6 Answers

12

Why don't you just use the URI class to do this?

URI.parse( your_uri ).host

And you're done.

One caveat: if the URL doesn't start with "http://" or "https://", you'll have to add one, otherwise the parse method won't give you a host (it will be nil).
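A minimal sketch of that caveat, assuming a hypothetical helper named `host_of` that adds "http://" when no scheme is present:

```ruby
require 'uri'

# Hypothetical helper: returns the host of a URL, adding a scheme when missing.
def host_of(url)
  url = "http://#{url}" unless url =~ %r{\Ahttps?://}i
  URI.parse(url).host
end

URI.parse('www.google.com').host             # => nil (no scheme, so no host)
host_of('www.google.com')                    # => "www.google.com"
host_of('https://www.facebook.com/username') # => "www.facebook.com"
```

Note that the host still includes "www." and any subdomain, so it needs further trimming to match what the question asks for.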

Maurício Linhares
2

This works for me:

/^h?t?t?p?s?:?\/?\/?w?w?w?\.?(.*\.[A-Z]{2,})+[A-Z\/]/i

It will always give you only the domain part. Take a look at it at: http://rubular.com/r/0hudnJSgVT

To use it, create a method like this. I put it in my helpers so I have access to it in the views.

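def website_url(website_url)
  # $1 holds the first capture group when the pattern matches.
  if website_url[/^h?t?t?p?s?:?\/?\/?w?w?w?\.?(.*\.[A-Z\/]{2,})$/i]
    website_id = $1
  end

  # website_id stays nil if nothing matched, which yields just "http://".
  %Q{http://#{ website_id }}
end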
gabo
1

Does it have to be a regex? You could also do this:

require 'uri'
yourURL = URI.parse('https://www.facebook.com/username')
print yourURL.host
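
The host returned above is "www.facebook.com", not "facebook.com". A rough sketch of trimming it, assuming the wanted domain is always the last two dot-separated labels (this does not hold for names like something.co.uk) and using a hypothetical helper named `base_domain`:

```ruby
require 'uri'

# Hypothetical helper: keeps only the last two labels of the host.
def base_domain(url)
  host = URI.parse(url).host
  return nil if host.nil? # no scheme in the input, so no host was parsed
  host.split('.').last(2).join('.')
end

base_domain('https://www.facebook.com/username') # => "facebook.com"
base_domain('http://sub.tumblr.com/')            # => "tumblr.com"
```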
citizen conn
0

I created a method on the String class, using the open classes technique, for my own purposes.

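require 'uri'
require 'active_support/core_ext/object/blank' # provides blank?, present? and presence

class String
  def to_dn
    return '' if blank?
    # Treat the input as an email address and keep the part after the last "@".
    return split('@').last if include?('@')
    link = strip
    link = "http://#{link}" unless link.match(/\Ahttps?:\/\//i)
    # URI.encode was removed in Ruby 3.0, so parse directly and fall back to the raw string.
    host = (URI.parse(link).host rescue nil)
    domain_name = (host.presence || link).sub(/\Awww\./i, '')
    # Keep only the last two labels, e.g. "dc.ads.linkedin.com" -> "linkedin.com".
    short = domain_name[/[A-Z]+\.[A-Z]{2,4}$/i]
    # Return the full name when the pattern does not match, instead of nil.
    short || domain_name
  end
end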

Example:

 1. "https://www.facebook.com/someuser".to_dn = "facebook.com"
 2. "www.facebook.com/someuser".to_dn = "facebook.com"
 3. "facebook.com/someuser".to_dn = "facebook.com"
 4. "http://someuser.tumblr.com".to_dn = "tumblr.com" 
 5. "dc.ads.linkedin.com".to_dn = "linkedin.com" 
 6. 'your_name@domain.com'.to_dn = "domain.com"

It also works for email addresses (which I required for my purpose). I hope it will be useful to others. Correct me if you find anything incorrect :)

Note: It will not work for 'www.domainname.co.in'. I am working on it :)
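
One possible way around the limitation noted above, sketched under a loud assumption: `TWO_PART_SUFFIXES` is a hypothetical, hand-picked and deliberately incomplete list, and `registrable_domain` is a hypothetical helper name. A complete solution would need a maintained source such as the Public Suffix List.

```ruby
# Hypothetical, deliberately incomplete list of two-label suffixes.
TWO_PART_SUFFIXES = %w[co.in co.uk com.br].freeze

def registrable_domain(host)
  labels = host.downcase.split('.')
  # Take three labels when the last two form a known suffix, otherwise two.
  count = TWO_PART_SUFFIXES.include?(labels.last(2).join('.')) ? 3 : 2
  labels.last(count).join('.')
end

registrable_domain('www.domainname.co.in') # => "domainname.co.in"
registrable_domain('dc.ads.linkedin.com')  # => "linkedin.com"
```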

Lalit Kumar Maurya
0

You could use this regex:

/(\w+\.\w{2,6})(?:\/|$)/
Paul
  • Not sure why you got downvoted, but your answer is technically wrong as you're not escaping your dots. Even if you did escape them, you wouldn't match facebook.com/username or the last.fm example due to that required first dot. – pbaumann Jul 25 '11 at 23:23
  • @pbaumann, you're right. It was working in my tests but I copied it over wrong. I had to escape the . and remove the first. – Paul Jul 25 '11 at 23:29
0

If you really wanted to use a regex, you could try something along the lines of:

test_string.scan(/\w+\.\w+(?=\/|\s|$)/) { |match| do_stuff_with(match) }

This wouldn't account for domain names such as something.co.uk but it would match everything in your test string.
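
As a rough check, here is the same pattern run against the test strings from the question. The variable names `test_string` and `domains` are only for illustration.

```ruby
# The test strings from the question, one per line.
test_string = <<~TEXT
  https://www.facebook.com/username
  http://last.fm/user/username
  www.google.com
  facebook.com/username
  http://sub.tumblr.com/
  sub.tumblr.com
TEXT

# Collect every "word.word" that is followed by a slash, whitespace, or end of line.
domains = test_string.scan(/\w+\.\w+(?=\/|\s|$)/)
# => ["facebook.com", "last.fm", "google.com", "facebook.com", "tumblr.com", "tumblr.com"]

# The limitation mentioned above: only the last two labels are kept.
'http://something.co.uk/'.scan(/\w+\.\w+(?=\/|\s|$)/) # => ["co.uk"]
```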

pbaumann