Getting the domain of a URL with Regular Expressions

Question

I'm trying to get the domain of a given URL. For example, http://www.facebook.com/someuser/ should return facebook.com. The given URL can be in any of these formats:

https://www.facebook.com/someuser (www. is optional, but should be ignored)
www.facebook.com/someuser (http:// is not required)
facebook.com/someuser
http://someuser.tumblr.com -> this has to return tumblr.com only

I wrote this regex:

/(?:\.|\/{2})(?:www\.)?([^\/]*)/i

But it does not work as I expect.

I can do this in parts:

Remove http:// and https://, if present on string, with string.delete "/https?:\/\//i".
Remove www. with string.delete "/www\./i".
Get the domain with match and /(\w+\.\w+)+/i

But this won't work with subdomains. Strings for testing:

https://www.facebook.com/username
http://last.fm/user/username
www.google.com
facebook.com/username
http://sub.tumblr.com/
sub.tumblr.com
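
The three steps described above could be sketched roughly as follows. This is only an illustration: it assumes `sub` with a regular expression in place of `String#delete` (which takes character sets rather than patterns), and `naive_domain` is a hypothetical name, not something from the question.

```ruby
# Hypothetical helper illustrating the three steps from the question.
def naive_domain(url)
  stripped = url.sub(/\Ahttps?:\/\//i, '')  # step 1: drop the scheme, if present
  stripped = stripped.sub(/\Awww\./i, '')   # step 2: drop a leading "www."
  stripped[/\w+\.\w+/]                      # step 3: first "word.word" run
end

puts naive_domain('https://www.facebook.com/username') # facebook.com
puts naive_domain('http://last.fm/user/username')      # last.fm
puts naive_domain('http://sub.tumblr.com/')            # sub.tumblr
```

The last call shows the subdomain problem: the first `word.word` run is `sub.tumblr`, not `tumblr.com`.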

I need this to work with the minimum memory and processing cost possible.

Any ideas?

Possible duplicate of http://stackoverflow.com/questions/569137/how-to-get-domain-name-from-url. A quick search also returns this link: http://www.ruby-forum.com/topic/160877 – Fredrik Pihl, Jul 25 '11 at 22:32

score 12 · Accepted Answer · answered Jul 25 '11 at 22:35

Why don't you just use the URI class to do this?

URI.parse( your_uri ).host

And you're done.

Just one thing, if there's no "http://" or "https://" at the beginning of the url, you'll have to add one, or the parse method is not going to give you a host (it's going to be nil).
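
A minimal sketch of that idea, assuming a missing scheme can simply be replaced by "http://" (the helper name `host_of` is hypothetical):

```ruby
require 'uri'

# Hypothetical helper: prepend a scheme when none is present, then ask URI for the host.
def host_of(url)
  url = "http://#{url}" unless url.match?(/\Ahttps?:\/\//i)
  URI.parse(url).host
end

p URI.parse('facebook.com/someuser').host      # nil, because there is no scheme
p host_of('facebook.com/someuser')             # "facebook.com"
p host_of('https://www.facebook.com/someuser') # "www.facebook.com"
p host_of('http://someuser.tumblr.com')        # "someuser.tumblr.com"
```

Note that the result is the full host name, so "www." and any subdomain are still included.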

Maurício Linhares


Unfortunately, this doesn't ignore subdomains. Anyway, I'll be using this solution. – Fábio Perez Jul 25 '11 at 23:10

To ignore subdomains, try `hostname.split('.').last(2).join('.')`. – Lars Haugseth Jul 26 '11 at 13:33

But don't forget you could have arbitrarily deep subdomains, like "me.you.we.business.com". Subdomains can go 127 levels deep. – Maurício Linhares Jul 26 '11 at 13:35

This doesn't get the domain name, this gets the host name. Not sure how this is accepted, it doesn't answer the question. – dev_row Aug 28 '15 at 21:18
This does not work for 'cta-redirect.hubspot.com' :( – Lalit Kumar Maurya Sep 12 '17 at 10:54
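
The `split('.').last(2)` suggestion from the comments could be combined with `URI.parse` along these lines. This is a sketch under the assumption that the registrable domain is always the last two labels, which does not hold for suffixes such as `.co.uk`; `last_two_labels` is a hypothetical name.

```ruby
require 'uri'

# Hypothetical helper: keep only the last two dot-separated labels of the host.
def last_two_labels(url)
  url = "http://#{url}" unless url.match?(/\Ahttps?:\/\//i)
  host = URI.parse(url).host
  return nil if host.nil?            # nothing usable was parsed
  host.split('.').last(2).join('.')
end

p last_two_labels('http://someuser.tumblr.com') # "tumblr.com"
p last_two_labels('me.you.we.business.com')     # "business.com"
p last_two_labels('cta-redirect.hubspot.com')   # "hubspot.com"
p last_two_labels('www.domainname.co.in')       # "co.in", not the intended domain
```

The depth of the subdomain does not matter here, but two-part suffixes need separate handling.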

score 2 · Answer 2 · answered Oct 11 '11 at 22:40

This works for me:

/^h?t?t?p?s?:?\/?\/?w?w?w?\.?(.*\.[A-Z]{2,})+[A-Z\/]/i

It will always give you only the domain part. Take a look at it at: http://rubular.com/r/0hudnJSgVT

To use it, create a method like this. I put it in my helpers so I have access to it in the views.

def website_url(website_url)
  if website_url[/^h?t?t?p?s?:?\/?\/?w?w?w?\.?(.*\.[A-Z\/]{2,})$/i]
    website_id = $1
  end

  %Q{http://#{ website_id }}
end

score 1 · Answer 3 · answered Jul 25 '11 at 22:32

Does it have to be a regex? You could do this also.

require 'uri'
yourURL = URI.parse('https://www.facebook.com/username')
print yourURL.host

citizen conn


Lalit Kumar Maurya · Answer 4 · 2017-09-12T11:39:25.270

I added a method to the String class, using Ruby's open classes, for my purpose.

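require 'uri'

# Needs ActiveSupport for blank? and present?.
class String
  def to_dn
    return '' if blank?
    return split('@').last if include?('@')   # email address: keep the part after '@'
    link = self
    link = "http://#{link}" unless link.match(/^(http:\/\/|https:\/\/)/)
    # URI.encode was removed in Ruby 3.0; as written, this line needs an older Ruby.
    host = URI.parse(URI.encode(link)).host
    link = host.present? ? host : link.strip
    domain_name = link.sub(/.*?www\./, '')    # dot escaped so it matches a literal '.'
    last_two = domain_name.match(/[A-Z]+\.[A-Z]{2,4}$/i)
    domain_name = last_two.to_s if domain_name.split('.').length >= 2 && last_two.present?
    domain_name                               # return the result even if the regex did not match
  end
end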

Examples:

 1. "https://www.facebook.com/someuser".to_dn # => "facebook.com"
 2. "www.facebook.com/someuser".to_dn # => "facebook.com"
 3. "facebook.com/someuser".to_dn # => "facebook.com"
 4. "http://someuser.tumblr.com".to_dn # => "tumblr.com"
 5. "dc.ads.linkedin.com".to_dn # => "linkedin.com"
 6. 'your_name@domain.com'.to_dn # => "domain.com"

It also works for email addresses (which I need for my purpose). I hope it will be useful to others. Correct me if you find anything incorrect :)

Note: It does not work for 'www.domainname.co.in'. I am working on it :)

Paul · Answer 5 · 2011-07-25T23:25:36.840

You could use this regex:

/(\w+\.\w{2,6})(?:\/|$)/
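
As a quick check, the regex can be run against the test strings from the question. This is a sketch only; the constant name `DOMAIN_RE` and the `samples` list are introduced here for illustration.

```ruby
# The regex from this answer; capture group 1 holds the "name.tld" part.
DOMAIN_RE = /(\w+\.\w{2,6})(?:\/|$)/

samples = %w[
  https://www.facebook.com/username
  http://last.fm/user/username
  www.google.com
  facebook.com/username
  http://sub.tumblr.com/
  sub.tumblr.com
]

# String#[] with a regex and a group index returns that capture (or nil).
samples.each { |s| puts "#{s} -> #{s[DOMAIN_RE, 1]}" }
```

Because `\w` does not match a hyphen, a host such as `my-site.com` would come back as `site.com`, and `something.co.uk` would yield `co.uk`.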

edited Jul 25 '11 at 23:25

answered Jul 25 '11 at 22:38


Not sure why you got downvoted, but your answer is technically wrong as you're not escaping your dots. Even if you did escape them, you wouldn't match facebook.com/username or the last.fm example due to that required first dot. – pbaumann Jul 25 '11 at 23:23
@pbaumann, you're right. It was working in my tests but I copied it over wrong. I had to escape the . and remove the first. – Paul Jul 25 '11 at 23:29

score 0 · Answer 6 · answered Jul 25 '11 at 23:27

If you really wanted to use a regex, you could try something along the lines of:

test_string.scan(/\w+\.\w+(?=\/|\s|$)/) { |match| do_stuff_with(match) }

This wouldn't account for domain names such as something.co.uk but it would match everything in your test string.
