6

I am trying to construct a regex to extract the domain from a given URL.

for:

http://www.abc.google.com/
http://abc.google.com/
https://www.abc.google.com/
http://abc.google.com/

should give:

abc.google.com
the Tin Man
anusuya

4 Answers

25
require 'uri'

URI.parse('http://www.abc.google.com/').host
#=> "www.abc.google.com"

Not a regex, but probably more robust than anything we come up with here.

URI.parse('http://www.abc.google.com/').host.gsub(/^www\./, '')

If you also want to remove the leading www., this works, and it does not raise an error when the www. is absent.
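As a minimal sketch, the two steps above could be wrapped in one method. The name host_without_www is hypothetical, and returning nil for input that cannot be parsed or has no host is an assumption about what the caller wants, not something the question specifies.

```ruby
require 'uri'

# Hypothetical helper: returns the host with a leading "www." removed,
# or nil when the string is not a parseable URI or has no host part.
def host_without_www(url)
  host = URI.parse(url).host
  # sub with \A only touches a "www." at the very start of the host.
  host && host.sub(/\Awww\./, '')
rescue URI::InvalidURIError
  nil
end

puts host_without_www('http://www.abc.google.com/')
puts host_without_www('https://abc.google.com/')
puts host_without_www('http://www.google.com/')
```

Using sub instead of gsub makes no difference here, since the pattern can match at most once, but it states the intent more directly.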

Alex Wayne
1

Don't know much about Ruby, but this regex pattern gives you the last 3 parts of the URL, excluding the trailing slash, with a minimum of 2 characters per part.

([\w-]{2,}\.[\w-]{2,}\.[\w-]{2,})/$
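A rough sketch of how that pattern behaves in Ruby, assuming it is applied to the whole URL string. The constant and method names are hypothetical. Note that it requires exactly the shape described: three dot-separated parts followed by a trailing slash at the end of the line.

```ruby
# The pattern from above, written with %r{} so the slash needs no escaping.
LAST_THREE_PARTS = %r{([\w-]{2,}\.[\w-]{2,}\.[\w-]{2,})/$}

# Hypothetical helper: returns the captured three parts, or nil on no match.
def last_three_parts(url)
  match = LAST_THREE_PARTS.match(url)
  match && match[1]
end

puts last_three_parts('http://www.abc.google.com/').inspect
puts last_three_parts('http://abc.google.com/').inspect
puts last_three_parts('http://www.google.com/').inspect  # keeps the www
puts last_three_parts('http://google.com/').inspect      # only two parts
puts last_three_parts('http://abc.google.com').inspect   # no trailing slash
```

The last three calls show the limits: a three-part host that starts with www keeps the www, and URLs with only two parts or without the trailing slash do not match at all.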
Fabian
0

You may be able to use the domain_name gem for this kind of work. From the README:

require "domain_name"
host = DomainName("a.b.example.co.uk")
host.domain         #=> "example.co.uk"
subelsky
-1

Your question is a little bit vague. Can you give a precise specification of what exactly you want to do? (Preferably with a test suite.) Right now, all your question says is that you want a method that always returns 'abc.google.com'. That's easy:

def extract_domain
  return 'abc.google.com'
end

But that's probably not what you meant …

Also, you say that you need a Regexp. Why? What's wrong with, for example, using the URI class? After all, parsing and manipulating URIs is exactly what it was made for!

require 'uri'

URI.parse('https://abc.google.com/').host # => 'abc.google.com'

And lastly, you say you are "trying to extract a domain", but you never specify what you mean by "domain". It looks like you sometimes mean the FQDN and sometimes randomly drop parts of the FQDN, but according to what rules? For example, for the FQDN abc.google.com, the domain name is google.com and the host name is abc, but you want it to return abc.google.com, which is not just the domain name but the full FQDN. Why?

Jörg W Mittag
  • I might have framed the question wrongly. What I am trying to do is just remove the leading "http://www." and everything after ".com", so "http://www.google.com/" should give google.com and "http://www.abc.google.com/" should return abc.google.com – anusuya Jul 24 '10 at 09:01
  • Why do you want to get abc.google.com for http://abc.google.com/ but google.com for http://www.google.com/ ? What makes the 'www' special? It is just a convention that HTTP servers are usually on the host named www, but it doesn't have to be that way. – Jürgen Steinblock Jul 24 '10 at 09:07
  • Yeah. I use a web service which strips off the http and www parts of the site name. To compare the results, I need to do the same first. – anusuya Jul 24 '10 at 09:18
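Given that clarification, a regex-only version might look like the sketch below. It assumes only http and https URLs, treats only a leading "www." as special, and stops the host at the first slash, colon, question mark, or hash. The constant and method names are hypothetical, and the web service's actual stripping rules are not known here, so this may not match it exactly.

```ruby
# Skip the scheme and an optional leading "www.", then capture the host.
SCHEME_AND_WWW = %r{\Ahttps?://(?:www\.)?([^/?#:]+)}i

# Hypothetical helper: returns the captured host, or nil when the
# string does not start with http:// or https://.
def strip_scheme_and_www(url)
  url[SCHEME_AND_WWW, 1]
end

puts strip_scheme_and_www('http://www.google.com/').inspect
puts strip_scheme_and_www('http://www.abc.google.com/').inspect
puts strip_scheme_and_www('https://abc.google.com/path?q=1').inspect
puts strip_scheme_and_www('ftp://abc.google.com/').inspect
```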