0

I'm trying to reduce a bunch of URLs to their base URL, i.e. the domain without the www or any other prefix. I'm having trouble writing an expression to capture it, and with a subset of TLDs (such as co.uk) it becomes a rather more complicated issue.

answers.yahoo.com => yahoo.com
www.google.com => google.com
uk.answers.yahoo.co.uk => yahoo.co.uk
www.g.se => g.se

Any suggestions?

I was using this expression, but it breaks when the domain name is no longer than 2 characters or when the TLD is shorter than 2 characters.

(?P<domain>[a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$
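
To make the failure concrete, here is a minimal sketch that runs the expression above against the four sample hosts. It assumes Python's `re` module (the `(?P<domain>...)` syntax suggests Python, but that is a guess), and the helper name `extract` is only for illustration.

```python
import re

# The expression from the question, unchanged.
PATTERN = re.compile(r"(?P<domain>[a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$")

def extract(host):
    """Return the 'domain' group of the first match, or None if nothing matches."""
    m = PATTERN.search(host)
    return m.group("domain") if m else None

for host in ("answers.yahoo.com", "www.google.com",
             "uk.answers.yahoo.co.uk", "www.g.se"):
    print(host, "=>", extract(host))
```

The first three hosts come out as expected. For `www.g.se` the result is `www.g.se` rather than `g.se`: the first part of the expression needs at least two characters, so it cannot start at the single-letter label `g`, and the `[a-z\.]{2,6}` tail happily swallows `g.se` as if it were the TLD.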
Air In
  • You're going to need a list of TLDs. The good news is that a list of TLDs is maintained at http://publicsuffix.org. See this excellent question and answers: [Get the subdomain from an URL](http://stackoverflow.com/questions/288810/get-the-subdomain-from-a-url) – Li-aung Yip Apr 26 '12 at 02:54
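
The suffix-list idea from the comment above can be sketched roughly as follows. This is not the real Public Suffix List: `KNOWN_SUFFIXES` is a tiny hand-written stand-in with just enough entries for the examples in the question, and `base_domain` is a hypothetical helper name. The real list also has wildcard and exception rules, which this sketch ignores.

```python
# Tiny stand-in for the real suffix list; these entries are illustrative only.
KNOWN_SUFFIXES = {"com", "se", "uk", "co.uk"}

def base_domain(host, suffixes=KNOWN_SUFFIXES):
    """Return the longest known suffix plus the one label before it, or None."""
    labels = host.lower().rstrip(".").split(".")
    # Walk from the longest candidate suffix to the shortest,
    # so "co.uk" is tried before "uk".
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            if i == 0:
                return None  # the host is itself a bare suffix
            return ".".join(labels[i - 1:])
    return None  # no known suffix at all

for host in ("answers.yahoo.com", "www.google.com",
             "uk.answers.yahoo.co.uk", "www.g.se"):
    print(host, "=>", base_domain(host))
```

Because the lookup is driven by the suffix set rather than by label lengths, short names like `g.se` and multi-label suffixes like `co.uk` need no special handling in the code itself.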

2 Answers

1

How do you know that the base of uk.answers.yahoo.co.uk is yahoo.co.uk, but the base of, for example, foo.bar.maps.google.com isn't maps.google.com?

Venge
  • Because domain names can't have "." in them. – Air In Apr 26 '12 at 02:13
  • I'm not sure what you mean. The domain name of this site is stackoverflow.com, which has a . in it. – Venge Apr 26 '12 at 02:14
  • I'm not sure if I'm using the correct terminology, but "stackoverflow" is the domain, "com" is the TLD. – Air In Apr 26 '12 at 02:17
  • You're misusing terminology; the individual parts of a domain are known as labels, and labels cannot contain periods. Basically, my question is: why is maps.google.com different from yahoo.co.uk? At a technical level there is no difference between co.uk and google.com. – Venge Apr 26 '12 at 02:23
  • In correct terminology, what I need to extract is the "second-level domain." – Air In Apr 26 '12 at 02:27
  • According to Wikipedia, for the URL http://www.example.net/index.html: top-level domain name: net; second-level domain name: example.net; host name: www.example.net. But some googling seems to say co.uk is a second-level domain, so I'm a bit confused. – Air In Apr 26 '12 at 02:28
  • So what I want seems to be the second- or third-level domain, depending on whether it is a TLD or not. – Air In Apr 26 '12 at 02:32
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/10521/discussion-between-patrick-and-arun-force) – Venge Apr 26 '12 at 02:32
1
[^\.]*\.(?:co\.uk|\w{2,3})$

You'll need to add known domains to the regex.

http://regexr.com?30p4r

Jack
  • Adding a list of domains is not preferable, especially since the list seems to be ever increasing and companies can now register their own. – Air In Apr 26 '12 at 02:20
  • Well then, how about adding only exceptions like `co.uk` to the list and matching 2-3 characters otherwise? `[^\.]*\.(?:co\.uk|\w{2,3})$` Is this just a list of URLs, one per line, or are they embedded in some text? If in text, provide some examples of how they appear. – Jack Apr 26 '12 at 02:24
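
As a rough sketch of the exceptions-only idea, the expression can be built from a short list of multi-label suffixes, with `\w{2,3}` as the fallback. Python is assumed here, the helper names `build_pattern` and `extract` are hypothetical, and the dot in each exception is escaped so that it only matches a literal period.

```python
import re

def build_pattern(exceptions=("co.uk",)):
    """Compile the regex with the given multi-label suffixes as alternatives."""
    alternatives = "|".join(re.escape(s) for s in exceptions)
    return re.compile(r"[^.]*\.(?:" + alternatives + r"|\w{2,3})$")

def extract(host, pattern=build_pattern()):
    """Return the matched tail of the host, or None if nothing matches."""
    m = pattern.search(host)
    return m.group(0) if m else None

for host in ("answers.yahoo.com", "www.google.com",
             "uk.answers.yahoo.co.uk", "www.g.se",
             "www.example.info", "www.example.org.uk"):
    print(host, "=>", extract(host))
```

The four hosts from the question reduce to `yahoo.com`, `google.com`, `yahoo.co.uk` and `g.se`. The last two hosts are illustrative additions that show the limits of the shortcut: `www.example.info` yields no match because `info` is longer than three characters, and `www.example.org.uk` yields `org.uk` because `org.uk` is not in the exception list.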