
Given website addresses, e.g.

http://www.example.com/page1/#
https://subdomain.example2.co.uk/asdf?retrieve=2

How do I return the root domain in R, e.g.

example.com
example2.co.uk

For my purposes I would define the root domain to have structure

example_name.public_suffix

where example_name excludes "www" and public_suffix is on the list here:

https://publicsuffix.org/list/effective_tld_names.dat

Is this still the best regex-based solution?

https://stackoverflow.com/a/8498629/2109289

What about something in R that parses root domain based off the public suffix list, something like:

http://simonecarletti.com/code/publicsuffix/

Edited: Adding extra info based on Richard's comment

Using XML::parseURI seems to return the stuff between the first "//" and "/". e.g.

> parseURI("http://www.blog.omegahat.org:8080/RCurl/index.html")$server
[1] "www.blog.omegahat.org"
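
For comparison, the same host extraction can be approximated in base R without the XML package. This is only a minimal sketch: `extract_host` is a hypothetical helper name (not part of any package), it assumes URLs of the form `scheme://host[:port]/...`, and it does not handle bracketed IPv6 hosts.

```r
# Hypothetical helper: pull the host name out of a URL using base R regexes.
# Vectorised over `url`, because sub() is vectorised.
extract_host <- function(url) {
  host <- sub("^[A-Za-z][A-Za-z0-9+.-]*://", "", url)  # drop the scheme
  host <- sub("[/?#].*$", "", host)                    # drop path, query, fragment
  host <- sub("^.*@", "", host)                        # drop any userinfo
  sub(":[0-9]*$", "", host)                            # drop the port
}

extract_host("http://www.blog.omegahat.org:8080/RCurl/index.html")
# [1] "www.blog.omegahat.org"
```

Like `parseURI()$server`, this still leaves the `www` and any subdomain in place, so the public suffix step is needed either way.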

Thus, the question reduces to having an R function that can return the public suffix from the URI, or implementing the following algorithm on the public suffix list:

Algorithm
  • Match domain against all rules and take note of the matching ones.
  • If no rules match, the prevailing rule is "*".
  • If more than one rule matches, the prevailing rule is the one which is an exception rule.
  • If there is no matching exception rule, the prevailing rule is the one with the most labels.
  • If the prevailing rule is an exception rule, modify it by removing the leftmost label.
  • The public suffix is the set of labels from the domain which directly match the labels of the prevailing rule (joined by dots).
  • The registered or registrable domain is the public suffix plus one additional label.
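
The steps above could be sketched in base R roughly as follows. This is an illustration under stated assumptions, not a tested implementation: `registrable_domain` is a hypothetical function name, the `rules` vector is a tiny hand-written sample standing in for the real public suffix list, and reading and cleaning the actual `.dat` file (comments, blank lines, internationalised names) is left out.

```r
# Hypothetical helper: apply the public suffix algorithm to one host name.
# `rules` is a character vector of rules such as "co.uk", "*.ck", "!www.ck".
registrable_domain <- function(host, rules) {
  labels <- strsplit(tolower(host), ".", fixed = TRUE)[[1]]
  n <- length(labels)

  is_exception <- startsWith(rules, "!")
  rule_labels <- strsplit(sub("^!", "", rules), ".", fixed = TRUE)

  # A rule matches when its labels equal the rightmost labels of the host;
  # a "*" label in a rule matches any single host label.
  matches <- vapply(rule_labels, function(r) {
    k <- length(r)
    if (k > n) return(FALSE)
    all(r == "*" | r == labels[(n - k + 1):n])
  }, logical(1))

  if (!any(matches)) {
    suffix_len <- 1L                          # no match: prevailing rule is "*"
  } else if (any(matches & is_exception)) {
    idx <- which(matches & is_exception)[1]   # exception rule prevails
    suffix_len <- length(rule_labels[[idx]]) - 1L  # minus its leftmost label
  } else {
    suffix_len <- max(lengths(rule_labels[matches]))  # rule with most labels
  }

  # The host is itself a public suffix, so there is no registrable domain.
  if (n <= suffix_len) return(NA_character_)

  # Public suffix plus one additional label.
  paste(labels[(n - suffix_len):n], collapse = ".")
}

# Sample rules only; the real list is far longer.
rules <- c("com", "uk", "co.uk", "*.ck", "!www.ck")
registrable_domain("subdomain.example2.co.uk", rules)
# [1] "example2.co.uk"
registrable_domain("www.example.com", rules)
# [1] "example.com"
```

The function handles one host at a time; wrapping it in `vapply()` would be one way to apply it to a vector of hosts.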
Alex
  • You ask for "best" but don't say what your requirements are. Plus "root domain" isn't well defined. Who's to say that "subdomain" isn't part of the root domain for a site? You could have completely different sites at "https://apple.example2.co.uk" and "https://microsoft.example2.co.uk" – MrFlick Oct 10 '14 at 02:39
  • Thanks, in your example I would want both of them to return `example2.co.uk`. – Alex Oct 10 '14 at 02:44
  • `XML::parseURI()$server` gets you the server name. Not sure if that's the same as the root domain, but it seems useful – Rich Scriven Oct 10 '14 at 03:02
  • Thanks Richard, that is in fact extremely helpful and can form half of the solution. The other half should probably use the public suffix list to identify the domain name: e.g. `parseURI("http://www.omegahat.org:8080/RCurl/index.html")$server` returns `"www.blog.omegahat.org"` and somehow we should get `omegahat.org`. – Alex Oct 10 '14 at 03:08
  • Your example returns `"www.omegahat.org"` I think. – thelatemail Oct 10 '14 at 03:11
  • Sorry, I copied the wrong piece of code. It should have been: `parseURI("http://www.blog.omegahat.org:8080/RCurl/index.html")$server` returns `[1] "www.blog.omegahat.org"` – Alex Oct 10 '14 at 03:14
  • How is this question opinion based? I have attempted to clearly define what I mean by root domain. There is an answer that gives precisely what I want. – Alex Oct 14 '14 at 10:18
  • Voted to re-open because I'm as confused as @Alex. The concept of an organizational domain is important, with several major uses (e.g., cookies, DMARC) and several libraries providing the functionality to identify them (e.g., tldextract in Python and R, com.google.common.net.InternetDomainName on the JVM). – Peyton Oct 23 '14 at 03:35

2 Answers


There are two tasks here. The first is parsing the URL to get the host name, which can be done with the httr package's parse_url function:

library(httr)  # provides parse_url()

host <- parse_url("https://subdomain.example2.co.uk/asdf?retrieve=2")$hostname
host
# [1] "subdomain.example2.co.uk"

The second is extracting the organizational domain (or root domain, top private domain--whatever you want to call it). This can be done using the tldextract package (which is inspired by the Python package of the same name and uses Mozilla's public suffix list):

library(tldextract)  # provides tldextract()

domain.info <- tldextract(host)
domain.info
#                       host subdomain   domain   tld
# 1 subdomain.example2.co.uk subdomain example2 co.uk

tldextract returns a data frame, with a row for each domain you give it, but you can easily paste together the relevant parts:

paste(domain.info$domain, domain.info$tld, sep=".")
# [1] "example2.co.uk"
Peyton

Something like this should help:

> strsplit(gsub("http://|https://|www\\.", "", "http://www.example.com/page1/#"), "/")[[c(1, 1)]]
[1] "example.com"

> strsplit(gsub("http://|https://|www\\.", "", "https://subdomain.example2.co.uk/asdf?retrieve=2"), "/")[[c(1, 1)]]
[1] "subdomain.example2.co.uk"
Prasanna Nandakumar
  • The OP believes the second example should return "example2.co.uk", not "subdomain.example2.co.uk" – MrFlick Oct 10 '14 at 02:45