Given website addresses, e.g.
http://www.example.com/page1/#
https://subdomain.example2.co.uk/asdf?retrieve=2
How do I return the root domain in R, e.g.
example.com
example2.co.uk
For my purposes I would define the root domain to have structure
example_name.public_suffix
where example_name excludes "www" and public_suffix is on the list here:
https://publicsuffix.org/list/effective_tld_names.dat
Is this still the best regex-based solution:
https://stackoverflow.com/a/8498629/2109289
What about something in R that parses the root domain based on the public suffix list, something like:
http://simonecarletti.com/code/publicsuffix/
Edited: Adding extra info based on Richard's comment
Using XML::parseURI seems to return the host name, i.e. the part between the first "//" and the next "/" (without the port), e.g.
> parseURI("http://www.blog.omegahat.org:8080/RCurl/index.html")$server
[1] "www.blog.omegahat.org"
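If a dependency on the XML package is not wanted, the same host extraction can be approximated in base R. This is only a rough sketch: get_host is a hypothetical helper name, and it assumes ordinary http(s)-style URLs (no IPv6 literals in brackets).

```r
# Hypothetical helper: extract the host from a URL with base R only.
# Assumes a "scheme://host[:port]/path" shape; IPv6 literals are not handled.
get_host <- function(url) {
  host <- sub("^[A-Za-z][A-Za-z0-9+.-]*://", "", url)  # drop the scheme
  host <- sub("[/?#].*$", "", host)                    # drop path, query, fragment
  host <- sub("^.*@", "", host)                        # drop any userinfo
  host <- sub(":[0-9]*$", "", host)                    # drop the port
  tolower(host)
}

get_host("http://www.blog.omegahat.org:8080/RCurl/index.html")
get_host("https://subdomain.example2.co.uk/asdf?retrieve=2")
```

Because sub() is vectorised, the helper also works on a character vector of URLs.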
Thus, the question reduces to having an R function that can return the public suffix from the URI, or implementing the following algorithm on the public suffix list:
- Match domain against all rules and take note of the matching ones.
- If no rules match, the prevailing rule is "*".
- If more than one rule matches, the prevailing rule is the one which is an exception rule.
- If there is no matching exception rule, the prevailing rule is the one with the most labels.
- If the prevailing rule is an exception rule, modify it by removing the leftmost label.
- The public suffix is the set of labels from the domain which directly match the labels of the prevailing rule (joined by dots).
- The registered or registrable domain is the public suffix plus one additional label.
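To make the steps above concrete, here is a minimal base R sketch of how they might be implemented. registrable_domain is a hypothetical function name, and the rules vector below is a tiny hand-picked subset used purely for illustration; it assumes the rules have already had comment and blank lines removed and are in lower case.

```r
# Hypothetical helper: `rules` is a character vector of public suffix rules
# (assumed cleaned: no comment lines, no blank lines, lower case).
registrable_domain <- function(host, rules) {
  # Split into labels and reverse so the TLD comes first.
  labels <- rev(strsplit(tolower(host), ".", fixed = TRUE)[[1]])
  is_exception <- startsWith(rules, "!")
  rule_labels <- lapply(strsplit(sub("^!", "", rules), ".", fixed = TRUE), rev)

  # A rule matches if each of its labels equals the host label at the same
  # position (counted from the right); "*" matches any single label.
  matches <- vapply(rule_labels, function(r) {
    length(r) <= length(labels) &&
      all(r == "*" | r == labels[seq_along(r)])
  }, logical(1))

  if (!any(matches)) {
    n_suffix <- 1L                               # no match: default rule "*"
  } else if (any(matches & is_exception)) {
    i <- which(matches & is_exception)[1]
    n_suffix <- length(rule_labels[[i]]) - 1L    # exception: drop leftmost label
  } else {
    n_suffix <- max(lengths(rule_labels[matches]))  # rule with the most labels
  }

  # The host itself is a public suffix, so there is no registrable domain.
  if (length(labels) <= n_suffix) return(NA_character_)

  # Public suffix plus one additional label.
  paste(rev(labels[seq_len(n_suffix + 1L)]), collapse = ".")
}

# Illustrative subset of rules only, not the real list.
rules <- c("com", "uk", "co.uk", "*.ck", "!www.ck")
registrable_domain("www.example.com", rules)
registrable_domain("subdomain.example2.co.uk", rules)
```

For real use, the full list would need to be read in (for example with readLines) and the lines starting with "//" dropped. Internationalised domain names would need extra handling that this sketch does not attempt.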