I'm working on a page-tracking web app and I'd like to get the canonical domain for a list of sites. As far as I know, there is no good way to tell where a site's ownership of subdomains and top-level domains starts and ends. I'm not sure of the best way to describe that, so here is an example:
If I own a personal URL, mysite.com, I am able to set up subdomains such as www.mysite.com, cdn.mysite.com, and so forth.
If my "group" has a website at a university, such as computerscience.myuni.edu, I might also have control over www.computerscience.myuni.edu, but not myuni.edu.
If I am a huge business and need to spread web traffic out, I might even have www.acme.com, ww2.acme.com, ww3.acme.com, etc.
So nothing is certain, but if I'm given a URL I can probably strip off www., ww2., and cdn., and maybe secure. from the front. Are there any other common "subdomains" I'm not thinking of that are generally not used to serve up a different website?
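To make the idea concrete, here is a minimal Python sketch of the prefix stripping I have in mind. The names `strip_common_prefix` and `COMMON_PREFIXES` are just placeholders of mine, and the prefix list is only the guesses above, not anything standard. It also assumes that removing a single leading label is enough.

```python
from urllib.parse import urlsplit

# Prefixes taken from the examples above; this list is a guess, not a standard.
COMMON_PREFIXES = ("www", "ww2", "ww3", "cdn", "secure")


def strip_common_prefix(url: str) -> str:
    """Return the host of `url` with one leading 'common' label removed."""
    # urlsplit only fills in the hostname when a scheme or "//" is present,
    # so bare hosts such as "cdn.mysite.com" get a "//" prepended.
    host = urlsplit(url if "//" in url else "//" + url).hostname or ""
    labels = host.split(".")
    # Keep at least two labels so a host like "www.com" is left alone.
    if len(labels) > 2 and labels[0] in COMMON_PREFIXES:
        labels = labels[1:]
    return ".".join(labels)
```

With this, `strip_common_prefix("http://www.mysite.com/page")` gives `mysite.com`, while `computerscience.myuni.edu` is left unchanged because its first label is not in the list. It still cannot tell who actually controls a given subdomain, which is the part I am unsure about.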
I guess I'm just trying to figure out the best way to get the real "canonical" domain name for a site.