I'm trying to write a fairly simple function in C# to parse DNS responses, but a few details are awkward, so I'm asking for the best way to handle them.
There are simple regexes and other easy approaches, but they don't handle the edge cases (described below) very well, which is why I'm asking a new question.
I want to take a URL string, like test.bounce.twitter.com, and remove the subdomains to get twitter.com.
However, the edge cases are messing me up:
bounce.twitter.com ---> twitter.com
twitter.com ---> twitter.com
amazon.co.uk ---> amazon.co.uk (Notice the .co.uk domain!)
news.home.barclays ---> home.barclays (Note the gTLD longer than 3 characters; the website was formerly barclays.com and belongs to a billion-dollar bank)
Notably, for TLDs like .uk, taking a URL like news.mail.amazon.co.uk and returning co.uk isn't useful, even though co.uk is technically the correct second-level domain.
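To make the problem concrete, here is a minimal sketch of the naive "keep the last two labels" approach. The class and method names (NaiveDomainStripper, LastTwoLabels) are placeholders made up for illustration, not part of any library. It works for the .com and .barclays cases but returns co.uk for the .co.uk ones.

```csharp
using System;

public static class NaiveDomainStripper
{
    // Hypothetical helper: keeps only the last two dot-separated labels.
    // Assumes the input is a bare host name (no scheme, port, or path).
    public static string LastTwoLabels(string host)
    {
        string[] labels = host.Split('.');

        // Nothing to strip if there are two labels or fewer.
        if (labels.Length <= 2)
        {
            return host;
        }

        return labels[labels.Length - 2] + "." + labels[labels.Length - 1];
    }
}

// Results of the naive approach:
//   bounce.twitter.com      ---> twitter.com    (correct)
//   twitter.com             ---> twitter.com    (correct)
//   news.home.barclays      ---> home.barclays  (correct)
//   amazon.co.uk            ---> co.uk          (wrong, want amazon.co.uk)
//   news.mail.amazon.co.uk  ---> co.uk          (wrong, want amazon.co.uk)
```

Counting labels alone can't tell twitter.com from co.uk, so something has to know which suffixes count as a unit.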
In addition, the input will include URLs whose domains use gTLDs, like website.photography, website.club, or website.gallery, so hard-coding a list of TLDs gets messy really quickly.
How can I strip subdomains from a URL while still handling the edge cases, such as country codes, gTLDs, etc.?