
There are simple regexes and other easy ways to do this, but they don't handle the edge cases (described below) very well, which is why I'm asking a new question.

I'm trying to write a fairly simple function to parse DNS responses in C#, but there are some details that are annoying, so I'm asking to see the best way to solve it.

I want to take a URL string, like test.bounce.twitter.com, and remove the subdomains to get twitter.com.

However, the edge cases are messing me up:

bounce.twitter.com    ---> twitter.com
twitter.com           ---> twitter.com
amazon.co.uk          ---> amazon.co.uk (Notice the .co.uk domain!)
news.home.barclays    ---> home.barclays (Note that the gTLD is longer than 3 characters; the website was formerly barclays.com and belongs to a billion-dollar bank)

Notably, for TLDs like .uk, taking a URL like news.mail.amazon.co.uk and returning co.uk isn't useful, even though co.uk is technically the correct Second Level Domain.

In addition, this function will have to parse URLs whose domains use newer gTLDs, like website.photography or website.club or website.gallery, so hard-coding a list of TLDs gets messy really quickly.

How can I write a function that strips subdomains from the URL while still handling the edge cases, such as country codes, gTLDs, etc.?
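
To make the expected behaviour concrete, here is a minimal sketch of one possible approach: find the longest known public suffix at the end of the host name, then keep exactly one more label to its left. The class name `DomainTrimmer`, the method `GetRegistrableDomain`, and the tiny `SampleSuffixes` set are all hypothetical and only for illustration. A real implementation would presumably load a full suffix list (for example, the Public Suffix List), and this sketch does not handle that list's wildcard or exception rules.

```csharp
using System;
using System.Collections.Generic;

public static class DomainTrimmer
{
    // Hypothetical, hand-picked stand-in for a real suffix list.
    // It only covers the examples from the question.
    private static readonly HashSet<string> SampleSuffixes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "com", "uk", "co.uk", "barclays", "photography", "club", "gallery"
        };

    // Returns the matched suffix plus one label to its left,
    // or the input unchanged if no suffix in the set matches.
    public static string GetRegistrableDomain(string host)
    {
        string[] labels = host.Trim().TrimEnd('.').Split('.');

        // Try candidate suffixes from longest to shortest, so that
        // "co.uk" is matched before "uk".
        for (int i = 0; i < labels.Length; i++)
        {
            string candidate = string.Join(".", labels, i, labels.Length - i);
            if (!SampleSuffixes.Contains(candidate))
            {
                continue;
            }

            // The host is itself a bare suffix: nothing to strip.
            if (i == 0)
            {
                return candidate;
            }

            // Keep the suffix and the single label immediately before it.
            return string.Join(".", labels, i - 1, labels.Length - i + 1);
        }

        // Unknown suffix: return the cleaned-up input as-is.
        return string.Join(".", labels);
    }
}
```

With this sample set, `DomainTrimmer.GetRegistrableDomain("news.mail.amazon.co.uk")` yields `amazon.co.uk` and `DomainTrimmer.GetRegistrableDomain("news.home.barclays")` yields `home.barclays`. Matching the longest suffix first is what keeps `co.uk` from being reduced to `uk`.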

james chang
  • not sure why people are voting this as offtopic? – Keith Nicholas May 04 '17 at 23:08
  • see http://stackoverflow.com/questions/1066933/how-to-extract-top-level-domain-name-tld-from-url – Keith Nicholas May 04 '17 at 23:10
  • Possible duplicate of [How to extract top-level domain name (TLD) from URL](http://stackoverflow.com/questions/1066933/how-to-extract-top-level-domain-name-tld-from-url) – itsme86 May 04 '17 at 23:20
  • Thanks a lot for the link, it's a good start. It doesn't handle Unicode gTLDs well though (it's a country code list for the most part). It's also not in C#, but translating it should be easy enough. – james chang May 04 '17 at 23:30
