I have written a small function to remove subdomains (if any) from an input domain string:

def rm(text):
    print(text.replace(text, '.'.join(text.split('.')[-2:])), end="")
    print("\n")

if __name__ == "__main__":
    rm("me.apple.com")
    rm("not.me.apple.com")
    rm("really.not.me.apple.com")
    # problem here
    rm("bbc.co.uk")

It works fine until the domain ends in a two-part TLD of the form .something.something, like .co.uk or .co.in.

So my output is:

apple.com
apple.com
apple.com
--> co.uk

when it should have been:

apple.com
apple.com
apple.com
bbc.co.uk

How do I fix the function in an elegant way, instead of checking for every possible two-part TLD?

Edit: I will have to check millions of domains, if that matters. What I want is to pass a domain to my function and get back a clean domain with no subdomains.

Jishan
  • @StephenRauch a function. Probably fast, as I will pass it one domain at a time, from a list of say, 1 million domains. – Jishan Dec 17 '17 at 07:11
  • There's no other way. `co.uk` isn't a valid domain, but `co.de` is. Thus, `foo.co.uk` would reduce to `foo.co.uk`, but `foo.co.de` should become just `co.de`. There are a few libraries that handle all of these special cases. – Blender Dec 17 '17 at 07:15
  • @Blender My thoughts exactly! That is the reason I got stuck! Btw, can you name the libraries? – Jishan Dec 17 '17 at 07:16
  • How about? https://github.com/john-kurkowski/tldextract – Stephen Rauch Dec 17 '17 at 07:18
  • Possible dup: https://stackoverflow.com/questions/1066933/how-to-extract-top-level-domain-name-tld-from-url – Stephen Rauch Dec 17 '17 at 07:18

2 Answers

The tldextract package should do the heavy lifting for you, based on the Public Suffix List. It isn't bulletproof, but it should work for all reasonable use cases:

import tldextract

def rm(text):
    # registered_domain is the registrable label plus the public suffix, e.g. bbc.co.uk
    return tldextract.extract(text).registered_domain

John Zwinck
Mureinik
  • Did like this: `ext = tldextract.extract(domain_with_subdomain); return ext.registered_domain` – Jishan Dec 17 '17 at 07:30
  • @Jeet.Deir Completely forgot about that one! Definitely simpler than my initial suggestion. Edited and fixed. – Mureinik Dec 17 '17 at 07:34

You can't. Not without querying some sort of service--DNS at a minimum--or encoding a database of answers in your function.

Why not? Because you can't describe precisely in words what you are trying to do. For example, "me.apple.com" should resolve to "apple.com", "me.apple.co.uk" should resolve to "apple.co.uk", but what should "a.b.c.d.e" resolve to? There's no way to know unless the examples are cherry-picked in a way that their content suggests (but still does not define) the right answer.

Once you come up with a textual description of the algorithm, it will be implementable.
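
One such description might be: "find the longest known public suffix at the end of the name and keep exactly one more label to its left". The following is a minimal sketch of that rule, not a complete solution. `KNOWN_SUFFIXES` and `registered_domain` are hypothetical names, and the suffix set is a tiny hand-written sample; a real one would have to come from a maintained database.

```python
# Illustrative sample only (an assumption); a real set would be loaded
# from a maintained suffix database.
KNOWN_SUFFIXES = frozenset({"com", "uk", "co.uk", "in", "co.in", "de"})


def registered_domain(name, suffixes=KNOWN_SUFFIXES):
    """Return the longest known suffix plus one label, or None if undecidable."""
    labels = name.lower().rstrip(".").split(".")
    # Walk from the longest candidate suffix (the whole name) to the shortest.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            if i == 0:
                return None  # the name is itself a bare suffix, e.g. "co.uk"
            return ".".join(labels[i - 1:])
    return None  # no known suffix matches, e.g. "a.b.c.d.e"


if __name__ == "__main__":
    for domain in ("me.apple.com", "bbc.co.uk", "foo.co.de", "a.b.c.d.e"):
        print(domain, "->", registered_domain(domain))
```

Each lookup is a handful of set membership tests, one per label, so running it over millions of names needs no network access. The result is only as good as the suffix set behind it.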

You can use a "whois" service to do the heavy lifting: https://www.whois.com/whois/ - this does what you want if you're willing to make HTTP requests.

John Zwinck
  • Guessed as much :(. Are there any libraries that understand this, other than dnspython for querying each domain individually? That would be too costly and time-consuming. – Jishan Dec 17 '17 at 07:18
  • I believe there are databases with 'known' domain suffixes. Mozilla has one, I believe. – ddofborg Oct 27 '22 at 12:06
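
The Mozilla-maintained database is the Public Suffix List. Its text format has plain rules, wildcard rules (`*.`), exception rules (`!`), and `//` comments. The sketch below shows roughly how such rules could be read and applied. `SAMPLE_RULES`, `parse_rules` and `registrable` are hypothetical names, and the sample is a few inlined lines rather than the real file. It simplifies the published algorithm: there is no implicit `*` rule for unknown TLDs and no internationalized-name handling.

```python
# A few lines in Public Suffix List format, inlined for illustration.
SAMPLE_RULES = """\
// comment lines start with two slashes
com
uk
co.uk
*.ck
!www.ck
"""


def parse_rules(text):
    """Split PSL-style text into (exact, wildcard, exception) rule sets."""
    exact, wildcard, exception = set(), set(), set()
    for line in text.splitlines():
        parts = line.split()  # a rule ends at the first whitespace
        if not parts or parts[0].startswith("//"):
            continue
        rule = parts[0].lower()
        if rule.startswith("!"):
            exception.add(rule[1:])
        elif rule.startswith("*."):
            wildcard.add(rule[2:])
        else:
            exact.add(rule)
    return exact, wildcard, exception


def registrable(name, rules):
    """Return the public suffix plus one label, or None if there is none."""
    exact, wildcard, exception = rules
    labels = name.lower().rstrip(".").split(".")
    for i in range(len(labels)):  # longest candidate first
        candidate = ".".join(labels[i:])
        parent = ".".join(labels[i + 1:])
        if candidate in exception:
            # An exception rule marks the candidate itself as registrable.
            return candidate
        if candidate in exact or (parent and parent in wildcard):
            # The candidate is a public suffix; keep one more label if present.
            return None if i == 0 else ".".join(labels[i - 1:])
    return None


if __name__ == "__main__":
    rules = parse_rules(SAMPLE_RULES)
    for domain in ("news.bbc.co.uk", "x.a.b.ck", "foo.www.ck"):
        print(domain, "->", registrable(domain, rules))
```

Parsing happens once up front, so the per-domain cost stays at a few set lookups.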