I want to get the name of a website from a URL in a simple way. For example, given "https://www.google.com/" or any other URL, I want to extract the "google" part.

The issue is that there are many pitfalls. The subdomain could be www3, or the scheme could be http instead of https. The URL could also look like the Python docs, "https://docs.python.org/3/library/urllib.parse.html#module-urllib.parse", in which case I only want "python".

Is there a simple way to do this? The only approach I can think of is chaining lots and lots of str.removeprefix calls or something like that, but that's ugly. I could not find anything in the urllib library that does what I'm looking for, but maybe another library does?

1 Answer

Here's an idea:

import re

url = 'https://python.org'
url_ext = ['.com', '.org', '.edu', '.net', '.co.uk', '.cc', '.info', '.io']

web_name = ''
# Cuts off extension and everything after
for ext in url_ext:
    if ext in url:
        web_name = url.split(ext)[0]

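# Reverse the string to find the first non-alphanumeric character
web_name = web_name[::-1]
match = re.search(r'\W+', web_name)
# No match means there is no scheme or subdomain to strip (e.g. 'python.org')
final = web_name[:match.start()] if match else web_name

# Reverse the string again and print the result
print(final[::-1])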

The code starts by cutting off the website's extension and everything that follows it. It then reverses the string, uses the re module to find the first non-alphanumeric character, and keeps only what comes before it. Finally, it reverses the string again and prints the result.

This code probably won't work for every website, since there are countless ways to structure a URL, but it should work for you to some degree.
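One way to make the same idea a bit more robust is to let urllib.parse isolate the host first, so that the path, query, and fragment can never be mistaken for an extension. The following is only a minimal sketch under that assumption: site_name and KNOWN_SUFFIXES are hypothetical names, and the suffix list is still a hand-maintained lookup table that will miss any extension not listed in it.

```python
from urllib.parse import urlsplit

# Assumed, deliberately incomplete list of extensions (hypothetical example data)
KNOWN_SUFFIXES = ('.co.uk', '.com', '.org', '.edu', '.net', '.info', '.cc', '.io', '.me')


def site_name(url):
    """Return the label just before a known suffix, or None if no suffix matches."""
    # urlsplit only fills in the host when a scheme (or a leading '//') is present
    if '://' not in url:
        url = '//' + url
    host = urlsplit(url).hostname or ''
    # Try longer suffixes first so '.co.uk' is checked before shorter ones
    for suffix in sorted(KNOWN_SUFFIXES, key=len, reverse=True):
        if host.endswith(suffix):
            # Drop the suffix, then keep only the last remaining label
            return host[:-len(suffix)].rsplit('.', 1)[-1] or None
    return None


print(site_name('https://www.google.com/'))  # google
print(site_name('https://docs.python.org/3/library/urllib.parse.html#module-urllib.parse'))  # python
```

Because only the host is inspected, subdomains such as www3 and a missing scheme are handled the same way, but an unlisted extension simply returns None.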

Jackie
  • I do know how to do it with funny splits and by removing parts, but I don't want to rely on a lookup table. As you said, though, covering all the edge cases seems to be a non-trivial exercise. For my application, for example, I would need to add .me to your list. That's why I figured there would be a general solution. I have now gone with an answer I hadn't found before (this question should be marked as a duplicate of it): tldextract.extract(site).domain from the tldextract module. That makes it a one-liner that will hopefully work well. – JustSomeCoder Sep 27 '22 at 10:03