I am trying to get the website name from the host of a URL (e.g. "www.google.com" -> google, "facebook.com" -> facebook). Currently I have this simple function:

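    private fun getWebsiteNameFromUri(host: String): String {
        // Drop everything up to and including the first '.' ("www.google.com" -> "google.com")
        val cutString = host.substringAfter(".")

        // Keep what precedes the next '.' ("google.com" -> "google", but "medium.com" -> "com")
        return cutString.substringBefore(".")
    }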

It works reasonably well for many websites, but of course there are many others where it fails. For example, "medium.com" just returns "com".

I also tried counting the '.' characters and other approaches, but again they work for some hosts and not for others.

Is there a convention for extracting such a thing? If not, how can I extract the website name? Is a heavy regex the only option?

ScrapeW
  • I'm not a web developer, so I'm not very familiar with all the possible cases, but shouldn't there be a clear way to define the name of a website? If I go with your example, domain is the name and .google is the extension, isn't it? BTW, I thought about getting the title/website name directly from the HTML using `jsoup`, but it seems that many sites have a long title or are missing the name part. @AdamMillerchip – ScrapeW Oct 14 '20 at 09:33

1 Answer

Maybe try putting your link into the `URL` class, like this:

URL url = new URL(address);

and then you may use

String host = url.getHost();

and a lot of other methods.
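As a minimal sketch of the idea above, the snippet below parses an address and reads its host. The class name `UrlParts` and the method name `hostOf` are made up for illustration. It builds the `URL` via `URI.create(...).toURL()` rather than the `URL(String)` constructor, which is deprecated in recent Java versions. Note that it assumes the address carries a scheme such as `https://`.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

// Hypothetical helper class, not part of any library.
class UrlParts {
    // Returns the host part of an absolute address, e.g. "www.google.com".
    static String hostOf(String address) {
        try {
            // toURL() rejects addresses without a scheme (IllegalArgumentException).
            URL url = URI.create(address).toURL();
            // Other getters on URL include getProtocol(), getPath() and getQuery().
            return url.getHost();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + address, e);
        }
    }
}
```

For example, `UrlParts.hostOf("https://www.google.com/search?q=kotlin")` returns `"www.google.com"`, so the host still has to be reduced to the site name afterwards.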

Edit: this question looks like a duplicate...

snachmsm
  • The host that the function receives is already the result of calling `url.getHost`; it returns the full host (www.google.com). – ScrapeW Oct 14 '20 at 09:44
  • I think this is actually the best idea, but if you pass a `String mediumCom = "medium.com";` this will throw a `MalformedURLException` due to the missing protocol. If one can make sure there will be a protocol, the `URL` will definitely be a good way to go. – deHaar Oct 14 '20 at 09:45
  • @ScrapeW You narrow down the cases to be handled manually by doing this. – deHaar Oct 14 '20 at 09:45
  • Check out my edit and the linked duplicate question, which has multiple answers. – snachmsm Oct 14 '20 at 09:47
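To tie the comments together, here is a rough sketch that handles the missing protocol and then picks the label just before the last dot. `WebsiteName`, `hostOf` and `nameOf` are hypothetical names, and defaulting to `https://` is an assumption. The "second-to-last label" rule is only a heuristic: it is wrong for multi-part suffixes such as `co.uk`, and handling those reliably would need a suffix list such as the Public Suffix List.

```java
import java.net.URI;

// Hypothetical helper class, not part of any library.
class WebsiteName {
    // Returns the host, adding a default scheme when none is present (assumption: https).
    static String hostOf(String address) {
        String withScheme = address.contains("://") ? address : "https://" + address;
        String host = URI.create(withScheme).getHost();
        if (host == null) {
            throw new IllegalArgumentException("No host found in: " + address);
        }
        return host;
    }

    // Naive heuristic: take the label directly before the last one.
    static String nameOf(String address) {
        String[] labels = hostOf(address).toLowerCase().split("\\.");
        // A single-label host such as "localhost" is returned unchanged.
        return labels.length >= 2 ? labels[labels.length - 2] : labels[0];
    }
}
```

With this, `WebsiteName.nameOf("www.google.com")` gives `"google"` and `WebsiteName.nameOf("medium.com")` gives `"medium"`, while `WebsiteName.nameOf("bbc.co.uk")` gives `"co"`, which shows where the heuristic breaks down.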