I have looked through many posts for a possible answer, but none of them seem to solve my problem:

How to remove some part of URL by regex?

Best way to Remove the domain from url

In Java, how do I extract the domain of a URL?

I will have a URL whose exact form I don't know in advance. For example:

https://somevalue.google.com/something

or

www.somevalue.google.com/something

or

somevalue.localhost:8080/something

I need to strip ONLY the domain (not the subdomain), together with the http(s) prefix, www, .com, or :8080, but this seems to be more difficult than expected.

I have tried this regex:

"^(http[s]?://www\\.|http[s]?://|www\\.)"

With it I was able to remove http, https, and/or www.

From there I thought it would be easy to add more combinations, such as [\w] or `[?:*]`, but I am not getting a proper match.

I based this on http://zetcode.com/kotlin/regularexpressions/, which explains what each pattern is for, but had no success.

Any idea what I'm doing wrong?

I don't want to get rid of the subdomain (somevalue) either.

So, from

https://somevalue.google.com/something...

I want to get something like

somevalue/something....
jpganz18

1 Answer

In plain Java you could try the following regex: (?i)(?:[a-z]+://)?(?:[^/]+)(/.*)?

  • the first (?i) will make it case-insensitive
  • the second part ((?:[a-z]+://)?) will match an optional protocol in a non-capturing group
  • the third part ((?:[^/]+)) will match anything up to the next slash, i.e. the domain and any optional port, also in a non-capturing group
  • the last part ((/.*)?) will capture anything starting with a slash (if present) into a capturing group - that's the group you want to keep
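As a rough illustration of the breakdown above, the pattern can be applied with `Pattern` and `Matcher`, reading the path from group 1. This is only a sketch: the class name `PathExtractor` and the helper `extractPath` are made up for this example, and returning an empty string when there is no path is an assumption, not something the question specifies.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PathExtractor {
    // Pattern from the answer: optional protocol, then host[:port], then an optional path.
    private static final Pattern URL_PATTERN =
            Pattern.compile("(?i)(?:[a-z]+://)?(?:[^/]+)(/.*)?");

    // Hypothetical helper: returns the captured path (group 1), or "" if none is present.
    static String extractPath(String url) {
        Matcher m = URL_PATTERN.matcher(url);
        if (m.matches() && m.group(1) != null) {
            return m.group(1);
        }
        return "";
    }

    public static void main(String[] args) {
        System.out.println(extractPath("https://somevalue.google.com/something")); // /something
        System.out.println(extractPath("www.somevalue.google.com/something"));     // /something
        System.out.println(extractPath("somevalue.localhost:8080/something"));     // /something
    }
}
```

Note that this version drops the whole host, including the subdomain, so it only covers the "remove everything before the path" reading of the question.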

Edit:

It seems I missed that you want to keep the subdomains as well. Try the following adjusted regex:
(?i)^(?:[a-z]+://)?(?:www\.)?(.*?)(?:\.[^./]+){2}(/.*)?$

Changes:

  • I added ^...$ to match the entire string, which is needed for the next part
  • right after the protocol part (?:www\.)? will match www. if that is present
  • after that (.*?) will match the subdomain if present
  • the domain part has been changed from (?:[^/]+) to (?:\.[^./]+){2}, which now matches a dot followed by anything except a dot or a slash, repeated exactly 2 times. That would match .google.com, .google.com:1234, etc. Because the leading dot is required, at least one character of subdomain has to come before it.

To get somevalue/something... from https://www.somevalue.google.com:1234/something... you'd then use the following code in Java:

String regex = "(?i)^(?:[a-z]+://)?(?:www\\.)?(.*?)(?:\\.[^./]+){2}(/.*)?$";
String replaced = "https://www.somevalue.google.com:1234/something...".replaceAll(regex, "$1$2");
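
To see how the same regex behaves on the three URL shapes from the question, it can be wrapped in a small method and run against each of them. This is a sketch only: `DomainStripper` and `stripDomain` are names invented for this example. The third case is worth noticing: the `{2}` quantifier expects two dot-separated labels after the subdomain, so a host like `somevalue.localhost:8080` does not match, and `replaceAll` returns the input unchanged.

```java
class DomainStripper {
    // Same regex as in the answer: group 1 is the subdomain, group 2 the path.
    private static final String REGEX =
            "(?i)^(?:[a-z]+://)?(?:www\\.)?(.*?)(?:\\.[^./]+){2}(/.*)?$";

    // Hypothetical helper: keeps subdomain + path when the regex matches,
    // otherwise returns the input as-is (replaceAll leaves non-matching input alone).
    static String stripDomain(String url) {
        return url.replaceAll(REGEX, "$1$2");
    }

    public static void main(String[] args) {
        // Two-label domain with protocol: prints somevalue/something
        System.out.println(stripDomain("https://somevalue.google.com/something"));
        // Two-label domain with www and no protocol: prints somevalue/something
        System.out.println(stripDomain("www.somevalue.google.com/something"));
        // Single-label domain (localhost): no match, prints the input unchanged
        System.out.println(stripDomain("somevalue.localhost:8080/something"));
    }
}
```

If single-label hosts such as `localhost` have to be supported, the quantifier would need to be relaxed (for example `{1,2}`), at the cost of misreading hosts like `google.com` as subdomain `google`; which trade-off is right depends on the actual inputs.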

Note that this might still not fit all your requirements (which we don't know exactly), so keep in mind that if they get more complex it might be better and easier to parse the URL properly.
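
Parsing the URL properly could look roughly like the following, using `java.net.URI` from the standard library. This is a minimal sketch under several assumptions that are not in the question: the caller supplies a list of known base domains (here `google.com` and `localhost`), a dummy `http://` scheme is prepended when none is present so that `URI` recognises the host, and the names `UriSubdomainExtractor` and `subdomainAndPath` are invented for illustration.

```java
import java.net.URI;
import java.util.List;
import java.util.Locale;

class UriSubdomainExtractor {
    // Hypothetical helper: keeps subdomain + path, drops scheme, "www.", base domain and port.
    // baseDomains is an assumed, caller-supplied list of domains to strip.
    static String subdomainAndPath(String url, List<String> baseDomains) {
        // Without a scheme, URI would not treat the first segment as a host,
        // so prepend a dummy one when it is missing.
        String withScheme = url.contains("://") ? url : "http://" + url;
        URI uri = URI.create(withScheme); // throws IllegalArgumentException on invalid syntax
        String host = uri.getHost();
        if (host == null) {
            return url; // no recognisable host; leave the input untouched
        }
        host = host.toLowerCase(Locale.ROOT);
        if (host.startsWith("www.")) {
            host = host.substring("www.".length());
        }
        String subdomain = host; // fallback when no known base domain matches
        for (String base : baseDomains) {
            if (host.equals(base)) {
                subdomain = "";
                break;
            }
            if (host.endsWith("." + base)) {
                subdomain = host.substring(0, host.length() - base.length() - 1);
                break;
            }
        }
        return subdomain + uri.getRawPath();
    }

    public static void main(String[] args) {
        List<String> bases = List.of("google.com", "localhost");
        System.out.println(subdomainAndPath("https://somevalue.google.com/something", bases)); // somevalue/something
        System.out.println(subdomainAndPath("www.somevalue.google.com/something", bases));     // somevalue/something
        System.out.println(subdomainAndPath("somevalue.localhost:8080/something", bases));     // somevalue/something
    }
}
```

Listing the base domains explicitly avoids guessing how many labels belong to the domain, which is the part the regex cannot know for hosts like `localhost`.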

Thomas
  • unfortunately, when I use this I get my whole string removed :| – jpganz18 Aug 23 '19 at 15:13
  • @jpganz18 Well, you didn't state how exactly you use that expression. In Java you could do something like this: `String replaced = original.replaceAll(regex, "somevalue$1");` to get `somevalue/something`. Alternatively, remove the last part that contains the capturing group from the regex and use `original.replaceAll(regex,"somevalue")`. – Thomas Aug 23 '19 at 15:31