Is it possible to create a regex to remove the entire domain (without the subdomain) in Kotlin/Java, including (or not) ports?

Question

I have looked through many posts for a possible answer, but none of them seems to solve my problem:

How to remove some part of URL by regex?

In Java, how do I extract the domain of a URL?

Basically, I will have a URL whose exact form I don't know in advance, such as:

https://somevalue.google.com/something

or

www.somevalue.google.com/something

or

somevalue.localhost:8080/something

Basically, I need to get rid of the domain ONLY (not the subdomain), along with the http(s), www, .com or :8080, but this seems to be more difficult than expected.

I have tried this regex:

"^(http[s]?://www\\.|http[s]?://|www\\.)"

With that I was able to remove http, https, and/or www.

From there I thought it would be easy to add more combinations, such as

[\w] or `[?:*]`, but I am not getting a proper match.

I based this on http://zetcode.com/kotlin/regularexpressions/, which explains what each pattern is for, but with no success.

Any idea what I'm doing wrong?

I don't want to get rid of the subdomain (somevalue) either.

So, from

https://somevalue.google.com/something...

I would like to get something like

somevalue/something....

So in all your examples you want to get `/something` in the end? — Thomas, Aug 23 '19 at 14:58

Thomas · Accepted Answer · 2019-08-23T15:51:20.333

In plain Java you could try the following regex: (?i)(?:[a-z]+://)?(?:[^/]+)(/.*)?

the first (?i) will make it case-insensitive
the second part ((?:[a-z]+://)?) will match an optional protocol in a non-capturing group
the third part ((?:[^/]+)) will match anything up to the next slash, i.e. the domain and any optional port, also in a non-capturing group
the last part ((/.*)?) will capture anything starting with a slash (if present) into a capturing group - that's the group you want to keep
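
As a minimal sketch of how that first expression behaves on the three URLs from the question, the following uses `java.util.regex` and returns only the captured path. The class name `PathOnlyDemo` and the helper `pathOf` are made up for illustration and are not part of the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PathOnlyDemo {
    // The expression from the answer; it contains no backslashes, so no extra escaping is needed.
    static final Pattern HOST_AND_PATH =
            Pattern.compile("(?i)(?:[a-z]+://)?(?:[^/]+)(/.*)?");

    // Hypothetical helper: returns the captured path (group 1),
    // or an empty string if the input does not match or has no path.
    static String pathOf(String url) {
        Matcher m = HOST_AND_PATH.matcher(url);
        if (m.matches() && m.group(1) != null) {
            return m.group(1);
        }
        return "";
    }

    public static void main(String[] args) {
        String[] inputs = {
            "https://somevalue.google.com/something",
            "www.somevalue.google.com/something",
            "somevalue.localhost:8080/something"
        };
        for (String input : inputs) {
            System.out.println(input + " -> " + pathOf(input));
        }
    }
}
```

Each of the three inputs yields `/something`, because the whole host, subdomain included, falls into the non-capturing `[^/]+` part.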

Edit:

It seems I missed that you want to keep the subdomains as well. Try the following adjusted expression:
(?i)^(?:[a-z]+://)?(?:www\.)?(.*?)(?:\.[^./]+){2}(/.*)?$

Changes:

I added ^...$ to match the entire string, which is needed for the next part
right after the protocol part (?:www\.)? will match www. if that is present
after that (.*?) will match the subdomain if present
the domain part has been changed from (?:[^/]+) to (?:\.[^./]+){2}, which now matches a dot followed by anything except a dot or a slash, repeated 2 times. That would match .google.com, .google.com:1234 etc.

To get somevalue/something... from https://www.somevalue.google.com:1234/something... you'd then use the following code in Java:

String regex = "(?i)^(?:[a-z]+://)?(?:www\\.)?(.*?)(?:\\.[^./]+){2}(/.*)?$";
String replaced = "https://www.somevalue.google.com:1234/something...".replaceAll(regex, "$1$2");
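
To see how the adjusted expression behaves on the three URLs from the question, here is a small self-contained sketch. The class name `SubdomainRegexDemo` and the helper `strip` are hypothetical; only the regex and the `replaceAll` call come from the answer.

```java
class SubdomainRegexDemo {
    // The adjusted expression from the answer, escaped for a Java string literal.
    static final String REGEX =
            "(?i)^(?:[a-z]+://)?(?:www\\.)?(.*?)(?:\\.[^./]+){2}(/.*)?$";

    // Hypothetical helper: keeps group 1 (subdomain) and group 2 (path).
    // If the expression does not match, replaceAll returns the input unchanged.
    static String strip(String url) {
        return url.replaceAll(REGEX, "$1$2");
    }

    public static void main(String[] args) {
        String[] inputs = {
            "https://somevalue.google.com/something",
            "www.somevalue.google.com/something",
            "somevalue.localhost:8080/something"
        };
        for (String input : inputs) {
            System.out.println(input + " -> " + strip(input));
        }
    }
}
```

The first two inputs become `somevalue/something`. The third is returned unchanged: `localhost:8080` is a single dot-prefixed segment, so the `{2}` quantifier cannot be satisfied and nothing is replaced.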

Note that this might still not fit all your requirements (which we don't know exactly), so keep in mind that if they get more complex it might be better/easier to parse the URL properly.
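
One possible way to parse the URL instead of matching it is sketched below with `java.net.URI`. This is only an illustration under stated assumptions: the class `UrlPartsDemo`, the helper `subdomainAndPath` and its `domainLabels` parameter are hypothetical, and the caller has to say how many labels make up the domain (2 for google.com, 1 for localhost), since that cannot be derived from the string alone.

```java
import java.net.URI;
import java.util.Arrays;

class UrlPartsDemo {
    // Hypothetical helper: returns "<subdomain><path>" for the given URL.
    // domainLabels is an assumption supplied by the caller:
    // 2 for a domain like google.com, 1 for localhost.
    static String subdomainAndPath(String url, int domainLabels) {
        // java.net.URI only reports a host when a scheme and "//" are present,
        // so prepend a placeholder scheme if the input has none.
        String withScheme = url.contains("://") ? url : "http://" + url;
        URI uri = URI.create(withScheme); // throws IllegalArgumentException on bad syntax
        String host = uri.getHost();      // excludes the port
        if (host == null) {
            throw new IllegalArgumentException("No host found in: " + url);
        }
        // Drop a leading "www." (case-insensitive) if present.
        if (host.regionMatches(true, 0, "www.", 0, 4)) {
            host = host.substring(4);
        }
        // Keep every label except the last domainLabels ones.
        String[] labels = host.split("\\.");
        int keep = Math.max(0, labels.length - domainLabels);
        String subdomain = String.join(".", Arrays.copyOfRange(labels, 0, keep));
        String path = uri.getRawPath() == null ? "" : uri.getRawPath();
        return subdomain + path;
    }

    public static void main(String[] args) {
        System.out.println(subdomainAndPath("https://www.somevalue.google.com:1234/something", 2));
        System.out.println(subdomainAndPath("www.somevalue.google.com/something", 2));
        System.out.println(subdomainAndPath("somevalue.localhost:8080/something", 1));
    }
}
```

With these label counts, all three calls give `somevalue/something`, including the localhost case with a port.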

unfortunately, when I use this I get my whole string removed :| — jpganz18, Aug 23 '19 at 15:13
@jpganz18 well, you didn't state how exactly you use that expression. In Java you could do something like this: `String replaced = original.replaceAll(regex, "somevalue$1");` to get `somevalue/something`. Alternatively remove the last part that contains the capturing group from the regex and use `original.replaceAll(regex,"somevalue")`. — Thomas, Aug 23 '19 at 15:31
