
I have already looked at and tried several other threads, and none of them work for me. I need a regex solution, not Java code that does it without a regex.

Some of the threads I have already checked: Get domain name from given url, Extract host name/domain name from URL string, and Java regex to extract domain name? None of them work for me: either the regex doesn't work, or the solution is Java code without a regex.

What am I trying to do?

Case 1 (extract the host name):
Input: https://api.twitter.com/blog/category/2?user=42&status=enabled
Output: api.twitter.com

Input: abc.xyz.com/blog/category/2?user=42&status=enabled
Output: abc.xyz.com

Case 2 (extract the domain name):
Input: https://abc.xyz.com/blog/category/2?user=42&status=enabled
Output: xyz.com

Input: abc.xyz.com/blog/category/2?user=42&status=enabled
Output: xyz.com

I need two regexes, one for each case above. If it can be done with a single regex, that works too.

I tried the regex below, from the first thread:

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?

This one works when the URL has https:// or any other scheme, but fails when there is no scheme in the URL.
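To make the failure concrete, here is a minimal Java sketch that runs the quoted regex against both kinds of input. The class name `UriRegexGroups` and the helper `group` are illustrative assumptions, not something from the linked threads. Without a scheme, the `//` part never matches, so group 4 stays empty and the host ends up at the front of group 5 (the path).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UriRegexGroups {
    // The generic URI-splitting regex quoted above, unchanged.
    static final Pattern URI = Pattern.compile(
        "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?");

    // Returns capture group n, or null if that group did not take part in the match.
    static String group(String input, int n) {
        Matcher m = URI.matcher(input);
        return m.find() ? m.group(n) : null;
    }

    public static void main(String[] args) {
        String withScheme = "https://api.twitter.com/blog/category/2?user=42&status=enabled";
        String noScheme = "abc.xyz.com/blog/category/2?user=42&status=enabled";

        System.out.println(group(withScheme, 4)); // api.twitter.com
        System.out.println(group(noScheme, 4));   // null
        System.out.println(group(noScheme, 5));   // abc.xyz.com/blog/category/2
    }
}
```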

So far I am solving the first case with a two-step solution.

Step 1: Remove the scheme
(.*://)(.*) -> $2
This removes everything up to and including the string "://".

Step 2: Extract the host name
([^/]*)(.*) -> $1
The first group captures everything before the first "/", that is, every non-slash character up to the first slash.
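The two steps above can be sketched in Java as two chained replacements. This is only an illustration under assumptions: the class name `TwoStepHost` and its method names are made up here, and `String.replaceFirst` stands in for whatever replacement mechanism the target system offers.

```java
class TwoStepHost {
    // Step 1: drop everything up to and including "://", if present.
    // Inputs without "://" do not match and are returned unchanged.
    static String stripScheme(String url) {
        return url.replaceFirst("(.*://)(.*)", "$2");
    }

    // Step 2: keep only the characters before the first "/".
    static String hostOnly(String withoutScheme) {
        return withoutScheme.replaceFirst("([^/]*)(.*)", "$1");
    }

    static String host(String url) {
        return hostOnly(stripScheme(url));
    }

    public static void main(String[] args) {
        System.out.println(host("https://api.twitter.com/blog/category/2?user=42&status=enabled")); // api.twitter.com
        System.out.println(host("abc.xyz.com/blog/category/2?user=42&status=enabled"));             // abc.xyz.com
    }
}
```

One caveat with step 1: because `.*` is greedy, a URL whose query string itself contains "://" would be cut at the last "://", not the first.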
Bikas Katwal
  • Why is the `abc` not part of the output in case 2, and how would a regex know when to use which? – geanakuch Jun 08 '21 at 05:36
  • Do you understand the regexp you quoted? (As in, enough to debug and maintain it) – Thorbjørn Ravn Andersen Jun 08 '21 at 05:38
  • For case 2, I can use a two-step solution if case 1 works for me: first extract the host name, then extract the domain name from it. – Bikas Katwal Jun 08 '21 at 05:39
  • @ThorbjørnRavnAndersen I am looking for a solution. I can assure you I won't put something in production if I don't understand it. In fact, I am trying to understand the regex to make it work for my use case. I am certainly not a regex expert, at least not for one this complex :D – Bikas Katwal Jun 08 '21 at 05:42
  • I need these regexes for use in a search engine (Elasticsearch). I would have preferred easily understandable Java code, but that is something I can't put in an Elasticsearch schema mapping. – Bikas Katwal Jun 08 '21 at 05:44
  • I think you should look for another solution, because regexps are not magic. Have you had a closer look at the URL class? – Thorbjørn Ravn Andersen Jun 08 '21 at 06:02
  • Do you want to match what you call the "output" as group 1 of the match? Or do you want to match any url with `".xyz.com/"` in it? – Bohemian Jun 08 '21 at 06:03
  • Not necessarily group one. If I can extract it in any group, that works. For instance, in the regex I quoted, the pattern I am looking for is in group 4. – Bikas Katwal Jun 08 '21 at 06:06
  • @ThorbjørnRavnAndersen Yes, I understand it is not magic. I know the URL class and the solution that uses it. As I said, I need this for my ES analyzers. Handling it elsewhere with the URL class or a piece of Java code is not an option. If it is not possible, I have another "inefficient" solution using ES tokenizers that will still tokenize and keep the domain as split tokens. – Bikas Katwal Jun 08 '21 at 06:10
  • @Bohemian Not specifically `.xyz.com`; it can be any domain name. – Bikas Katwal Jun 08 '21 at 06:13
  • @BikasKatwal: Why does output #2 show `abc.xyz.com` but output #3 and #4 show only `xyz.com`? – anubhava Jun 08 '21 at 06:30
  • I need both strings (the domain name and the host name). – Bikas Katwal Jun 08 '21 at 06:35
  • I have edited my question with my 2 step solution for the first case. It would be great if that can be done in a single step. But I think that isn't possible? As I do not know if the scheme will be there or not? – Bikas Katwal Jun 08 '21 at 06:43
  • ok [check this](https://regex101.com/r/XNO1NB/1) – anubhava Jun 08 '21 at 06:49
  • @anubhava Thanks, that works :) Could you add this as the answer? I will use `.*` instead of `https`, as the scheme can be anything; I forgot to mention that in the question. – Bikas Katwal Jun 08 '21 at 07:02

1 Answer


You may use this regex with optional matches and capture groups:

^(?:\w+://)?((?:[^./?#]+\.)?([^/?#]+))

RegEx Demo

RegEx Details:

  • ^: Start
  • (?:\w+://)?: Optionally match scheme names followed by ://
  • (: Start capture group #1
    • (?:[^./?#]+\.)?: Optionally match first part of domain name using a non-capture group
    • ([^/?#]+): Match 1+ of any character that is not /, ?, or # in capture group #2 (the rest of the host name after the optional first part)
  • ): End capture group #1
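As a rough illustration of how the two capture groups map onto the two cases in the question, here is a minimal Java sketch. The class name `HostAndDomain` and the methods `host` and `domain` are hypothetical names chosen for this example; only the pattern itself comes from the answer. Group 1 yields the full host name (case 1) and group 2 yields what remains after the first dot-separated part (case 2).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HostAndDomain {
    // The regex from the answer, with backslashes escaped for a Java string literal.
    static final Pattern P =
        Pattern.compile("^(?:\\w+://)?((?:[^./?#]+\\.)?([^/?#]+))");

    // Case 1: capture group 1 is the whole host name.
    static String host(String url) {
        Matcher m = P.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    // Case 2: capture group 2 is the host name minus its first label.
    static String domain(String url) {
        Matcher m = P.matcher(url);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String a = "https://api.twitter.com/blog/category/2?user=42&status=enabled";
        String b = "abc.xyz.com/blog/category/2?user=42&status=enabled";
        System.out.println(host(a));   // api.twitter.com
        System.out.println(host(b));   // abc.xyz.com
        System.out.println(domain(b)); // xyz.com
    }
}
```

Note that group 2 simply drops the first label, so the pattern assumes hosts of the form `sub.domain.tld`. For a host with no subdomain, such as `xyz.com/blog`, group 2 would be `com` while group 1 would still be `xyz.com`.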
anubhava