2

In a table, I have lists of full urls like :

https://www.example.com/page-1/product-x?utm-source=google

Objective : I want to extract the domain name part of the url only :

https://www.example.com/

I was using this following formula :

=REGEXEXTRACT(A1;"^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n?]+)")

The regex is working fine when testing it :

https://www.example.com/

However in Google sheet, It displays like :

example.com
  • Why the resutls are not the same for the identical regex ?
  • How to correct it in google Sheet ?
B4b4j1
  • 430
  • 2
  • 7
  • 24

1 Answers1

4

You can fix the pattern by removing the capturing group (i.e. here, ([^:\/\n?]+) => [^:\/\n?]+) or by converting the capturing groups to non-capturing ones (i.e. ([^:\/\n?]+) => (?:[^:\/\n?]+)):

=REGEXEXTRACT(A1;"^(?:https?://)?(?:[^@\n]+@)?(?:www\.)?[^:/\n?]+")
=REGEXEXTRACT(A1;"^(?:https?://)?(?:[^@\n]+@)?(?:www\.)?(?:[^:/\n?]+)")

NOTE:

  • If the regex contains capturing group(s), the REGEXEXTRACT returns captured value(s)
  • If there are no capturing groups in the regex, the function returns the whole match value only.

Note you do not need to escape / forward slashes in RE2 regexps since they are defined with the help of string literals in Google Sheets.

The pattern may be reduced to ^(?:https?://)?[^:/\n?]+, that matches http:// or https:// optionally, and then matches one or more chars other than /, newline, or ?.

See this RE2 regex demo.

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563