0

I want to query the string between https:// or http:// and the first delimeter characters that comes after it. For example, if the field contains:

https://google.com/en/
https://www.yahoo.com?en/

I want to get:

google.com
www.yahoo.com

My initial query that will capture the / only contains two substring_index as follows:

SELECT substring_index(substring_index(mycol,'/',3),'://',-1)
FROM mytable;

Now I found that the URLs may contain multiple delimeters. I want my statament to capture multiple delimeters possibilities which are (each one is a separate character):

:/?#[]@!$&'()*+,;=

How to do this in my statement? I tried this solution but the end result the command could not be executed due to syntax error while I am sure I followed the solution. Can anyone help me correctly construct the query to capture all the delimeter characters I listed above?

I use MySQL workbecnh 6.3 on Ubuntu 18.04.

EDIT:

Some corrections made in the first example of URLs.

user9371654
  • 2,160
  • 16
  • 45
  • 78

2 Answers2

0

In MySQL 8+, this should work:

SELECT regexp_replace(regexp_substr(mycol, '://[a-zA-Z0-9_.]+[/:?]'), '[^a-zA-Z0-9_.]', '')
FROM (SELECT 'https://google.com/en' as mycol union all
      SELECT 'https://www.yahoo.com?en'
     ) x

In older versions, this is much more challenging because there is no way to search for a string class.

One brute force method is:

select (case when substring_index(mycol, '://', -1) like '%/%'
             then substring_index(substring_index(mycol, '://', -1), '/', 1)
             when substring_index(mycol, '://', -1) like '%?%'
             then substring_index(substring_index(mycol, '://', -1), '?', 1)
             . . .   -- and so on for each character
             else substring_index(mycol, '://', -1) 
        end) as what_you_want

The [a-zA-Z0-9_.] is intended to be something like the valid character class for your domain names.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • I do not get: In `order versions`, do you mean `older versions`? Anyways, I am using 6.3 so it is ok. But the challenge to capture the domain names characters. They are not just alphabetical characters. They contains numbers and dashes `-` if I am correct. – user9371654 Apr 13 '19 at 14:04
  • Are you sure the `?` character in `LIKE` should be preceded by `\` as you write `%\?%` I looked at special characters in `LIKE` [here](https://dev.mysql.com/doc/refman/8.0/en/string-literals.html) but it seems only the `%` and `_` should be escaped?? plz correct me if I'm wrong? – user9371654 Apr 13 '19 at 14:39
  • @user9371654 . . . `?` does not need to be escaped for `LIKE`. – Gordon Linoff Apr 13 '19 at 15:38
0

First, note that https://www.yahoo.com?en/ seems like an unlikely URL, because it has a path separator contained inside the query string. In any case, if you are using MySQL 8+, then consider using its regex functionality. The REGEXP_REPLACE function can be helpful here, using the following pattern:

https?://([A-Za-z_0-9.-]+).*

Sample query:

WITH yourTable AS (
    SELECT 'https://www.yahoo.com?en/' AS url UNION ALL
    SELECT 'no match'
)

SELECT
    REGEXP_REPLACE(url, 'https?://([A-Za-z_0-9.-]+).*', '$1') AS url
FROM yourTable
WHERE url REGEXP 'https?://[^/]+';

Demo

The term $1 refers to the first capture group in the regex pattern. An explicit capture group is denoted by a quantity in parentheses. In this case, here is the capture group (highlighted below):

https?://([A-Za-z_0-9.-]+).*
          ^^^^^^^^^^^^^^^

That is, the capture group is the first portion of the URL path, including domain, subdomain, etc.

Tim Biegeleisen
  • 502,043
  • 27
  • 286
  • 360
  • 1
    `https://www.yahoo.com?en/ seems like an unlikely URL`. This is just an example. But I have large dataset that contains such syntax or similar.. – user9371654 Apr 13 '19 at 14:00
  • Fine. My answer should handle all of your cases (I think). – Tim Biegeleisen Apr 13 '19 at 14:00
  • but domain names can contain numbers and dashes. Your `[A-Za-z_.]+` does not capture this. – user9371654 Apr 13 '19 at 14:02
  • Good point, I updated my regex. You should be able to get away with just a single call to `REGEXP_REPLACE`. – Tim Biegeleisen Apr 13 '19 at 14:04
  • and underscores are also allowed in domain names. Sorry I did not know that but a [search](https://stackoverflow.com/questions/2180465/can-domain-name-subdomains-have-an-underscore-in-it) now says they are legal. – user9371654 Apr 13 '19 at 14:11
  • Unfortunately, I get: Error Code: 1305. `REGEXP_REPLACE does not exist`. I use MysQL workbench 6.3. How to upgrade? I also use Ubuntu 18.04 so you can tell me the required command plz. – user9371654 Apr 13 '19 at 14:14
  • Checkout the [download page for MySQL 8+](https://dev.mysql.com/downloads/mysql/). If you have a long term need for such regex support, then yes, you should consider upgrading. – Tim Biegeleisen Apr 13 '19 at 14:15
  • I have `5.7.25-0ubuntu0.18.04.2`. I do not want to miss thing. If upgrading can cause issues I have big data. I already downloaded the older version and everything in mydatabase is connected and I already have schemes and data. – user9371654 Apr 13 '19 at 14:19
  • OK...at this point you have potentially 3 valid answers to the question you originally asked. If you have some upgrade issue, then you should open a new question, possibly on the Ubuntu or SuperUser site. Good luck! – Tim Biegeleisen Apr 13 '19 at 14:20
  • I upgraded. But what is '$1' is for? I looked at the manual, it is not clear for me why you are using it? Can you clarify plz. – user9371654 Apr 14 '19 at 04:33
  • & how to ensure that the resulting string that matches the regex is not null? because I added `WHERE url is NOT NULL`, but I got an error saying that no column called url. – user9371654 Apr 14 '19 at 04:36
  • Please check my updated answer. Use `WHERE col IS NOT NULL`, where `col` is the name of the column containing the URL. – Tim Biegeleisen Apr 14 '19 at 04:37
  • Please read the above comment again. Your question deviates from the OP, and changing my answer again would make it less useful for others later on. – Tim Biegeleisen Apr 14 '19 at 05:07
  • Thanks. But I want the result of the matching is NOT NULL not the col. so in this example I should do: `WHERE REGEXP_REPLACE('https://www.yahoo.com?en/', 'https?://([A-Za-z_0-9.-]+).*', '$1') IS NOT NULL`, I guess? – user9371654 Apr 14 '19 at 05:44
  • @user9371654 Check my updated query. Use `REGEXP` in the `WHERE` clause to assert that the URL matches. – Tim Biegeleisen Apr 14 '19 at 07:17