1

I need to extract the domain name for a list of urls using PostgreSQL. In the first version, I tried using REGEXP_REPLACE to replace unwanted characters like www., biz., sports., etc. to get the domain name.

 SELECT REGEXP_REPLACE(url, ^((www|www2|www3|static1|biz|health|travel|property|edu|world|newmedia|digital|ent|staging|cpelection|dev|m-staging|m|maa|cdnnews|testing|cdnpuc|shipping|sports|life|static01|cdn|dev1|ad|backends|avm|displayvideo|tand|static03|subscriptionv3|mdev|beta)\.)?', '') AS "Domain", 
 COUNT(DISTINCT(user)) AS "Unique Users"
 FROM db
 GROUP BY 1
 ORDER BY 2 DESC;

This seems unfavorable as the query needs to be constantly updated for list of unwanted words.

I did try https://stackoverflow.com/a/21174423/10174021 to extract from the end of the line using PostgreSQL REGEXP_SUBSTR but, I'm getting blank rows in return. Is there a more better way of doing this?

A dataset sample to try with:

 CREATE TABLE sample (
 url VARCHAR(100) NOT NULL);

 INSERT INTO sample url) 
 VALUES 
 ("sample.co.uk"),
 ("www.sample.co.uk"),
 ("www3.sample.co.uk"),
 ("biz.sample.co.uk"),
 ("digital.testing.sam.co"),
 ("sam.co"),
 ("m.sam.co");

Desired output

+------------------------+--------------+
|    url                 |  domain      |
+------------------------+--------------+
| sample.co.uk           | sample.co.uk |
| www.sample.co.uk       | sample.co.uk |
| www3.sample.co.uk      | sample.co.uk |
| biz.sample.co.uk       | sample.co.uk |
| digital.testing.sam.co | sam.co       |
| sam.co                 | sam.co       |
| m.sam.co               | sam.co       |
+------------------------+--------------+
user123
  • 161
  • 2
  • 14
  • Can you make a list of "doubled TLDs" like co.uk? – sticky bit May 07 '19 at 09:48
  • Meaning? You want me to create more variation of doubled TLDs in the sample data? – user123 May 07 '19 at 09:53
  • No not in the sample data, that's OK as far as I'm concerned. But a possible solution I could imagine would match the end of the DNS names. But that may give you just co.uk instead of sample.co.uk. So these "doubled TLDs" need a special handling. That's why I as if you can make a list of them. After all the computer cannot "know" that co.uk is actually to be treated as *one* TLD. – sticky bit May 07 '19 at 09:58
  • This is exactly where I got stuck. The TLDs could either be .co.uk, .co or .uk. – user123 May 07 '19 at 10:02
  • One option may be to store all tlds in another table with a hierarchy and use that to extract only the required part of the URL giving preference of extraction based on that priority. – Kaushik Nayak May 07 '19 at 11:06
  • 1
    This is far more complicated than it probably seems at first glance (I've tried to do it in the past). Take a look at some of the python libraries (tld or tldextract) that do this. They generally start with the full list of tlds available here: https://publicsuffix.org/list/ . It's quite long... – Jeremy May 07 '19 at 12:30
  • Thank you for this. It is very helpful. It covers all the different TLDs in my original dataset. I have to incorporate the Python library into my SQL query. – user123 May 08 '19 at 02:30

3 Answers3

3

So, I've found the solution using Jeremy and Rémy Baron's answer.

  1. Extract all the public suffix from public suffix and store into a table which I labelled as tlds.

  2. Get the unique urls in the dataset and match to its TLD. part1

  3. Extract the domain name using regexp_replace (used in this query) or alternative regexp_substr(t1.url, '([a-z]+)(.)'||t1."tld"). The final output: final_output

The SQL query is as below:

WITH stored_tld AS(
SELECT 
DISTINCT(s.url),
FIRST_VALUE(t.domain) over (PARTITION BY s.url ORDER BY length(t.domain) DESC
                            rows between unbounded preceding and unbounded following) AS "tld" 
FROM sample s 
JOIN tlds t 
ON (s.url like '%%'||domain))

SELECT 
t1.url,
CASE WHEN t1."tld" IS NULL THEN t1.url ELSE regexp_replace(t1.url,'(.*\.)((.[a-z]*).*'||replace(t1."tld",'.','\.')||')','\2') 
END AS "extracted_domain" 
FROM(
    SELECT a.url,st."tld"
    FROM sample a
    LEFT JOIN stored_tld st
    ON a.url = st.url
    )t1

Links to try: SQL Tester

user123
  • 161
  • 2
  • 14
1

You can try this :

with tlds as (
     select * from (values('.co.uk'),('.co'),('.uk')) a(tld)
) ,
sample as (
    select * from (values ('sample.co.uk'),
                          ('www.sample.co.uk'),
                          ('www3.sample.co.uk'),
                          ('biz.sample.co.uk'),
                          ('digital.testing.sam.co'),
                          ('sam.co'),
                          ('m.sam.co')
                   ) a(url)
     ) 
  select url,regexp_replace(url,'(.*\.)(.*'||replace(tld,'.','\.')||')','\2') "domain" from (
            select distinct url,first_value(tld) over (PARTITION BY url order by length(tld) DESC) tld 
               from sample join tlds on (url like '%'||tld) 
         ) a
Rémy Baron
  • 1,379
  • 8
  • 15
1

I use split_part(url,'/',3) for this :

select split_part('https://stackoverflow.com/questions/56019744', '/', 3) ;

output

stackoverflow.com
vjeantet
  • 291
  • 2
  • 5
  • If you want to remove the "www." `select regexp_replace(split_part('https://stackoverflow.com/questions/56019744', '/', 3),'^www\.','');` – Alexandre Testu Jul 27 '22 at 10:18