4

I'm having trouble parsing urls in Postgres. I have a database full of customers and urls associated with them. I need an array of the unique domains associated with each customer. I'd love to be able to do the parsing in my query instead of dumping my results to Python and parsing it there.

In the postgres docs I found this, but can't figure out how to incorporate it into my query:

SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.html');  

  alias   |  description  |            token               
----------+---------------+------------------------------  
 protocol | Protocol head | http://  
 url      | URL           | example.com/stuff/index.html  
 host     | Host          | example.com  
 url_path | URL path      | /stuff/index.html  

(http://www.postgresql.org/docs/9.3/static/textsearch-parsers.html)

I'm starting with a table, like this:

customer_id | url 
-------------+--------------------   
000001      | www.example.com/fish  
000001      | www.example.com/potato  
000001      | www.potato.com/artichoke
000002      | www.otherexample.com

My code so far:

SELECT customer_id, array_agg(url)
FROM customer_url_table
GROUP BY customer_id

Which gives me:

customer_id | unique_domains
-----------------------------
000001      | {www.example.com/fish, www.example.com/potato, www.potato.com/greenery}
000002      | {www.otherexample.com}

I want a table like this:

customer_id | unique_domains
-----------------------------
000001      | {example.com, potato.com}
000002      | {otherexample.com}

Working on a PostgreSQL 9.3.3 database that lives on AWS.

davidshere
  • 315
  • 3
  • 10
  • Why using the manual of version 8.3 when working on version 9.3? A lot has changed between 8.3 and 9.3 – Frank Heikens Jul 24 '14 at 08:45
  • @frank-heikens, thanks for pointing that out. The relevant portion in this doc is the same in 9.3 as 8.3, so I've just updated the link to reflect that. – davidshere Jul 24 '14 at 20:15

2 Answers2

4

The document you linked above is for use with a Postgres text search parser. That requires a separate configuration to setup, and may be more overhead and/or a different sort of thing than you are looking for.

If you do want to go that route, to setup a text parser, you can find more info here:

http://www.postgresql.org/docs/9.3/static/sql-createtsconfig.html

However, if you want to do the parsing inline in Postgres, I would recommend using a procedural Postgres language, where you can import parsing libraries in that language.

You mentioned Python, so you could use PL/Python and a url parsing library such as urlparse (called urllib.parse in Python 3).

More info about urlparse

That includes this example code:

>>> from urlparse import urlparse
>>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
>>> o   
ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
            params='', query='', fragment='')
>>> o.scheme
'http'
>>> o.port
80
>>> o.geturl()
'http://www.cwi.nl:80/%7Eguido/Python.html'

Going beyond that example, you can get the hostname with the hostname member:

>>> print o.hostname
www.cwi.nl

If you want properly parse out just the domain name (there are lots of edge cases and variants -- i.e. minus the www and any other assorted parts that may be there -- an approach such as in this answer would be best.

For more information about setting up PL/Python, you can go here:

http://www.postgresql.org/docs/9.3/static/plpython.html

So, that's how you could do the parsing in Postgres

instead of dumping my results to Python and parsing it there

It ends up coming a bit full circle with the PL/Python, but if you really want to do the parsing within SQL (especially for performance reasons, say, across a large data set), going with PL/Python may be worth the extra effort.

Community
  • 1
  • 1
khampson
  • 14,700
  • 4
  • 41
  • 43
  • Yep, that'd be my suggestion too. If performance is crucial you could instead write a simple C extension to wrap a C library that does URL parsing and return a result set from it, but that's a bit more work. – Craig Ringer Jul 24 '14 at 01:36
  • Thanks, that's really helpful. I didn't know PL/Python was a thing. I've been looking for a decent tutorial in creating functions and doing SQL scripting in general, but I haven't been successful. – davidshere Jul 24 '14 at 20:39
0

You can create a function

create or replace function fn_get_url_path(in_link text) returns text as
$$
begin
  return (select token
          from (select token, alias
                from ts_parse('default', in_link)
                       natural join ts_token_type('default')) tokens
          where tokens.alias = 'url_path');

end;
$$ language PLpgSQL;

Then use the function:

select fn_get_url_path(column1)
from (values ('https://www.example.com'),
         ('https://www.example.com/test.html'),
         ('https://www.example.com/test?a=1'),
         ('this doesnt work and will return null')) a

To get the result:

/*
  null,
  /test.html
  /test?a=1
  null
*/
Pål Thingbø
  • 1,211
  • 1
  • 17
  • 17