I'm having trouble parsing urls in Postgres. I have a database full of customers and urls associated with them. I need an array of the unique domains associated with each customer. I'd love to be able to do the parsing in my query instead of dumping my results to Python and parsing it there.
In the postgres docs I found this, but can't figure out how to incorporate it into my query:
SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.html');
alias | description | token
----------+---------------+------------------------------
protocol | Protocol head | http://
url | URL | example.com/stuff/index.html
host | Host | example.com
url_path | URL path | /stuff/index.html
(http://www.postgresql.org/docs/9.3/static/textsearch-parsers.html)
I'm starting with a table, like this:
customer_id | url
-------------+--------------------
000001 | www.example.com/fish
000001 | www.example.com/potato
000001 | www.potato.com/artichoke
000002 | www.otherexample.com
My code so far:
SELECT customer_id, array_agg(url)
FROM customer_url_table
GROUP BY customer_id
Which gives me:
customer_id | unique_domains
-----------------------------
000001 | {www.example.com/fish, www.example.com/potato, www.potato.com/greenery}
000002 | {www.otherexample.com}
I want a table like this:
customer_id | unique_domains
-----------------------------
000001 | {example.com, potato.com}
000002 | {otherexample.com}
Working on a PostgreSQL 9.3.3 database that lives on AWS.