1

I have a PostgreSQL table with the following relevant fields:

url
title
created_at

There can be many rows that contain identical URLs but different titles. Here are some sample rows:

www.nytimes.com | The New York Times         | 2016-01-01 00:00:00`
www.wsj.com     | The Wall Street Journal    | 2016-01-03 15:32:13`
www.nytimes.com | The New York Times Online  | 2016-01-06 07:19:08`

I'm trying to obtain an output that lists the following fields:

1) url
2) title that corresponds to the highest value of created_at
3) count of all title for that unique url

So, output rows for the above sample would look something like this:

www.nytimes.com | The New York Times Online | 2
www.wsj.com     | The Wall Street Journal   | 1

Based on the numerous SO posts I've read on similar questions, it looks like my best option for obtaining the first two fields (url and latest title) would be to use DISTINCT ON:

select distinct on (url) url, title from headlines order by url, created_at desc 

Likewise, to obtain the first and third fields (url and count of all title), I could simply use GROUP BY:

select url, count(title) from headlines group by url

What I can't figure out is how to combine the above methodologies and obtain the above-mentioned three values I'm trying to get.

(Edited to provide more clarity.)

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
jayp
  • 192
  • 2
  • 13

3 Answers3

5

This can be done in a single SELECT with a single scan over the table - by combining a window function with DISTINCT ON:

SELECT DISTINCT ON (url)
       url, title, count(*) OVER (PARTITION BY url) AS ct 
FROM   headlines 
ORDER  BY url, created_at DESC NULLS LAST;

SQL Fiddle.

Related (with detailed explanation):

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
3

Try;

select t1.url, t2.title, t1.cnt
from (
  select url, count(title) cnt 
  from headlines 
  group by url
) t1
join (
  select distinct on (url) url, title 
  from headlines 
  order by url, created_at desc
) t2 on t1.url = t2.url
order by t1.url

join both queries on url

sql fiddle demo

Praveen
  • 8,945
  • 4
  • 31
  • 49
  • I like both answers (yours and @Erwin's) but chose this one because it seems to run slightly faster on my dataset. – jayp Jan 10 '16 at 19:36
  • 1
    @jayp: Best performance depends on details of your setup and data distribution. There is no single query for this performing best in every case. [Details in the provided link .](http://stackoverflow.com/a/7630564/939860) – Erwin Brandstetter Jan 11 '16 at 16:48
1

Try this:

select t1.url,t1.title,t2.count from headlines t1 
inner join(
select url,count(*) as count,max(created_at) as created_at
from headlines group by url ) t2 on t1.url=t2.url and t1.created_at=t2.created_at;

SQL Fiddle: http://sqlfiddle.com/#!15/f7665f/11

Abhishek Ginani
  • 4,511
  • 4
  • 23
  • 35