Return row info based on one distinct column value

Question

I want to return a table that consists of one column that contains only distinct projectid, a second column that displays the max(date) from within the given date-range and a third column that shows additional information. I'm still really new to sql.

Query:

select distinct (a.projectid), versiondate, newvalue 
from a 
   inner join b on a.projectid = b.projectid 
where b.contractor = 'SQA Contractor Company-1' 
  and a.attributename = 'Status' 
  and versiondate between '2014-10-01 00:00:00' and '2014-10-01 23:59:59' 
group by a.projectid, versiondate, newvalue 
order by versiondate

What it's returning:

projectid  |         versiondate    |  newvalue
-----------+------------------------+--------------------
p27641     |  2014 10 01 12:23:18   | In work         
p27641     |  2014 10 01 12:23:21   | In billing      
p27641     |  2014 10 01 12:23:45   | completed          
p19397     |  2014 10 01 12:25:03   | pending review       
p19397     |  2014 10 01 12:25:42   | pending assignment    
p10397     |  2014-10-01 12:26:18   | pending-acceptance

What I want it to return:

only the distinct projectid's that represent the most recent versiondate. I need to see what the newvalue was on the most recent date for each distinct projectid.

Trying to put that into words was kind of tough so hopefully I explained my question well. Any help/criticism is accepted.

Welcome to SO. Check out the [edit history](http://stackoverflow.com/posts/26786833/revisions) of your question and compare the original post with its current form. Might provide some clues for future questions. — Erwin Brandstetter, Nov 06 '14 at 20:59
My original post was atrocious. I will keep what you said in mind. — sp_conway, Nov 06 '14 at 21:52

score 0 · Accepted Answer · answered Nov 06 '14 at 19:18

This sort of question needs to be approached in multiple steps.

First, find the most reason version date for each project, then find the corresponding new value.

select a.projectid, a.max_versiondate, b.newvalue
from
    (select a.projectid, max(b.versiondate) max_versiondate
    from   a
    inner join b 
            on a.projectid = b.projectid
    where b.contractor = 'SQA Contractor Company-1'
      and a.attributename = 'Status'
      and b.versiondate between '2014-10-01 00:00:00' and '2014-10-01 23:59:59'
    group by
            a.project id) a 
inner join b
        on b.projectid = a.projectid
       and b.versiondate = a.max_versiondate

Another option is to use window aggregate methods to find the max value inline, and filter based on that.

select *
from
(
select a.projectid, versiondate, newvalue, ROW_NUMBER() OVER(PARTITION BY projectid ORDER BY versiondate DESC) rn
from a 
   inner join b on a.projectid = b.projectid 
where b.contractor = 'SQA Contractor Company-1' 
  and a.attributename = 'Status' 
  and versiondate between '2014-10-01 00:00:00' and '2014-10-01 23:59:59' 
group by a.projectid, versiondate, newvalue 
order by versiondate
)
where rn = 1

both should give you the same results, it's a matter of your indexing and table size to see which is most effective.

Wow. Option 2 worked on the first try! Took me a minute to understand what was going on but it does exactly what I'm looking for it to do. Thanks so much. — sp_conway, Nov 06 '14 at 19:29

score 0 · Answer 2 · edited May 23 '17 at 12:20

This can be simpler with the Postgres-specific DISTINCT ON.

SELECT DISTINCT ON (a.projectid)
       a.projectid, b.versiondate, b.newvalue 
FROM   a 
JOIN   b USING (projectid)
WHERE  a.attributename = 'Status' 
AND    b.contractor = 'SQA Contractor Company-1' 
AND    b.versiondate >= '2014-10-01 0:0'
AND    b.versiondate <  '2014-10-02 0:0' 
ORDER  BY a.projectid, b.versiondate DESC;

The SQL-standard DISTINCT folds duplicates on the whole row. As an extension to this Postgres implemented DISTINCT ON to fold duplicates on selected columns. For other columns in the SELECT list the value from the (consistent) first row is picked, defined by ORDER BY, or arbitrary if not well-defined.

You had an misleading halfbreed of both constructs. The parentheses in your query would make sense in combination with DISTINCT ON, but hardly die DISTINCT. Detailed explanation:

Select first row in each GROUP BY group?

Other points

You don't need GROUP BY in this case.
Ranges on timestamps are typically best expressed with including lower and excluding upper bound. BETWEEN .. AND includes both bounds and is not good for this:
- Select a date range from a timestamp column
Since you picked meaningful column names, you join conveniently join with the USING construct.

Return row info based on one distinct column value

2 Answers2

Other points