How to join two tables while removing repeated entries in one column of one table

Question

I'm not very familiar with SQL and have hit a problem with duplicate-removal that I need some help with.

I have two tables as follows:

config -> ID  -----------
       -> timestamp      |
       -> vcsID          |
       -> platform       |
                         |
data   -> ID             |
       -> configID <-----
       -> value

For each row in the config table there may be multiple rows in the data table, the join being:

data.configID = config.ID

However, in the config table there may be entries in which vcsID has not advanced (these represent builds on the same code from our version control system, and 'value' represents a build metric such as memory size); they are not duplicate rows, because the timestamp will be different, but I need to get rid of them. I want to create a view on the data that removes these duplicates. How do I go about doing that? In MySQL I can do the following:

select *
from (select * from config group by vcsID) as filtered
inner join data
on data.configID = filtered.ID

...but Postgres is strictly compliant with the SQL standard and requires that anything in the Select clause must be in the Group By clause so I can't get the ID field out of the sub-query to do the join.

As an example, some sample data might be:

config
ID    timestamp     vcsID   platform
1   1/1/2014 09:00    18       "x"
2   1/1/2014 20:20    30       "y"
3   1/1/2014 20:25    30       "y"
4   1/1/2014 20:40    31       "y"

data
ID  configID   value
12      1      40000
13      2       125
14      3       125
15      4       130

...and, taking platform "y" as the thing I'm interested in, what I'd like to get out is:

               config                 |         data
ID    timestamp     vcsID  platform   |  ID  configID value
2   1/1/2014 20:20    30      "y"     |  13      2     125
4   1/1/2014 20:40    31      "y"     |  15      4     125

or, equivalently:

               config                 |         data
ID    timestamp     vcsID  platform   |  ID  configID value
3   1/1/2014 20:25    30      "y"     |  14      3     125
4   1/1/2014 20:40    31      "y"     |  15      4     125

Any help is greatly appreciated.

To get the `ID` out, just change to `from (select vcsID from config group by vcsID)`. To get the rest of the columns do a second join (third table) to config joining on configID. — crthompson, Aug 07 '14 at 15:29
You can use min or max of a value if you don't want it in the the group by -- this used to drive me nuts when I transitioned from MySQL to Postgres, though I now regard it as a very good thing. You can also do what you want with table expressions using row_number() over() and partition by. — John Powell, Aug 07 '14 at 15:33
@paqogomez: I see that I can remove the duplicates by using `from (select vcsID from config group by vcsID)` but the vcsID field isn't in the data table, it is the config.ID field that links to data.configID. Without getting the config.ID out how do I join the two? I guess what you might be saying is that we have to redesign the schema so that vcsID goes in the data table also. — Rob, Aug 07 '14 at 15:39
@John Barça: how would I use min() and max() to remove the duplicates though? — Rob, Aug 07 '14 at 15:40
Your comment that postgres is more strict and you need all columns in the group by, well, you can use min/max for those that you don't want in group by and use group by on everything else to get a unique combination. Table expressions are cleaner, but harder to get your head round initially. — John Powell, Aug 07 '14 at 15:59
`data.ID = config.ID`? Contradicts the table definition. I assume you mean `data.configID = config.ID`? — Erwin Brandstetter, Aug 07 '14 at 20:20

score 2 · Accepted Answer · edited May 23 '17 at 11:49

Use DISTINCT ON:

SELECT DISTINCT ON (vcsID) *
FROM   config c
JOIN   data   d ON d.configID = c.ID
ORDER  BY vcsID, "timestamp" DESC;

Assuming you want to pick the latest row from each group of identical vcsID, thus the ORDER BY. If you really don't care which row you get for each vcsID, you don't need ORDER BY. Either way, the leading columns in ORDER BY have to match DISTINCT ON expressions, so you cannot ORDER BY c.id, like you seem to want. You'd need to wrap this in a sub-query and order in the outer query.

Detailed explanation for DISTINCT ON and alternative solutions:

Select first row in each GROUP BY group?

Aside: don't use basic type names like timestamp as identifiers.

That's the answer! DISTINCT ON () is my new favourite phrase :-). — Rob, Aug 07 '14 at 22:30

NullEverything · Answer 2 · 2014-08-07T20:05:40.090

0

Try this query:

select distinct on (vscID) * from config c join data d on d.configID = c.ID where c.platform = "y"

It should give you what you are looking for. The "distinct on (column...)" removes duplicates based on specific column(s).

edited Aug 07 '14 at 20:05

answered Aug 07 '14 at 19:59

NullEverything

450
2
5

How to join two tables while removing repeated entries in one column of one table

2 Answers2