How to select data based on Multiple Unique Columns without applying aggregate functions on rest of columns in resultset

Question

This might look like a duplicate, but I could not find an answer that matches my requirements. I referred to a related question here, but it is not quite the same and its answers do not apply to my case. Hence this question.

SELECT tab1.col1, tab1.col2, tab1.col3, tab2.col3, tab2.col4 
from 
     (SELECT col1, col2, col3 
      FROM table1 
      GROUP BY col1, col2, col3) AS tab1
JOIN tab2 
  ON tab1.col1 = tab2.col1 
 AND tab1.col2 = tab2.col2
GROUP BY tab1.col1, tab1.col2, tab1.col3, tab2.col3, tab2.col4

Example Data:

Table1                          Tab2            
col1  | col2   |   col3         col1  |  col2 | col3  |  col4
=======================         =============================
page1   image1  referer1        page1   image1  150      75
page1   image1  referer1        page1   image1  120      85
page2   image2  referer2        page2   image2  200      400
page1   image1  referer1        page1   image1  750     1024
page2   image2  referer2        page2   image2  450      575
page1   image1  referer1        page1   image1  600      900

Expected Output:

tab1.col1  |  tab1.col2  |  tab1.col3  |  tab2.col3  |   tab2.col4
==================================================================
  page1       image1        referer1        600            900
  page2       image2        referer2        200            400

Here the last GROUP BY returns unique rows, but all the tab1 columns repeat, which I don't want; in this case the records vary only on tab2.col3 and tab2.col4. The requirement is that tab1.col1, tab1.col2, tab1.col3 be unique, with only one matching pair of tab2.col3 and tab2.col4 from tab2 for each combination. I cannot remove tab2.col3 and tab2.col4 from the second GROUP BY and apply a MIN or MAX aggregate to each of them, because then I won't get the tab2.col3 value that belongs to the same record as the tab2.col4 value.
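
To illustrate the problem with per-column aggregates, here is a minimal sketch that loads the Tab2 sample rows into an in-memory SQLite database and applies MIN to each column separately. SQLite is only a stand-in for Redshift here (an assumption of this example), and the variable names are illustrative.

```python
import sqlite3

# In-memory SQLite database standing in for Redshift (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tab2 (col1 TEXT, col2 TEXT, col3 INTEGER, col4 INTEGER)"
)

# Sample rows copied from the Tab2 example data.
rows = [
    ("page1", "image1", 150, 75),
    ("page1", "image1", 120, 85),
    ("page2", "image2", 200, 400),
    ("page1", "image1", 750, 1024),
    ("page2", "image2", 450, 575),
    ("page1", "image1", 600, 900),
]
conn.executemany("INSERT INTO tab2 VALUES (?, ?, ?, ?)", rows)

# Each aggregate is computed independently, so the two values
# may come from different rows of the same group.
mixed = conn.execute(
    """
    SELECT col1, col2, MIN(col3), MIN(col4)
    FROM tab2
    GROUP BY col1, col2
    ORDER BY col1
    """
).fetchall()

real_rows = set(rows)
for row in mixed:
    print(row, "exists in tab2:", row in real_rows)
```

For page1 this produces the pair [120, 75], which is not a row of Tab2: 120 comes from one record and 75 from another.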

Note: I am using Amazon Redshift as the database. The subquery is important because it is actually a complicated result set produced from a three-table join; if I join those three tables directly with tab2, the query runs forever. For the sake of simplicity, let's assume that the subquery returns col1, col2, col3 from table1. Tab2 and tab1 are ginormous tables :D... the subquery improves performance considerably (reduced 20 mins to ~2 mins).

There is nothing complicated in your subquery, and the query itself seems very unclear. Try to set up a SQL Fiddle and explain your goal with some sample data and the expected result you need. — Alex, Nov 02 '16 at 18:30
Please read [**How-to-Ask**](http://stackoverflow.com/help/how-to-ask). And here is a great place to [**START**](http://spaghettidba.com/2015/04/24/how-to-post-a-t-sql-question-on-a-public-forum/) to learn how to improve your question quality and get better answers. — Juan Carlos Oropeza, Nov 02 '16 at 18:36
@Alex Thanks for responding. To keep the question simple I have written the subquery as a simple select. It's actually a select from 3 huge tables. For simplicity, let's assume that the subquery produces col1, col2, col3 from table1, which I join to tab2 for the final result set. My real question is how to select unique values for tab1.col1, tab1.col2, tab1.col3 with a respective tab2.col3, tab2.col4 pair. How to do this? — Roy, Nov 02 '16 at 18:40
@JuanCarlosOropeza Forget the subquery. It's not important. My real question is how to select unique values for tab1.col1, tab1.col2, tab1.col3 with a respective tab2.col3, tab2.col4 pair. How to do this? — Roy, Nov 02 '16 at 18:42
Show us sample data and expected output. [**How to create a Minimal, Complete, and Verifiable example**](http://stackoverflow.com/help/mcve) — Juan Carlos Oropeza, Nov 02 '16 at 19:32
@JuanCarlosOropeza I have modified the post with example data. Hope this clears things up. ~Thanks!! — Roy, Nov 02 '16 at 20:33
@JuanCarlosOropeza You can return any pair of [col3, col4]. For example, [150, 75] and [120, 85] are both valid since the tab1 columns are the same. But you can't mix them: [150, 85] is not valid. This is why I can't use any aggregate functions. — Roy, Nov 02 '16 at 21:38
Your example grabs the "last" row in each group. See [_Groupwise max_](https://mariadb.com/kb/en/mariadb/groupwise-max-in-mariadb/) for a related topic. Probably Windowing functions of MariaDB 10.2 would also help. — Rick James, Nov 02 '16 at 22:11
@RickJames This has to be done in Amazon Redshift, not MariaDB. Please provide a solution consistent with the technology in use. ~Thanks — Roy, Nov 02 '16 at 22:37

score 0 · Accepted Answer · answered Nov 04 '16 at 06:46

If the result below is what you need,

tab1.col1  |  tab1.col2  |  tab1.col3  |  tab2.col3  |   tab2.col4
==================================================================
  page1       image1        referer1        120            85
  page2       image2        referer2        200            400

then you can get it with the following Redshift SQL query:

SELECT tab1.col1, tab1.col2, tab1.col3, tab2.col3, tab2.col4
FROM
  (SELECT col1, col2, col3
   FROM table1
   GROUP BY col1, col2, col3) AS tab1
JOIN
  -- Keep one complete tab2 row per (col1, col2): the first by col3, then col4.
  (SELECT col1, col2, col3, col4
   FROM
     (SELECT col1, col2, col3, col4,
             ROW_NUMBER() OVER (PARTITION BY col1, col2
                                ORDER BY col3 ASC, col4 ASC) AS row_num
      FROM tab2) AS ranked
   WHERE row_num = 1) AS tab2
  ON tab1.col1 = tab2.col1
 AND tab1.col2 = tab2.col2

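The key part of the SQL statement above is "ROW_NUMBER() OVER(PARTITION BY col1, col2 ORDER BY col3 ASC, col4 ASC)". It numbers the rows within each (col1, col2) group, and keeping only row_num = 1 returns one complete row per group: the row with the smallest tab2.col3, with ties broken by the smallest tab2.col4. Because col3 and col4 come from the same row, the pair is never mixed (this is not the same as taking the min of each column separately). If you need the row with the largest col3 instead, change the order to DESC.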
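
As a rough check, the sketch below runs the same ROW_NUMBER() pattern against the sample data from the question. It uses Python's sqlite3 module as a stand-in for Redshift (an assumption; window functions need SQLite 3.25 or later), assumes the second table is named tab2 as in the question, and the helper name pick_rows is hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for Redshift (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE table1 (col1 TEXT, col2 TEXT, col3 TEXT);
    CREATE TABLE tab2 (col1 TEXT, col2 TEXT, col3 INTEGER, col4 INTEGER);
    """
)

# Sample rows copied from the question's example data.
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?)",
    [
        ("page1", "image1", "referer1"),
        ("page1", "image1", "referer1"),
        ("page2", "image2", "referer2"),
        ("page1", "image1", "referer1"),
        ("page2", "image2", "referer2"),
        ("page1", "image1", "referer1"),
    ],
)
conn.executemany(
    "INSERT INTO tab2 VALUES (?, ?, ?, ?)",
    [
        ("page1", "image1", 150, 75),
        ("page1", "image1", 120, 85),
        ("page2", "image2", 200, 400),
        ("page1", "image1", 750, 1024),
        ("page2", "image2", 450, 575),
        ("page1", "image1", 600, 900),
    ],
)

# Rank tab2 rows inside each (col1, col2) group, keep only the first one.
QUERY = """
SELECT tab1.col1, tab1.col2, tab1.col3, ranked.col3, ranked.col4
FROM (SELECT col1, col2, col3
      FROM table1
      GROUP BY col1, col2, col3) AS tab1
JOIN (SELECT col1, col2, col3, col4,
             ROW_NUMBER() OVER (PARTITION BY col1, col2
                                ORDER BY col3 {direction}, col4 {direction}) AS row_num
      FROM tab2) AS ranked
  ON tab1.col1 = ranked.col1
 AND tab1.col2 = ranked.col2
WHERE ranked.row_num = 1
ORDER BY tab1.col1
"""


def pick_rows(direction):
    """Return one joined row per group; direction is 'ASC' or 'DESC'."""
    if direction not in ("ASC", "DESC"):
        raise ValueError("direction must be 'ASC' or 'DESC'")
    return conn.execute(QUERY.format(direction=direction)).fetchall()


smallest = pick_rows("ASC")
largest = pick_rows("DESC")
print(smallest)
print(largest)
```

With ASC the query returns [120, 85] for page1 and [200, 400] for page2; with DESC it returns [750, 1024] and [450, 575]. In both cases each pair is an actual row of Tab2.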

Hope this solves your problem.

You can also check the result-set here which I have written
