from each partition, select rows with N distinct values

Question

I have a table with three columns x, y, z. I'd like to write a query that, within each PARTITION BY x, returns the rows containing the first n distinct values of y.

Here's a sample for n = 2 -- the first 2 distinct values of y in the first partition are 1 and 2, and 4 and 5 in the second partition, so all rows with those values of y are included.

 x   y   z   included?
----------------------
 1   1   1    true
 1   1   2    true
 1   2   3    true
 1   2   4    true
 1   3   5    false
 1   3   6    false
 2   4   7    true
 2   4   8    true
 2   5   9    true
 2   5  10    true
 2   6  11    false
 2   6  12    false

There's a related question that deals with selecting n rows from each partition, but that doesn't deal with the distinct values part.

score 2 · Accepted Answer · answered Sep 30 '14 at 22:00

2

I"m not sure what you mean by "first". SQL tables represent unordered sets. So, I'll assume you mean "smallest".

You can do this using dense_rank():

select t.*
from (select t.*, dense_rank() over (partition by x order by y) as seqnum
      from atable t
     ) t
where seqnum <= 2;

answered Sep 30 '14 at 22:00

Gordon Linoff

1,242,037
58
646
786

What if I'm instead selecting from a subquery that specifies some arbitrary ordering, and want to take the first `n` of those? – rcrogers Oct 01 '14 at 01:44
@rcrogers . . . You would put the ordering into the `order by` clause. – Gordon Linoff Oct 01 '14 at 02:04
Is there a way for the outer query to respect the inner query's ordering, without the outer query also specifying it? – rcrogers Oct 01 '14 at 02:20
1

@rcrogers . . . No. If you want a result set in a particular order, then you should use an `order by` at the outermost level. – Gordon Linoff Oct 01 '14 at 02:33

Laurence · Answer 2 · 2014-09-30T22:19:17.840

You can use a combination of desnse_rank and row_number to eliminate the duplicates:

with a as (
  select
    x, y, z,
    dense_rank() over (partition by x order by y) rk,
    row_number() over (partition by x, y order by z) rn
  from
    t
) select
  x, y, z
from
  a
where
  rk <= 2 and
  rn = 1;

this generates the 1, 2, 4, 5

from this you can get the desired results by joining back onto t:

with a as (
  select
    x, y, z,
    dense_rank() over (partition by x order by y) rk,
    row_number() over (partition by x, y order by z) rn
  from
    t
) select
  t.*
from
  t
where
  exists (
    select
      'x'
    from
      a
    where
      a.y = t.y and
      a.rk <= 2 and
      a.rn = 1
  );

Example SQLFiddle

Although, using exists in this way, makes the duplicates irrelevant, so you can just do:

with a as (
  select
    x, y, z,
    dense_rank() over (partition by x order by y) rk
  from
    t
) select
  t.*
from
  t
where
  exists (
    select
      'x'
    from
      a
    where
      a.y = t.y and
      a.rk <= 2
  );

Example SQLFiddle

from each partition, select rows with N distinct values

2 Answers2