5

I have the following table in Postgres that has overlapping data in the two columns a_sno and b_sno.

create table data
( a_sno integer not null,  
  b_sno integer not null,
  PRIMARY KEY (a_sno,b_sno)
);

insert into data (a_sno,b_sno) values
  ( 4, 5 )
, ( 5, 4 )
, ( 5, 6 )
, ( 6, 5 )
, ( 6, 7 )
, ( 7, 6 )
, ( 9, 10)
, ( 9, 13)
, (10, 9 )
, (13, 9 )
, (10, 13)
, (13, 10)
, (10, 14)
, (14, 10)
, (13, 14)
, (14, 13)
, (11, 15)
, (15, 11);

As you can see from the first 6 rows data values 4,5,6 and 7 in the two columns intersects/overlaps that need to partitioned to a group. Same goes for rows 7-16 and rows 17-18 which will be labeled as group 2 and 3 respectively.

The resulting output should look like this:

group | value
------+------
1     | 4
1     | 5
1     | 6
1     | 7
2     | 9
2     | 10
2     | 13
2     | 14
3     | 11
3     | 15
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
bogeyman
  • 169
  • 1
  • 7

3 Answers3

5

Assuming that all pairs exists in their mirrored combination as well (4,5) and (5,4). But the following solutions work without mirrored dupes just as well.

Simple case

All connections can be lined up in a single ascending sequence and complications like I added in the fiddle are not possible, we can use this solution without duplicates in the rCTE:

I start by getting minimum a_sno per group, with the minimum associated b_sno:

SELECT row_number() OVER (ORDER BY a_sno) AS grp
     , a_sno, min(b_sno) AS b_sno
FROM   data d
WHERE  a_sno < b_sno
AND    NOT EXISTS (
   SELECT 1 FROM data
   WHERE  b_sno = d.a_sno
   AND    a_sno < b_sno
   )
GROUP  BY a_sno;

This only needs a single query level since a window function can be built on an aggregate:

Result:

grp  a_sno  b_sno
1    4      5
2    9      10
3    11     15

I avoid branches and duplicated (multiplicated) rows - potentially much more expensive with long chains. I use ORDER BY b_sno LIMIT 1 in a correlated subquery to make this fly in a recursive CTE.

Key to performance is a matching index, which is already present provided by the PK constraint PRIMARY KEY (a_sno,b_sno): not the other way round (b_sno, a_sno):

WITH RECURSIVE t AS (
   SELECT row_number() OVER (ORDER BY d.a_sno) AS grp
        , a_sno, min(b_sno) AS b_sno  -- the smallest one
   FROM   data d
   WHERE  a_sno < b_sno
   AND    NOT EXISTS (
      SELECT 1 FROM data
      WHERE  b_sno = d.a_sno
      AND    a_sno < b_sno
      )
   GROUP  BY a_sno
   )

, cte AS (
   SELECT grp, b_sno AS sno FROM t

   UNION ALL
   SELECT c.grp
       , (SELECT b_sno  -- correlated subquery
          FROM   data
          WHERE  a_sno = c.sno
          AND    a_sno < b_sno
          ORDER  BY b_sno
          LIMIT  1)
   FROM   cte  c
   WHERE  c.sno IS NOT NULL
   )
SELECT * FROM cte
WHERE  sno IS NOT NULL   -- eliminate row with NULL
UNION  ALL               -- no duplicates
SELECT grp, a_sno FROM t
ORDER  BY grp, sno;

Less simple case

All nodes can be reached in ascending order with one or more branches from the root (smallest sno).

This time, get all greater sno and de-duplicate nodes that may be visited multiple times with UNION at the end:

WITH RECURSIVE t AS (
   SELECT rank() OVER (ORDER BY d.a_sno) AS grp
        , a_sno, b_sno  -- get all rows for smallest a_sno
   FROM   data d
   WHERE  a_sno < b_sno
   AND    NOT EXISTS (
      SELECT 1 FROM data
      WHERE  b_sno = d.a_sno
      AND    a_sno < b_sno
      )
   )   
, cte AS (
   SELECT grp, b_sno AS sno FROM t

   UNION ALL
   SELECT c.grp, d.b_sno
   FROM   cte  c
   JOIN   data d ON d.a_sno = c.sno
                AND d.a_sno < d.b_sno  -- join to all connected rows
   )
SELECT grp, sno FROM cte
UNION                     -- eliminate duplicates
SELECT grp, a_sno FROM t  -- add first rows
ORDER  BY grp, sno;

Unlike the first solution, we don't get a last row with NULL here (caused by the correlated subquery).

Both should perform very well - especially with long chains / many branches. Result as desired:

SQL Fiddle (with added rows to demonstrate difficulty).

Undirected graph

If there are local minima that cannot be reached from the root with ascending traversal, the above solutions won't work. Consider Farhęg's solution in this case.

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • a great answer with great explanations, I especially like such these explanations which makes SO much better – void Apr 20 '15 at 03:04
  • and also the clever point with this answer is the recursive cte does not involve in a loop so there is no need to use a cycle flag and path array – void Apr 20 '15 at 03:23
  • @Farhęg: However, I just realized that I might have over-optimized with LIMIT 1. The query might miss members of the group, depending on the actual characteristics of the data ... adding a corrective ... – Erwin Brandstetter Apr 20 '15 at 03:27
  • Erwin, you solution works better than the other 2 solutions from a performance standpoint. However, only on small datasets upto 50K rows. I am applying this logic on a 2M rows table and the performance of the query is very bad. Is there a way to improve the performance? Your help is much appreciated. – bogeyman May 02 '15 at 22:54
  • @bogeyman: Which of my answers fits your scenario? I have made assumptions and provided multiple answers because the question didn't provide all the details. The cardinalities you mention should be *in the question*. For performance optimization, you need to provide full details. [Consider guidelines here.](http://stackoverflow.com/tags/postgresql-performance/info) Update your question with missing details or - if that would change the nature of the question - start a new one. Finally, if my answer works best, why is the other one still accepted? – Erwin Brandstetter May 02 '15 at 23:09
  • Erwin, Thank you. Marking your question as answer.I had selected the previous one as it was the first one that seemed to work initially. But after several tries I found the first two solutions not able to handle more than 3000 rows. The code that I am using is your "Less Simple case". – bogeyman May 03 '15 at 16:47
  • 1
    Erwin, Never mind. I tried your first solution. This works like a charm. I didn't pay much attention to your note "Unlike the first solution, we don't get a last row with NULL here ". That was the problem. Thanks for your help. – bogeyman May 03 '15 at 17:14
4

I want to say another way, it may be useful, you can do it in 2 steps:

1. take the max(sno) per each group:

  select q.sno,
         row_number() over(order by q.sno) gn
  from(
    select distinct d.a_sno sno
    from data d
    where not exists (
      select b_sno
      from data
      where b_sno=d.a_sno
      and a_sno>d.a_sno
     )
    )q

result:

sno gn
7   1
14  2
15  3

2. use a recursive cte to find all related members in groups:

with recursive cte(sno,gn,path,cycle)as(
  select q.sno,
         row_number() over(order by q.sno) gn,
         array[q.sno],false
  from(
    select distinct d.a_sno sno
    from data d
    where not exists (
      select b_sno
      from data
      where b_sno=d.a_sno
      and a_sno>d.a_sno
     )
    )q
  union all
  select d.a_sno,c.gn,
         d.a_sno || c.path,
         d.a_sno=any(c.path)
  from data d
  join cte c on d.b_sno=c.sno 
  where not cycle
 )
 select distinct gn,sno from cte
 order by gn,sno

Result:

gn  sno
1   4
1   5
1   6
1   7
2   9
2   10
2   13
2   14
3   11
3   15

here is the demo of what I did.

void
  • 7,760
  • 3
  • 25
  • 43
2

Here is a start that may give some ideas on an approach. The recursive query starts with a_sno of each record and then tries to follow the path of b_sno until it reaches the end or forms a cycle. The path is represented by an array of sno integers.

The unnest function will break the array into rows, so a sno value mapped to the path array such as:

4, {6, 5, 4}

will be transformed to a row for each value in the array:

4, 6
4, 5
4, 4

The array_agg then reverses the operation by aggregating the values back into a path, but getting rid of the duplicates and ordering.

Now each a_sno is associated with a path and the path forms the grouping. dense_rank can be used to map the grouping (cluster) to a numeric.

SELECT array_agg(DISTINCT map ORDER BY map) AS cluster
      ,sno
  FROM ( WITH RECURSIVE x(sno, path, cycle) AS (
             SELECT a_sno, ARRAY[a_sno], false FROM  data
             UNION ALL
             SELECT b_sno, path || b_sno, b_sno = ANY(path)
               FROM data, x
               WHERE a_sno = x.sno
                 AND NOT cycle
         )
         SELECT sno, unnest(path) AS map FROM x ORDER BY 1
       ) y
  GROUP BY sno
  ORDER BY 1, 2

Output:

   cluster    | sno 
--------------+-----
 {4,5,6,7}    |   4
 {4,5,6,7}    |   5
 {4,5,6,7}    |   6
 {4,5,6,7}    |   7
 {9,10,13,14} |   9
 {9,10,13,14} |  10
 {9,10,13,14} |  13
 {9,10,13,14} |  14
 {11,15}      |  11
 {11,15}      |  15
(10 rows)

Wrap it one more time for the ranking:

SELECT dense_rank() OVER(order by cluster) AS rank
      ,sno
  FROM (
    SELECT array_agg(DISTINCT map ORDER BY map) AS cluster
          ,sno
      FROM ( WITH RECURSIVE x(sno, path, cycle) AS (
                 SELECT a_sno, ARRAY[a_sno], false FROM  data
                 UNION ALL
                 SELECT b_sno, path || b_sno, b_sno = ANY(path)
                   FROM data, x
                   WHERE a_sno = x.sno
                     AND NOT cycle
             )
             SELECT sno, unnest(path) AS map FROM x ORDER BY 1
           ) y
      GROUP BY sno
      ORDER BY 1, 2
) z

Output:

 rank | sno 
------+-----
    1 |   4
    1 |   5
    1 |   6
    1 |   7
    2 |   9
    2 |  10
    2 |  13
    2 |  14
    3 |  11
    3 |  15
(10 rows)
Glenn
  • 8,932
  • 2
  • 41
  • 54