
What would be the best way to implement approximate Disjoint Sets using SQL?

Details

I have a table of edges, stored as a two-column table of [vertex_a, vertex_b].

I need a table of disjoint sets, stored as [vertex, set_id], with one row per vertex labeling it with its set's id.
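For concreteness, the two shapes look roughly like this (illustrative DDL only; the table names edges and vertex_sets are mine, not fixed):

-- input: one row per undirected edge
create table edges (
    vertex_a int not null,
    vertex_b int not null
);

-- desired output: one row per vertex, tagged with its disjoint set
create table vertex_sets (
    vertex int primary key,
    set_id int not null
);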

Constraints

  • Must be a pure SQL implementation. It can leverage Postgres-specific functions, but pure ANSI SQL is highly preferred.
  • The result can be approximate: it's acceptable to label a few sets as disjoint when they're actually connected. It's even better if the approximation bounds can be adjusted, for example by increasing iterations.
  • Libraries are out (no Boost, NumPy, SciPy). Must be SQL.
  • Most sets will contain 1 to 3 vertices. Very few large sets, with an expected maximum of 10 vertices.

supyo

3 Answers

I'm actually working on the same problem. Unfortunately, I don't think a very efficient solution can be found, at least not easily in pure SQL. Remove the 'distinct' and the self-eliminating exists query below to see how large the intermediate working sets become. That said, the following solution works.

drop table if exists foobar;
drop function if exists addset(v int[], a int);
/* our edge table: one row per [src, dst] pair */
create table foobar (
   src int,
   dst int
);

/* Create a small function to treat an array like a set, 
   not strictly necessary but convenient */
create function addset(v int[], a int) returns int[]
as $$
begin
    return (select array_agg(distinct e order by e)
                   from (select unnest(v || a) as e) f);
end
$$ language plpgsql;
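
/* quick sanity check of the helper (my addition, not in the original
   answer): this should return {1,2,3} */
select addset(array[1, 3], 2);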

/* fill our table with edges; note that each pair is ordered (src < dst) */
insert into foobar (src, dst) 
     values (1,2), (1,3), (2,3),  (3,4), (4,5), (6,7), (6,8);
/* use a recursive query to extend the sets */
with recursive foo_union (v) as (
    select array[src, dst] as v from foobar /* starter sets */
    union all
    /* extend each set by one reachable vertex per step; a separate
       'starter' CTE could be used, but it isn't necessary here */
    select addset(v, dst) as v from foo_union u, foobar f
        where src = any(v) and not dst = any(v)
 ) select distinct v from foo_union a where not exists (
     /* eliminate the many overlapping results */
     select * from foo_union b where b.v @> a.v and b.v != a.v
 );
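
Since addset keeps each array sorted (and the sample pairs are inserted as (smaller, larger)), the first element of each maximal array can serve as a set_id. A small follow-on query (my addition, not part of the original answer) reshapes the result into the [vertex, set_id] form the question asks for:

with recursive foo_union (v) as (
    select array[src, dst] from foobar
    union all
    select addset(v, dst) from foo_union u, foobar f
        where src = any(v) and not dst = any(v)
) select distinct unnest(a.v) as vertex, a.v[1] as set_id
  from foo_union a
  where not exists (
      select * from foo_union b where b.v @> a.v and b.v != a.v
  );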

But again, this is completely impractical on larger data sets; any other solution would require iteration.
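
Given that the question explicitly allows an approximation tuned by iteration count, an alternative that avoids this blow-up is iterative label propagation: start each vertex with its own id as its label, then repeatedly replace every label with the minimum label among the vertex and its neighbors. After k passes, only vertices more than k hops from their component's minimum vertex can still be mislabeled, so k directly bounds the approximation. A minimal sketch against the foobar table above (the labels table name is my own):

/* one label per vertex, initialized to the vertex id itself */
drop table if exists labels;
create table labels as
    select v as vertex, v as set_id
    from (select src as v from foobar
          union
          select dst from foobar) vs;

/* one propagation pass: pull in the smallest label visible across
   a single edge; run this statement k times (or until it updates
   zero rows, which gives the exact components) */
update labels l
   set set_id = n.min_label
  from (select me.vertex, min(other.set_id) as min_label
          from labels me
          join foobar e on me.vertex in (e.src, e.dst)
          join labels other on other.vertex in (e.src, e.dst)
         group by me.vertex) n
 where n.vertex = l.vertex
   and n.min_label < l.set_id;

/* final result in the requested [vertex, set_id] shape */
select vertex, set_id from labels order by vertex;

Three passes settle the sample data ({1,2,3,4,5} on label 1, {6,7,8} on label 6); for the expected component size of at most 10 vertices, nine passes guarantee the exact answer.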

Bart

This pure SQL code grouped about 35,000 records in 5 minutes (8 cores / 32 GB RAM). Enjoy.

--table with RELATIONS, idea is to place every related item in a bucket
create table RELATIONS
(
    bucket int,        -- initially 0
    bucketsub int,    -- initially 0
    relnr1 float,    
    relnr2 float    -- relnr1 = a, relnr2 = b means a and b are related
)

create table ##BucketRelnrs ( relnr float ); --table functions as temp list
declare @bucket int; 
declare @bucketsub int;
declare @nrOfUpdates int;
declare @relnr float;

set @bucket=0;
set @relnr=0;
WHILE @relnr>=0 and @bucket<50000 --to prevent the while loop from hanging.
BEGIN
    set @bucket = @bucket+1
    set @bucketsub=1;

    set @relnr = (select isnull(min(relnr1),-1) from RELATIONS where bucket=0); --pick the smallest relnr that has not been assigned a bucket yet
    set @nrOfUpdates = (select count(*) from RELATIONS where bucket=0 and (relnr1=@relnr or relnr2=@relnr));
    update RELATIONS set bucket=@bucket, bucketsub=@bucketsub where bucket=0 and (relnr1=@relnr or relnr2=@relnr);
    set @bucketsub = @bucketsub+1;

    WHILE @nrOfUpdates>0 and @bucketsub<=10    --to prevent the inner while loop from hanging, actually determines the number of iterations
    BEGIN
        --refill temp list with newly found related relnrs
        truncate table ##BucketRelnrs;
        insert into ##BucketRelnrs
        select distinct relnr1 from RELATIONS where bucket=@bucket
        union select distinct relnr2 from RELATIONS where bucket=@bucket;

        --calculate the number of relations that will be updated next, if zero then stop iteration
        set @nrOfUpdates =
        (
            select count(*) from RELATIONS where bucket=0
            and (relnr1 in (select relnr from ##BucketRelnrs) or relnr2 in (select relnr from ##BucketRelnrs))
        );

        --update the RELATIONS table
        update RELATIONS set bucket=@bucket, bucketsub=@bucketsub where bucket=0
        and (relnr1 in (select relnr from ##BucketRelnrs) or relnr2 in (select relnr from ##BucketRelnrs));

        set @bucketsub = @bucketsub+1;
    END
END

drop table ##BucketRelnrs; --clean temp table
  • This doesn't work on Postgres though (and is most definitely not "ANSI SQL"). –  Mar 17 '16 at 09:55
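
For Postgres readers, a rough PL/pgSQL translation of the same bucket-sweep idea might look like the sketch below. This is my own adaptation, not the answerer's code: it drops the bucketsub bookkeeping for brevity and assumes the RELATIONS table above (unquoted, so folded to relations in Postgres):

do $$
declare
    cur_bucket int := 0;
    seed float;
    touched int;
begin
    loop
        -- seed the next bucket with the smallest unassigned relnr
        select min(relnr1) into seed from relations where bucket = 0;
        exit when seed is null;
        cur_bucket := cur_bucket + 1;
        update relations
           set bucket = cur_bucket
         where bucket = 0 and seed in (relnr1, relnr2);

        -- sweep outward at most 10 times, as in the T-SQL version
        for i in 1 .. 10 loop
            update relations r
               set bucket = cur_bucket
             where r.bucket = 0
               and exists (select 1
                             from relations b
                            where b.bucket = cur_bucket
                              and (b.relnr1 in (r.relnr1, r.relnr2)
                                or b.relnr2 in (r.relnr1, r.relnr2)));
            get diagnostics touched = row_count;
            exit when touched = 0;
        end loop;
    end loop;
end
$$;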

If you need to maintain it incrementally, I would add a third int column called height to your vertices table, treat set_id as a parent pointer rather than a connected-component label, and add foreign key constraints plus a trigger on your relations table that performs a parent-pointer insert.

Then, to see connected components you can have a recursive view.
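
A minimal sketch of such a view, assuming a vertices(vertex, set_id, height) table in which set_id is the parent pointer and a root points at itself (the names are illustrative, not from the answer):

create view components as
with recursive walk (vertex, root) as (
    -- base case: roots, i.e. vertices whose parent pointer is themselves
    select vertex, vertex from vertices where set_id = vertex
    union all
    -- follow parent pointers outward from each root
    select v.vertex, w.root
      from vertices v
      join walk w on v.set_id = w.vertex
     where v.set_id <> v.vertex
)
select vertex, root as set_id from walk;

Selecting from components then yields exactly the [vertex, set_id] labeling the question asks for.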

If you already have a big table and need an efficient batch process, the main problem is that the standard union-find approach is not cache-oblivious and has very poor cache behaviour; that kind of random-access pointer chasing is exactly what SQL is designed to discourage.

saolof