1

Loooking for a way to retrieve community from a large dataset I came across an article about the algorithm which seems to be apropriate for large datasets. Anyway the data is stored two tables: users (nodes) and connections and I would like to retrieve the communities by pure sql queries without help of custom applications (I'm using SQL Server 2008).

The algorithm to retrieve the cliques is the following:

Read the graph G
Generate set neighbors(v) for every vertex of G
for each vertex v of G
call recursive_find_cliques(v, neighbors(v))
end for

Function recursive_find_cliques(x, n)
for each vertex t ∈ n by ascending order calculate set sigma
if sigma is not empty
extend x with t
call recursive_find_cliques(x, sigma)
end if
end for

where sigma is the set of vertices that could constitute triangles with v and its neighbors.

I already created a stored procedure which returns a table of neighbors of selected node but so far I haven't delat with sql functions and advanced queries so the question is the following:

Does anyone know how to rewrite the algorithm above in sql in order to get the set of cliques? As the question might be a little abstract, I may point out that the main problem is to create a recursive function (recursive_find_cliques(x, n)) which takes a table (n) as an argument).

Thank you!

EDIT:

Here is sthe stored procedure created so far:

CREATE PROCEDURE [dbo].[Peamc_Test]
AS
BEGIN

SET XACT_ABORT ON
BEGIN TRAN

SET NOCOUNT ON;

CREATE TABLE #Users
(
UserId int NOT NULL,
userLabel varchar(50) PRIMARY KEY NOT NULL,
Observed bit NOT NULL
)

CREATE TABLE #Neighbors
(
UserId int NOT NULL,
userLabel varchar(50) NOT NULL PRIMARY KEY,
Retrieved bit NOT NULL
)

CREATE TABLE #ConnectedVertices
(
UserId int NOT NULL,
userLabel varchar(50) NOT NULL PRIMARY KEY,
)

CREATE TABLE #Cliques
(
CliqueId int NOT NULL,
UserId varchar(50) NOT NULL,
)

DECLARE @UsersCount int
DECLARE @ii int
DECLARE @User varchar(50)
DECLARE @NeighborsCount int

INSERT INTO #Users(UserId, userLabel, Observed) SELECT user_id, userLabel, 0 FROM dbo.test_users WHERE user_id IS NOT NULL
SELECT @UsersCount = COUNT(*) FROM #Users
SELECT @ii = 1
WHILE @ii <= @UsersCount
BEGIN
--select user
SELECT TOP 1 @User = userLabel FROM #Users WHERE Observed = 0 ORDER BY UserId

UPDATE #Users SET Observed = 1 WHERE userLabel = @User

--Get user's neighbors
DELETE FROM #Neighbors
INSERT INTO #Neighbors(UserId, userLabel, Retrieved)
SELECT u.user_id, t2.neighbor, 0 FROM ( SELECT CALLING_NEIGHBORS.neighbor FROM ( SELECT mc.calling_party AS neighbor FROM monthly_connections_test mc WHERE mc.called_party = @User) AS CALLING_NEIGHBORS INNER JOIN (SELECT mc.called_party AS neighbor FROM monthly_connections_test mc WHERE mc.calling_party = @User) AS CALLED_NEIGHBORS ON CALLING_NEIGHBORS.neighbor = CALLED_NEIGHBORS.neighbor) AS t2 INNER JOIN test_users u ON t2.neighbor = u.userLabel
SELECT @NeighborsCount = COUNT(*) FROM #Neighbors
SELECT @ii = @ii + 1
--HERE the function recursive_find_cliques has to search for cliques and insert the found ones in #cliques

END

SELECT * FROM #Cliques

END

It does'not return anything yet as it is not finished. It though retrieves all neighbors for the currently selected nodes and the next step is to implement recursive_find_cliques function.

Niko Gamulin
  • 66,025
  • 95
  • 221
  • 286
  • 1
    Please post the code you have written so far. People generally do not like to just write your code for you. – Mitch Wheat Nov 10 '10 at 11:06
  • What DBMS are you using? SQL defines a syntax for recursive queries which could be useful here; PostgreSQL, Firebird, SQL Server, and i believe DB2 support it, Oracle has an alternative synax, and MySQL and other non-serious databases just don't. – Tom Anderson Nov 10 '10 at 12:15

3 Answers3

1

I realised that my first answer only works when each clique has at least one user who is not referred to by any others in that clique. In other words, closed cliques like A-B, B-C, C-A will not be found.

Here is a solution which solves this. Again we have users with IDs, now 1..20. There are several cases of neighbouring relations that need to be handled:

alt text

Compared to the simple case, it is harder to find a unique starter for each clique. We achieve this with a little sleight of hand:

  • Reorder the neighbours so that for all references A-B, A is less than B, ignoring any A=B.

  • From these, remove any A-X references if there are any X-A, which could cause a loop. This will never remove references to A completely because X-A remains and A-X will be added in the recursion.

The resultant set are the 'starting' users and we use them to prime the CTE:

-- Get all pairs, where UserA < UserB, dropping any A=B or B=A
WITH LRNeighbours(A, B) AS (
    SELECT
        Neighbours.UserA, Neighbours.UserB
    FROM
        Neighbours
    WHERE
        Neighbours.UserA < Neighbours.UserB
UNION ALL
    SELECT DISTINCT
        Neighbours.UserB, Neighbours.UserA
    FROM
        Neighbours
    WHERE
        Neighbours.UserA > Neighbours.UserB
),
-- Isolate those that are not referred to by a higher numbered key
Starters(userid) AS (
    SELECT DISTINCT
        A
    FROM    
        LRNeighbours
    WHERE 
        A NOT IN (
            SELECT 
                B
            FROM
                LRNeighbours
        )
),
-- The recursive Common Table Expression
cliques(userid, clique) AS (
    -- Number starters 1..N
    SELECT  
        userid, ROW_NUMBER() OVER(ORDER BY userid) AS clique
    FROM
        Starters
UNION ALL
    -- Recurse, adding users referred by siblings, avoiding starters themselves
    SELECT  
        B, clique 
    FROM
        LRNeighbours INNER JOIN 
        cliques ON
            LRNeighbours.A = cliques.userid 
            AND B NOT IN (
                SELECT
                    userid
                FROM
                    starters
            )
)
SELECT DISTINCT
    clique, userid 
FROM 
    cliques 
ORDER BY 
    clique, userid 

Results:

1   1
1   2
2   3
2   4
3   5
3   6
3   7
3   8
4   9
4   10
4   11
4   12
4   13
5   14
5   15
5   16
5   17
5   18
5   19
5   20
smirkingman
  • 6,167
  • 4
  • 34
  • 47
  • Thank you smirkingman for the comprehensive explanation! The suggested solution is very helpful to learn about the recursion logic in graphs. The cliques however were considered to be a fully connected subgraphs so I have to modify the code above. However as already mentioned it is helpful indeed so thank you again! – Niko Gamulin Nov 11 '10 at 21:52
  • Glad to have been of assistance. As pointed out above, you will get better feedback if you accept answers (with the green tick), rather than just marking +1. – smirkingman Nov 12 '10 at 13:24
0
CREATE TABLE [dbo].[Users](
    [UserID] [int] IDENTITY(1,1) NOT NULL,
    [UserName] [varchar](50) NOT NULL
) ON [PRIMARY]
CREATE TABLE [dbo].[Neighbours](
    [UserA] [int] NOT NULL,
    [UserB] [int] NOT NULL
) ON [PRIMARY]

Users populated with 1..8 and Neighbours

UserA   UserB
1   2
2   3
4   5
4   6
5   7
7   8

Then:

WITH cliques(userid, clique) AS (
    SELECT  
        userid, ROW_NUMBER() OVER(ORDER BY userid) AS clique
    FROM
        Users
    WHERE
        users.UserID NOT IN (
            SELECT 
                UserB
            FROM
                Neighbours
        )
UNION ALL
    SELECT  
        Neighbours.UserB, clique 
    FROM
        neighbours 
        INNER JOIN cliques
            ON Neighbours.UserA = cliques.userid
)
SELECT
    clique, cliques.userid 
FROM 
    cliques
ORDER BY 
    clique, userid 

Result:

clique  userid
1   1
1   2
1   3
2   4
2   5
2   6
2   7
2   8

See : Recursive Queries Using Common Table Expressions

smirkingman
  • 6,167
  • 4
  • 34
  • 47
0

I've added a two LABELS and two GOTO statements

CREATE PROCEDURE [dbo].[Peamc_Test] 
AS 
BEGIN 

SET XACT_ABORT ON 
BEGIN TRAN 

SET NOCOUNT ON; 

CREATE TABLE #Users 
( 
UserId int NOT NULL, 
userLabel varchar(50) PRIMARY KEY NOT NULL, 
Observed bit NOT NULL 
) 

CREATE TABLE #Neighbors 
( 
UserId int NOT NULL, 
userLabel varchar(50) NOT NULL PRIMARY KEY, 
Retrieved bit NOT NULL 
) 

CREATE TABLE #ConnectedVertices 
( 
UserId int NOT NULL, 
userLabel varchar(50) NOT NULL PRIMARY KEY, 
) 

CREATE TABLE #Cliques 
( 
CliqueId int NOT NULL, 
UserId varchar(50) NOT NULL, 
) 

DECLARE @UsersCount int 
DECLARE @ii int 
DECLARE @User varchar(50) 
DECLARE @NeighborsCount int 

INSERT INTO #Users(UserId, userLabel, Observed) SELECT user_id, userLabel, 0 FROM dbo.test_users WHERE user_id IS NOT NULL 
SELECT @UsersCount = COUNT(*) FROM #Users 
SELECT @ii = 1 
WHILE @ii <= @UsersCount 
BEGIN 
--select user 
SELECT TOP 1 @User = userLabel FROM #Users WHERE Observed = 0 ORDER BY UserId 

UPDATE #Users SET Observed = 1 WHERE userLabel = @User 

--Get user's neighbors 
DELETE FROM #Neighbors 
INSERT INTO #Neighbors(UserId, userLabel, Retrieved) 
SELECT u.user_id, t2.neighbor, 0 FROM ( SELECT CALLING_NEIGHBORS.neighbor FROM ( SELECT mc.calling_party AS neighbor FROM monthly_connections_test mc WHERE mc.called_party = @User) AS CALLING_NEIGHBORS INNER JOIN (SELECT mc.called_party AS neighbor FROM monthly_connections_test mc WHERE mc.calling_party = @User) AS CALLED_NEIGHBORS ON CALLING_NEIGHBORS.neighbor = CALLED_NEIGHBORS.neighbor) AS t2 INNER JOIN test_users u ON t2.neighbor = u.userLabel 
SELECT @NeighborsCount = COUNT(*) FROM #Neighbors 
SELECT @ii = @ii + 1 
GOTO Clique_Find
--HERE the function recursive_find_cliques has to search for cliques and insert the found ones in #cliques 
--------------------
Clique_Return:
--------------------

END 

SELECT * FROM #Cliques 

END 

--------------------
Clique_Find:
--------------------
-- Code goes here
-- Code goes here
-- Code goes here
-- Code goes here
-- Code goes here
-- Code goes here
GOTO Clique_Return