SQL Algorithm for grouping all "equivalent" strings

Question

I have an input table with two columns, each holding a string representing an id_number, and these columns are called id1 and id2. The two id_numbers that appear as a pair in any given row are defined as being equivalent to each other. If one of those id_numbers also appears in another row, then then all the strings in both rows are equivalent to eachother, etc. The goal is to return a table with two columns, one containing all unique id_numbers and another identifying their grouping by equivalency.

Sample Data: `

create table input_table (
id1 varchar(100),
id2 varchar(100)
)

insert into input_table(id1,id2)
values
('a','b'),
('b','c'),
('d','a'),
('a','b'),
('f','g'),
('f','k'),
('l','m')

Expected Output:

| Id | Grouping |
| a  | 1        |
| b  | 1        |
| c  | 1        |
| d  | 1        |
| f  | 2        |
| g  | 2        |
| k  | 2        |
| l  | 3        |
| m  | 3        |

To further explain the results:

Row 1 tells us a=b so they are assigned to group 1
Row 2 tells us b=c, since b is already in group 1, c is also assigned to group 1
Row 3 tells us d=a, since a is already in group 1, d is also assigned to group 1
Row 4 tells us a=b, which we already know so we don't need to do anything
Row 5 tells us f=g, since neither are in an existing group, we assign them to group 2
etc.

``SQL is not a language for implementing _algorithms_ because SQL is incapable of expressing a sequence of operations: it is capable of only describing a _single_ tabular query. Just sayin'`` — Dai, Feb 10 '23 at 00:22
_"The two id_numbers that appear as a pair in any given row are defined as being equivalent to each other."_ - what does that mean, exactly? (And using passive-voice does not help intelligiblity here...) — Dai, Feb 10 '23 at 00:24
1) Its extremely confusing that you talk extensively about id *numbers* which are in fact strings, not numbers. 2) Your logic doesn't make sense (to me at least) - I have no idea why g is ranked 2 for example, or l is ranked 3. — Dale K, Feb 10 '23 at 00:24
To answer most of the above: id_numbers is a misnomer and i should have said id_strings...these are strings that represent a real world entity. When i say equivalent, i mean they represent the same real world entity, so even though the strings are different, they represent the same thing. I am trying to "group" strings together by the real world entity they represent. — Nick Calenti, Feb 10 '23 at 00:30
To further explain the results: Row 1 tells us a=b so they are assigned to group 1 Row 2 tells us b=c, since b is already in group 1, c is also assigned to group 1 Row 3 tells us d=a, since a is already in group 1, d is also assigned to group 1 Row 4 tells us a=b, which we already know so we dont need to do anything Row 5 tells us f=g, since neither are in an existing group, we assign them to group 2 etc. — Nick Calenti, Feb 10 '23 at 00:33
Rather than adding comments [edit] your question to provide clarification. — Dale K, Feb 10 '23 at 00:36
You should read a book on database design. What you have now is a very poor design, and as such is going to be a nightmare to do anything with at all. — Ken White, Feb 10 '23 at 00:55
This can help: https://stackoverflow.com/q/23092901/11695049 — tinazmu, Feb 10 '23 at 01:06
I wrote an answer to a similar question a while ago: [How to find all connected subgraphs of an undirected graph](https://stackoverflow.com/q/35254260/4116017) — Vladimir Baranov, Mar 16 '23 at 06:13

Dai · Answer 1 · 2023-02-10T15:12:47.273

Your table describes a nondirected graph, where the set (distinct values) of all Id1 and Id2 values over all tuples _("rows") represent nodes and the tuples themselves represent the node's edges (as they connect - or link - these Id values in a relationship) - which means we can apply graph-theory techniques to solve your problem, which can be stated as finding disconnected components and assigning them an identifier) within your graph (as each disconnected component represents each set of connected nodes).

_{...with that formalism out of the way, I'll say I don't think it's necessarily correct to use the word "equivalent" or "equivalence": just because a binary relation (each of your tuples) is transitive (i.e. nondirected node edges are a transitive relation) it says nothing about what the nodes themselves represent, so I'm guessing you meant to say the Id1/Id2 values represent things-that-are-equivalent in your problem-domain, which is fair, but in the context of my answer I'll refrain from using that term (not least because "set equivalence" is something else entirely)}

ANYWAY...

Step 1: Normalize your bad data:

Assuming that those Id1/Id2 values are comparable values (as in: every value can be sorted consistently and deterministically by a less-than comparison), such as strings or integers, then do that first so we can generate a normalized representation of your data.

This normalized representation of your data would be the same as your current data, except that:

Every row's Id1 value must be < (i.e. less-than) its Id2 value.
There are no duplicate rows.

So if there's a row where Id2 < Id1 you should swap the values, so ( 'd', 'a' ) becomes ( 'a', 'd' ), and the duplicate ('a', 'b') row would be removed.

This also means that every (disconnected) component in the graph we're representing now has a minimum node which we can treat as the (arbitrary) "root" of that component - which means we will have a target to look for from any node (but we don't know which values are minimums in each component yet, hang on...)

To aid identification, let's call the smallest value in a pair the smol value and the big value.

(You can do this step as another CTE step, but for clarity I'm using a table-variable @normalized, like so):

DECLARE @horribleInputData TABLE (
    Id1 char(1) NOT NULL,
    Id2 char(1) NOT NULL
);

INSERT INTO @horribleInputData ( Id1, Id2 ) VALUES
( 'a', 'b' ),
( 'b', 'c' ),
( 'd', 'a' ),
( 'a', 'b' ),
( 'f', 'g' ),
( 'f', 'k' ),
( 'l', 'm' );

--

DECLARE @normalized TABLE (
    Smol char(1) NOT NULL,
    Big  char(1) NOT NULL,

    INDEX IDX1 ( Smol ),
    INDEX IDX2 ( Big  ),

    CHECK ( Smol < Big )
);

INSERT INTO @normalized ( Smol, Big )
SELECT
    DISTINCT -- Exclude duplicate rows.
    CASE WHEN Id1 < Id2 THEN Id1 ELSE Id2 END AS Smol, -- Sort the values horizontally, not vertically.
    CASE WHEN Id1 < Id2 THEN Id2 ELSE Id1 END AS Big
FROM
    @horribleInputData
WHERE
    Id1 <> Id2;  -- Exclude zero-information rows.

After running the above, the @normalized table looks like this:

Smol	Big
a	b
a	d
b	c
f	g
f	k
l	m

Step 2: Designate the minimum node as root in each disconnected component:

If we treat the node with the minimum-value as the "root" in a directed graph (or in this case, the minimum node in each disconnected component) then we can find connected nodes by finding a path from each node to that root. But what else defines a root? In this case, it would be a node with no "outgoing" references from Smol to Big - so we can simply take @normalized and do an anti-join to itself (i.e. for every Smol, see if that Smol is a Big to any other Smol - if none exist then that Smol is the Smollest - and is therefore a root:

SELECT
    DISTINCT
    l.Smol AS Smollest
FROM
    @normalized AS l
    LEFT OUTER JOIN @normalized AS r ON l.Smol = r.Big
WHERE
    r.Big IS NULL;

Which gives us this output:

Smollest
a
f
l

...which tells us that there's 3 disconnected-components in this graph, and what those minimum nodes are.

Step 3: the srsbsns part

Which means we can now try to trace a route from each node to a root using a recursive CTE - which is a technique in SQL to traverse hierarchical directed graphs (either from top-to-bottom, or bottom-to-top), so it's a good thing we converted the data into a directed graph first.

For example:

Like so:


-- This `smolRoots` CTE is the same as the query from Step 2 above:
WITH smolRoots AS (
    
    SELECT
        DISTINCT
        l.Smol AS Smollest
    FROM
        -- This is an anti-join (a LEFT OUTER JOIN combined with a IS NULL predicate):
        @normalized AS l
        LEFT OUTER JOIN @normalized AS r ON l.Smol = r.Big
    WHERE
        r.Big IS NULL


),

-- Generate a simple flat list of every Id1/Id2 value:
everyNode AS (

    SELECT DISTINCT Smol AS Id FROM @normalized
    UNION
    SELECT DISTINCT Big  AS Id FROM @normalized
),

-- Now do the tree-walk, it's like a nature-walk but more math-y:
recursiveCte AS (
    
    -- Each root (Smol) value:
    SELECT
        CONVERT( char(1), NULL ) AS Smol,
        s.Smol AS Big,
        s.Smol AS SmolRoot,
        0      AS Depth
    FROM
        smolRoots AS s

    UNION ALL

    -- Then recurisvely UNION ALL (concatenate) all other rows that can be connected by their Smol-to-Big values:
    SELECT
        f.Smol,
        f.Big,
        r.SmolRoot,
        r.Depth + 1 AS Depth
        
    FROM
        @normalized AS f
        INNER JOIN recursiveCte AS r ON f.Smol = r.Big
)

The above recursiveCte, if evalated as-is, will return this output:

Smol	Big	SmolRoot	Depth
NULL	a	a	0
NULL	f	f	0
NULL	l	l	0
l	m	l	1
f	g	f	1
f	k	f	1
a	b	a	1
a	d	a	1
b	c	a	2

Notice how for each row, each Smol and Big value is mapped to one of the 3 SmolRoot values identified in Step 2.

...so I hope you can now start to see how this works. But we're not done yet...

Step 3.2: I lied about Step 3 being Step 3, it's really Step 3.1:

We still need to now then query the recursiveCte's results to convert those SmolRoot values into unique identifiers for each disconnected component (i.e. each set) - so let's generate a new 1-based int value for each distinct value in SmolRoot - we can do this with ROW_NUMBER() but you could also use GENERATE_SERIES - and I'm sure other techniques exist to do this too:

WITH smolRoots AS ( /* same as above */ ),
everyNode AS ( /* same as above */ ),
recursiveCte AS ( /* same as above */ ),

numberForEachRoot AS ( -- Generate distinct numbers (from 1, 2, ...) for each SmolRoot number, i.e. a number for each disconnected-component or disjoint graph:
    
    SELECT
        SmolRoot,
        ROW_NUMBER() OVER ( ORDER BY SmolRoot ) AS DisconnectedComponentNumber
    FROM
        recursiveCte AS r
    GROUP BY
        r.SmolRoot

And numberForEachRoot looks like this:

SmolRoot	DisconnectedComponentNumber
a	1
f	2
l	3

Step 5: There is no Step 4

Now ignore all of the above and start over with that everyNode CTE that was tucked-away in Step 2: Take the everyNode CTE and JOIN it to recursiveCte and numberForEachRoot to get the actual output you're after:

WITH
smolRoots         AS ( /* same as above */ ),
everyNode         AS ( /* same as above */ ),
recursiveCte      AS ( /* same as above */ ),
numberForEachRoot AS ( /* same as above */ )
SELECT
    e.Id,
    n.DisconnectedComponentNumber
FROM
    everyNode AS e
    INNER JOIN recursiveCte AS r ON e.Id = r.Big
    INNER JOIN numberForEachRoot AS n ON r.SmolRoot = n.SmolRoot
ORDER BY
    e.Id;

Which gives us...

Id	DisconnectedComponentNumber
a	1
b	1
d	1
c	1
f	2
g	2
k	2
l	3
m	3

This technique also works on graphs with cycles, though you might get duplicate output rows in that case, but adjusting the above query to filter those out is a trivial exercise left to the reader.

_{Also it seems I spent too much time on this answer, that means the Vyvanse is working tonight. And yup, it's just gone 4am where I am right now, good job, me. Now where's my Ambien gone?}

Step ∞: Just give me the solution I can copy-and-paste into my CS homework and/or Nissan car firmware

Alright, you asked for it...


DECLARE @horribleInputData TABLE (
    Id1 char(1) NOT NULL,
    Id2 char(1) NOT NULL
);

INSERT INTO @horribleInputData ( Id1, Id2 ) VALUES
('a','b'),
('b','c'),
('d','a'),
('a','b'),
('f','g'),
('f','k'),
('l','m'),
-- Also adding this to show that it works for cycles in graphs too:
( '0', '1' ),
( '1', '2' ),
( '2', '3' ),
( '3', '0' );

-------------

-- 1. Normalize to a table form that's easier to work with: given that transitivity exists, enforce `Id1 < Id2` so we can have a "direction" to look in.
-- This can be a CTE step instead of a TABLE too, if you dare. I'm curious what the execution plan would be in that case.

DECLARE @normalized TABLE (
    Smol char(1) NOT NULL,
    Big  char(1) NOT NULL,

    INDEX IDX1 ( Smol ),
    INDEX IDX2 ( Big  ),

    CHECK ( Smol < Big )
);

INSERT INTO @normalized ( Smol, Big )
SELECT
    DISTINCT
    CASE WHEN Id1 < Id2 THEN Id1 ELSE Id2 END AS Smol,
    CASE WHEN Id1 < Id2 THEN Id2 ELSE Id1 END AS Big
FROM
    @horribleInputData
WHERE
    Id1 <> Id2  -- Exclude zero-information rows.
ORDER BY
    Smol, -- Make it easier to read, interactively.
    Big;

/*
Smol    Big
----------
a   b
a   d
b   c
f   g
f   k
l   m
*/

-- Also, just gonna bury this in here and see what happens:
DECLARE @superImportantPart nvarchar(300) = CONVERT( nvarchar(300), 0x4800450059002000450056004500520059004F004E00450020004900200043004F0050005900200041004E004400200050004100530054004500200043004F00440045002000460052004F004D00200053005400410043004B004F0056004500520046004C004F005700200057004900540048004F0055005400200055004E004400450052005300540041004E00440049004E00470020005700480041005400200049005400200044004F004500530020004F005200200048004F005700200049005400200057004F0052004B00530020004F005400480045005200570049005300450020004900200057004F0055004C004400200048004100560045002000520045004D004F005600450044002000540048004900530020005000520049004E0054002000530054004100540045004D0045004E005400 );
RAISERROR( @superImportantPart, /*severity:*/ 0, /*state:*/ 1 ) WITH NOWAIT;

-- Then trace a route from every Big to its smallest connected Smol.
-- Each Big sharing the same Smol is in the same connected graph, the set of distinct Smol nodes identifies each output set.


-- 1. Get all roots first: these will be `Smol` nodes that never appear in `Big`.
WITH smolRoots AS (
    
    SELECT
        DISTINCT
        l.Smol AS Smollest
    FROM
        @normalized AS l
        LEFT OUTER JOIN @normalized AS r ON l.Smol = r.Big
    WHERE
        r.Big IS NULL

/*
Smollest
-----
a
f
l
*/
),
everyNode AS (

    SELECT DISTINCT Smol AS Id FROM @normalized
    UNION
    SELECT DISTINCT Big AS Id FROM @normalized
),
-- The tree-walk:
recursiveCte AS (
    
    -- Each root (Smol) value:
    SELECT
        CONVERT( char(1), NULL ) AS Smol,
        s.Smollest AS Big,
        s.Smollest AS SmolRoot,
        0          AS Depth
    FROM
        smolRoots AS s

    UNION ALL

    -- Then the magic happens...
    SELECT
        n.Smol,
        n.Big,
        r.SmolRoot,
        r.Depth + 1 AS Depth
        
    FROM
        @normalized AS n
        INNER JOIN recursiveCte AS r ON n.Smol = r.Big

/*
Smol    Big SmolRoot    Depth
-----------------------------
NULL    a   a   0
NULL    f   f   0
NULL    l   l   0
l   m   l   1
f   g   f   1
f   k   f   1
a   b   a   1
a   d   a   1
b   c   a   2
*/
),
numberForEachRoot AS ( -- Generate distinct numbers (from 1, 2, ...) for each SmolRoot number, i.e. a number for each disconnected-component or disjoint graph:
    
    SELECT
        SmolRoot,
        ROW_NUMBER() OVER ( ORDER BY SmolRoot ) AS DisconnectedComponentNumber
    FROM
        recursiveCte AS r
    GROUP BY
        r.SmolRoot

/*
SmolRoot    DisconnectedComponentNumber
-------------------
a   1
f   2
l   3
*/
)
-- Then ignore all of the above and start with `everyNode` and JOIN it to `recursiveCte` and `numberForEachRoot`:

SELECT
    e.Id,
    n.DisconnectedComponentNumber
FROM
    everyNode AS e
    INNER JOIN recursiveCte AS r ON e.Id = r.Big
    INNER JOIN numberForEachRoot AS n ON r.SmolRoot = n.SmolRoot
ORDER BY
    e.Id;

/*
Id  DisconnectedComponentNumber
a   1
f   2
l   3
m   3
g   2
k   2
b   1
d   1
c   1
*/

Very thorough explanation and nice touch with the `@superImportantPart` ;-) , but I'm afraid, your query doesn't always return correct result. I wrote an answer to a similar problem a while ago: [How to find all connected subgraphs of an undirected graph](https://stackoverflow.com/q/35254260/4116017). Try the sample data #2 in my answer with your query. With this input: `('b', 'f'), ('b', 'j'), ('d', 'f');` Your final query produces two DisconnectedComponentNumbers, while there should be 1, since all the nodes are connected. — Vladimir Baranov, Mar 16 '23 at 06:20
Another example: `('a', 'z'), ('a', 'g'), ('z', 'h'), ('l', 'h') `. It should be one group. I think the problem comes from your approach to find "roots" - `it would be a node with no "outgoing" references from Smol to Big` - it seems that it is not that simple in general. — Vladimir Baranov, Mar 16 '23 at 06:37
@VladimirBaranov Yes, you are correct. I'm inclined to delete this answer as I don't want to give bad information out. From what I can tell, my approach only works in the case when nodes' labels are increasing with edge-distance from the root - and extending my approach to handle all cases ends-up with basically the same as your answer's code, except without `FOR XML PATH` now that we have `STRING_AGG` :) — Dai, Mar 16 '23 at 21:52
It is up to you, but you've written a very nice explanation with illustrations. Maybe just add a note detailing in which cases this approach works and in which doesn't. I think it is important to show the limitations of a method, especially if it is not evident. — Vladimir Baranov, Mar 16 '23 at 23:00