Your table describes a nondirected graph, where the set (distinct values) of all Id1
and Id2
values over all tuples ("rows") represent nodes and the tuples themselves represent the node's edges (as they connect - or link - these Id values in a relationship) - which means we can apply graph-theory techniques to solve your problem, which can be stated as finding disconnected components and assigning them an identifier) within your graph (as each disconnected component represents each set of connected nodes).
...with that formalism out of the way, I'll say I don't think it's necessarily correct to use the word "equivalent" or "equivalence": just because a binary relation (each of your tuples) is transitive (i.e. nondirected node edges are a transitive relation) it says nothing about what the nodes themselves represent, so I'm guessing you meant to say the Id1
/Id2
values represent things-that-are-equivalent in your problem-domain, which is fair, but in the context of my answer I'll refrain from using that term (not least because "set equivalence" is something else entirely)

ANYWAY...
Step 1: Normalize your bad data:
Assuming that those Id1
/Id2
values are comparable values (as in: every value can be sorted consistently and deterministically by a less-than comparison), such as strings or integers, then do that first so we can generate a normalized representation of your data.
This normalized representation of your data would be the same as your current data, except that:
- Every row's
Id1
value must be <
(i.e. less-than) its Id2
value.
- There are no duplicate rows.
So if there's a row where Id2 < Id1
you should swap the values, so ( 'd', 'a' )
becomes ( 'a', 'd' )
, and the duplicate ('a', 'b')
row would be removed.
This also means that every (disconnected) component in the graph we're representing now has a minimum node which we can treat as the (arbitrary) "root" of that component - which means we will have a target to look for from any node (but we don't know which values are minimums in each component yet, hang on...)
To aid identification, let's call the smallest value in a pair the smol value and the big value.

(You can do this step as another CTE step, but for clarity I'm using a
table-variable @normalized
, like so):
DECLARE @horribleInputData TABLE (
Id1 char(1) NOT NULL,
Id2 char(1) NOT NULL
);
INSERT INTO @horribleInputData ( Id1, Id2 ) VALUES
( 'a', 'b' ),
( 'b', 'c' ),
( 'd', 'a' ),
( 'a', 'b' ),
( 'f', 'g' ),
( 'f', 'k' ),
( 'l', 'm' );
--
DECLARE @normalized TABLE (
Smol char(1) NOT NULL,
Big char(1) NOT NULL,
INDEX IDX1 ( Smol ),
INDEX IDX2 ( Big ),
CHECK ( Smol < Big )
);
INSERT INTO @normalized ( Smol, Big )
SELECT
DISTINCT -- Exclude duplicate rows.
CASE WHEN Id1 < Id2 THEN Id1 ELSE Id2 END AS Smol, -- Sort the values horizontally, not vertically.
CASE WHEN Id1 < Id2 THEN Id2 ELSE Id1 END AS Big
FROM
@horribleInputData
WHERE
Id1 <> Id2; -- Exclude zero-information rows.
After running the above, the @normalized
table looks like this:
Smol |
Big |
a |
b |
a |
d |
b |
c |
f |
g |
f |
k |
l |
m |
Step 2: Designate the minimum node as root in each disconnected component:
If we treat the node with the minimum-value as the "root" in a directed graph (or in this case, the minimum node in each disconnected component) then we can find connected nodes by finding a path from each node to that root. But what else defines a root? In this case, it would be a node with no "outgoing" references from Smol
to Big
- so we can simply take @normalized
and do an anti-join to itself (i.e. for every Smol
, see if that Smol
is a Big
to any other Smol
- if none exist then that Smol
is the Smollest
- and is therefore a root:
SELECT
DISTINCT
l.Smol AS Smollest
FROM
@normalized AS l
LEFT OUTER JOIN @normalized AS r ON l.Smol = r.Big
WHERE
r.Big IS NULL;
Which gives us this output:
...which tells us that there's 3 disconnected-components in this graph, and what those minimum nodes are.
Step 3: the srsbsns part
Which means we can now try to trace a route from each node to a root using a recursive CTE - which is a technique in SQL to traverse
hierarchical directed graphs (either from top-to-bottom, or bottom-to-top), so it's a good thing we converted the data into a directed graph first.
For example:
Like so:
-- This `smolRoots` CTE is the same as the query from Step 2 above:
WITH smolRoots AS (
SELECT
DISTINCT
l.Smol AS Smollest
FROM
-- This is an anti-join (a LEFT OUTER JOIN combined with a IS NULL predicate):
@normalized AS l
LEFT OUTER JOIN @normalized AS r ON l.Smol = r.Big
WHERE
r.Big IS NULL
),
-- Generate a simple flat list of every Id1/Id2 value:
everyNode AS (
SELECT DISTINCT Smol AS Id FROM @normalized
UNION
SELECT DISTINCT Big AS Id FROM @normalized
),
-- Now do the tree-walk, it's like a nature-walk but more math-y:
recursiveCte AS (
-- Each root (Smol) value:
SELECT
CONVERT( char(1), NULL ) AS Smol,
s.Smol AS Big,
s.Smol AS SmolRoot,
0 AS Depth
FROM
smolRoots AS s
UNION ALL
-- Then recurisvely UNION ALL (concatenate) all other rows that can be connected by their Smol-to-Big values:
SELECT
f.Smol,
f.Big,
r.SmolRoot,
r.Depth + 1 AS Depth
FROM
@normalized AS f
INNER JOIN recursiveCte AS r ON f.Smol = r.Big
)
The above recursiveCte
, if evalated as-is, will return this output:
Smol |
Big |
SmolRoot |
Depth |
NULL |
a |
a |
0 |
NULL |
f |
f |
0 |
NULL |
l |
l |
0 |
l |
m |
l |
1 |
f |
g |
f |
1 |
f |
k |
f |
1 |
a |
b |
a |
1 |
a |
d |
a |
1 |
b |
c |
a |
2 |
Notice how for each row, each Smol
and Big
value is mapped to one of the 3 SmolRoot
values identified in Step 2.
...so I hope you can now start to see how this works. But we're not done yet...
Step 3.2: I lied about Step 3 being Step 3, it's really Step 3.1:
We still need to now then query the recursiveCte
's results to convert those SmolRoot
values into unique identifiers for each disconnected component (i.e. each set) - so let's generate a new 1-based int
value for each distinct value in SmolRoot
- we can do this with ROW_NUMBER()
but you could also use GENERATE_SERIES
- and I'm sure other techniques exist to do this too:
WITH smolRoots AS ( /* same as above */ ),
everyNode AS ( /* same as above */ ),
recursiveCte AS ( /* same as above */ ),
numberForEachRoot AS ( -- Generate distinct numbers (from 1, 2, ...) for each SmolRoot number, i.e. a number for each disconnected-component or disjoint graph:
SELECT
SmolRoot,
ROW_NUMBER() OVER ( ORDER BY SmolRoot ) AS DisconnectedComponentNumber
FROM
recursiveCte AS r
GROUP BY
r.SmolRoot
And numberForEachRoot
looks like this:
SmolRoot |
DisconnectedComponentNumber |
a |
1 |
f |
2 |
l |
3 |
Step 5: There is no Step 4
Now ignore all of the above and start over with that everyNode
CTE that was tucked-away in Step 2: Take the everyNode
CTE and JOIN
it to recursiveCte
and numberForEachRoot
to get the actual output you're after:
WITH
smolRoots AS ( /* same as above */ ),
everyNode AS ( /* same as above */ ),
recursiveCte AS ( /* same as above */ ),
numberForEachRoot AS ( /* same as above */ )
SELECT
e.Id,
n.DisconnectedComponentNumber
FROM
everyNode AS e
INNER JOIN recursiveCte AS r ON e.Id = r.Big
INNER JOIN numberForEachRoot AS n ON r.SmolRoot = n.SmolRoot
ORDER BY
e.Id;
Which gives us...
Id |
DisconnectedComponentNumber |
a |
1 |
b |
1 |
d |
1 |
c |
1 |
f |
2 |
g |
2 |
k |
2 |
l |
3 |
m |
3 |
This technique also works on graphs with cycles, though you might get duplicate output rows in that case, but adjusting the above query to filter those out is a trivial exercise left to the reader.
Also it seems I spent too much time on this answer, that means the Vyvanse is working tonight. And yup, it's just gone 4am where I am right now, good job, me. Now where's my Ambien gone?
Step ∞: Just give me the solution I can copy-and-paste into my CS homework and/or Nissan car firmware
Alright, you asked for it...
DECLARE @horribleInputData TABLE (
Id1 char(1) NOT NULL,
Id2 char(1) NOT NULL
);
INSERT INTO @horribleInputData ( Id1, Id2 ) VALUES
('a','b'),
('b','c'),
('d','a'),
('a','b'),
('f','g'),
('f','k'),
('l','m'),
-- Also adding this to show that it works for cycles in graphs too:
( '0', '1' ),
( '1', '2' ),
( '2', '3' ),
( '3', '0' );
-------------
-- 1. Normalize to a table form that's easier to work with: given that transitivity exists, enforce `Id1 < Id2` so we can have a "direction" to look in.
-- This can be a CTE step instead of a TABLE too, if you dare. I'm curious what the execution plan would be in that case.
DECLARE @normalized TABLE (
Smol char(1) NOT NULL,
Big char(1) NOT NULL,
INDEX IDX1 ( Smol ),
INDEX IDX2 ( Big ),
CHECK ( Smol < Big )
);
INSERT INTO @normalized ( Smol, Big )
SELECT
DISTINCT
CASE WHEN Id1 < Id2 THEN Id1 ELSE Id2 END AS Smol,
CASE WHEN Id1 < Id2 THEN Id2 ELSE Id1 END AS Big
FROM
@horribleInputData
WHERE
Id1 <> Id2 -- Exclude zero-information rows.
ORDER BY
Smol, -- Make it easier to read, interactively.
Big;
/*
Smol Big
----------
a b
a d
b c
f g
f k
l m
*/
-- Also, just gonna bury this in here and see what happens:
DECLARE @superImportantPart nvarchar(300) = CONVERT( nvarchar(300), 0x4800450059002000450056004500520059004F004E00450020004900200043004F0050005900200041004E004400200050004100530054004500200043004F00440045002000460052004F004D00200053005400410043004B004F0056004500520046004C004F005700200057004900540048004F0055005400200055004E004400450052005300540041004E00440049004E00470020005700480041005400200049005400200044004F004500530020004F005200200048004F005700200049005400200057004F0052004B00530020004F005400480045005200570049005300450020004900200057004F0055004C004400200048004100560045002000520045004D004F005600450044002000540048004900530020005000520049004E0054002000530054004100540045004D0045004E005400 );
RAISERROR( @superImportantPart, /*severity:*/ 0, /*state:*/ 1 ) WITH NOWAIT;
-- Then trace a route from every Big to its smallest connected Smol.
-- Each Big sharing the same Smol is in the same connected graph, the set of distinct Smol nodes identifies each output set.
-- 1. Get all roots first: these will be `Smol` nodes that never appear in `Big`.
WITH smolRoots AS (
SELECT
DISTINCT
l.Smol AS Smollest
FROM
@normalized AS l
LEFT OUTER JOIN @normalized AS r ON l.Smol = r.Big
WHERE
r.Big IS NULL
/*
Smollest
-----
a
f
l
*/
),
everyNode AS (
SELECT DISTINCT Smol AS Id FROM @normalized
UNION
SELECT DISTINCT Big AS Id FROM @normalized
),
-- The tree-walk:
recursiveCte AS (
-- Each root (Smol) value:
SELECT
CONVERT( char(1), NULL ) AS Smol,
s.Smollest AS Big,
s.Smollest AS SmolRoot,
0 AS Depth
FROM
smolRoots AS s
UNION ALL
-- Then the magic happens...
SELECT
n.Smol,
n.Big,
r.SmolRoot,
r.Depth + 1 AS Depth
FROM
@normalized AS n
INNER JOIN recursiveCte AS r ON n.Smol = r.Big
/*
Smol Big SmolRoot Depth
-----------------------------
NULL a a 0
NULL f f 0
NULL l l 0
l m l 1
f g f 1
f k f 1
a b a 1
a d a 1
b c a 2
*/
),
numberForEachRoot AS ( -- Generate distinct numbers (from 1, 2, ...) for each SmolRoot number, i.e. a number for each disconnected-component or disjoint graph:
SELECT
SmolRoot,
ROW_NUMBER() OVER ( ORDER BY SmolRoot ) AS DisconnectedComponentNumber
FROM
recursiveCte AS r
GROUP BY
r.SmolRoot
/*
SmolRoot DisconnectedComponentNumber
-------------------
a 1
f 2
l 3
*/
)
-- Then ignore all of the above and start with `everyNode` and JOIN it to `recursiveCte` and `numberForEachRoot`:
SELECT
e.Id,
n.DisconnectedComponentNumber
FROM
everyNode AS e
INNER JOIN recursiveCte AS r ON e.Id = r.Big
INNER JOIN numberForEachRoot AS n ON r.SmolRoot = n.SmolRoot
ORDER BY
e.Id;
/*
Id DisconnectedComponentNumber
a 1
f 2
l 3
m 3
g 2
k 2
b 1
d 1
c 1
*/