I have a table of identities
(i.e. aliases) for an arbitrary number of people. Each row has a previous name and a new name. In production, there are about 1 M rows. For example:
id, old, new
---
1, 'Albert', 'Bob'
2, 'Bob', 'Charles'
3, 'Mary', 'Nancy'
4, 'Charles', 'Albert'
5, 'Lydia', 'Nancy'
6, 'Zoe', 'Zoe'
What I want is to generate the list of users
and reference all of their respective identities. This is analogous to finding all of the nodes in each graph of connected identities, or finding the spanning forest:
User 1: Albert, Bob, Charles (identities: 1,2,4)
User 2: Mary, Nancy, Lydia (identities: 3,5)
User 3: Zoe (identities: 6)
I've been tinkering with PostgreSQL's WITH RECURSIVE
, but it yields both sets and subsets for each. For example:
1,2,4 <-- spanning tree: good
2 <-- subset: discard
3,5 <-- spanning tree: good
4 <-- subset: discard
5 <-- subset: discard
6 <-- spanning tree: good
What do I need to do to only produce the full sets of identities (i.e. the spanning tree) for each user?
SQLFiddle: http://sqlfiddle.com/#!15/9eaed/4 with my latest attempt. Here's the code:
WITH RECURSIVE search_graph AS (
SELECT id
, id AS min_id
, ARRAY[id] AS path
, ARRAY[old,new] AS emails
FROM identities
UNION
SELECT identities.id
, LEAST(identities.id, sg.min_id)
, (sg.path || identities.id)
, (sg.emails || identities.old || identities.new)
FROM search_graph sg
JOIN identities ON (identities.old = ANY(sg.emails) OR identities.new = ANY(sg.emails))
WHERE identities.id <> ALL(sg.path)
)
SELECT array_agg(DISTINCT(p)) from search_graph, unnest(path) p GROUP BY min_id;
And the results:
1,2,4
2
3,5
4
5
6