2

I have a list of organizations I'm trying to de-duplicate. Each organization may have up to three identification numbers across three different fields. The current tools at my disposal are an Oracle SQL database and SAP Data Services.

Name  |ID1     |ID2     |ID3
------|--------|--------|--------
Org1  |1       |<null>  |2        
Org2  |<null>  |1       |<null>  
Org3  |2       |<null>  |<null>    

All three of these orgs should be able to be identified as a single org.


First Approach

My first approach to this problem was to break it down to a ordered list of orgs and IDs.

Name  |ID        
------|-------
Org1  |1              
Org2  |1       
Org1  |2      
Org3  |2       

From that point, I managed to make a list of only duplicate IDs using a combination of data transforms that involved selecting just the list of possible IDs, ordering them, removing unique IDs, assigning each ID a ROW_ID (represented as a letter for the sake of this example, even though it appears redundant).

ID    |Duplicate_Group     
------|--------
1     |A              
2     |B   

But were I to join that back to my data, it wouldn't solve my problem. It just causes further duplication. I had accepted this course as my solution initially before I realised our organizations could each have multiple IDs:

Name  |ID1     |ID2     |ID3     |Duplicate_Group
------|--------|--------|--------|--------
Org1  |1       |<null>  |2       |A 
Org1  |1       |<null>  |2       |B 
Org2  |<null>  |1       |<null>  |A 
Org3  |2       |<null>  |<null>  |B   

New Approach

My next idea is to similarly assign letters to distinguish groups among my organizations, but ... loop over the data.

Name  |ID      |Duplicate_Group
------|--------|--------
Org1  |1       |A 
Org1  |2       |B 
Org2  |1       |A 
Org3  |2       |B   

First, ordered by the Name, the ID, I would check if the name or ID was the same as the previous row, then adopt that Duplicate Group as your own.

Name  |ID      |D_Grp1  |D_Grp2
------|--------|--------|--------
Org1  |1       |A       | 
Org1  |2       |B       |A
Org2  |1       |A       | 
Org3  |2       |B       | 

Notice the D_Grp2 for Org1 has changed in the second row. Now I'll consolidate the older D_Grp1 into D_Grp2 and do it again with a twist - order by the ID, then Name; then update the groups based on the previous row again.

Name  |ID      |D_Grp2  |D_Grp3
------|--------|--------|--------
Org1  |1       |A       | 
Org2  |1       |A       | 
Org1  |2       |A       | 
Org3  |2       |B       |A

Since the fourth line's ID was the same as above, but had a different D_Grp2, the fourth line would update it's D_Grp3 to match. The idea is I would loop this process of ordering by the ID and Name over and over until there are no more changes to make. I'll make some sort of column or variable act as a flag - after each loop, if there is no tick for the flag I'll assume everything is consolidated. I'll apply a distinct and whack it back in the original table.

Name  |ID1     |ID2     |ID3     |Duplicate_Group
------|--------|--------|--------|--------
Org1  |1       |<null>  |2       |A 
Org2  |<null>  |1       |<null>  |A 
Org3  |2       |<null>  |<null>  |A  

Scope-wise I have about 60 000 organizations, so looping isn't out of the question, but it grosses me out, will take a long time and seems like a failure to design a better process. I am also not quite sure if I have missed any edge case where sorting by Name and ID might never consolidate records.


Inconclusion

So StackOverflow, is there a better way to identify the duplicates in this table? I'm acceptable to any answers including SQL. Please understand that SAP Data Services's logic is basically SQL, but I'm not directly writing SQL of my own - so I cannot feasibly provide an SQL version of my process so far.

  • 1
    I think your new approach would fail for OrgA=1=2, OrgB=3=4, OrgC=2=4, as you'd end up with OrgA=1, OrgB=3, and OrgC=2, while all the three represent the same company because 1=2=4=3. – Thorsten Kettner Apr 26 '17 at 08:48
  • @ThorstenKettner, you are right, my *new approach* would get stuck alternating OrgC between two groups each iteration without resolving. – Walledhouse Apr 26 '17 at 22:35

2 Answers2

1

Do a self-JOIN over these three fields.

SELECT 
   o.Name

   match12.ID2
   match12.Name

   match23.ID3
   match23.Name

   match31.ID1
   match31.Name
FROM organisations AS o
INNER JOIN organisations AS match12 ON o.ID1 = match12.ID2 
INNER JOIN organisations AS match23 ON o.ID2 = match23.ID3 
INNER JOIN organisations AS match31 ON o.ID3 = match31.ID1 

This should give you a list of all organisations which have matches over different ID columns.

Note that self-JOINs can be quite expensive, so with a large dataset this query may take a while.

Philipp
  • 67,764
  • 9
  • 118
  • 153
  • Hi Phillipp, your answer was indeed an answer to identifying duplicates in the table, but I am actually after identifying *groups* of duplicates. **So if Bob is John, and John is Susan, Bob is also Susan.** I have edited the question title to better reflect what I am after. Also thanks for answering promptly. – Walledhouse Apr 26 '17 at 05:59
1

This definitly is an iterative process and thus asks for a recursive query.

One must be looking for chains. If IDs 1 = 2 and 2 = 3 and 4 = 5 and 5 = 6 and 2 = 5, then all these IDs mean the same company.

Here is my algorithm:

  1. Find all pairs 1 = 2, 2 = 1, 2 = 3, ...
  2. Find all chains (For ID 1: 1=2=3, 1=2=5, for ID 2: 2=1, 2=3, 2=5 for ID 3: 3=2=1, 3=2=5, ...) and give all the IDs in a chain the invoking ID as a chain number. (So ID 3 has chain number 1 from the first chain 2 from the fourth, and 3 from the sixth and seventh.)
  3. Build groups by associating each ID with its smallest chain number. (So it's chain number 1 for ID 3; ID 3 belongs to group "chain=1".) All IDs with the same smallest chain number represent the same company.

Here is the query:

with pairs as
(
  select id1 as id, id2 as other from mytable where id1 <> id2
  union all
  select id1 as id, id3 as other from mytable where id1 <> id3
  union all
  select id2 as id, id1 as other from mytable where id2 <> id1
  union all
  select id2 as id, id3 as other from mytable where id2 <> id3
  union all
  select id3 as id, id1 as other from mytable where id3 <> id1
  union all
  select id3 as id, id2 as other from mytable where id3 <> id2
)
, chains(chain, id) as
(
  select id as chain, id from pairs
  union all
  select c.chain, p.other as id
  from chains c
  join pairs p on p.id = c.id
)
cycle chain, id set cycle to 1 default 0
, groups as
(
  select id, min(chain) as grp 
  from chains 
  group by id
)
select distinct g.grp, m.*
from groups g
join mytable m on g.id in (m.id1, m.id2, m.id3)
order by g.grp, m.name;

This query can be very slow when you have a lot of IDs meaning the same company (i.e. a lot of chains to evaluate). If there are only few such occurrences, which seems more likely, then the query will be very quick. Just try :-)

Thorsten Kettner
  • 89,309
  • 7
  • 49
  • 73