I have a list of organizations I'm trying to de-duplicate. Each organization may have up to three identification numbers across three different fields. The current tools at my disposal are an Oracle SQL database and SAP Data Services.
Name |ID1 |ID2 |ID3
------|--------|--------|--------
Org1 |1 |<null> |2
Org2 |<null> |1 |<null>
Org3 |2 |<null> |<null>
All three of these orgs should be able to be identified as a single org.
First Approach
My first approach to this problem was to break it down to a ordered list of orgs and IDs.
Name |ID
------|-------
Org1 |1
Org2 |1
Org1 |2
Org3 |2
From that point, I managed to make a list of only duplicate IDs using a combination of data transforms that involved selecting just the list of possible IDs, ordering them, removing unique IDs, assigning each ID a ROW_ID (represented as a letter for the sake of this example, even though it appears redundant).
ID |Duplicate_Group
------|--------
1 |A
2 |B
But were I to join that back to my data, it wouldn't solve my problem. It just causes further duplication. I had accepted this course as my solution initially before I realised our organizations could each have multiple IDs:
Name |ID1 |ID2 |ID3 |Duplicate_Group
------|--------|--------|--------|--------
Org1 |1 |<null> |2 |A
Org1 |1 |<null> |2 |B
Org2 |<null> |1 |<null> |A
Org3 |2 |<null> |<null> |B
New Approach
My next idea is to similarly assign letters to distinguish groups among my organizations, but ... loop over the data.
Name |ID |Duplicate_Group
------|--------|--------
Org1 |1 |A
Org1 |2 |B
Org2 |1 |A
Org3 |2 |B
First, ordered by the Name, the ID, I would check if the name or ID was the same as the previous row, then adopt that Duplicate Group as your own.
Name |ID |D_Grp1 |D_Grp2
------|--------|--------|--------
Org1 |1 |A |
Org1 |2 |B |A
Org2 |1 |A |
Org3 |2 |B |
Notice the D_Grp2 for Org1 has changed in the second row. Now I'll consolidate the older D_Grp1 into D_Grp2 and do it again with a twist - order by the ID, then Name; then update the groups based on the previous row again.
Name |ID |D_Grp2 |D_Grp3
------|--------|--------|--------
Org1 |1 |A |
Org2 |1 |A |
Org1 |2 |A |
Org3 |2 |B |A
Since the fourth line's ID was the same as above, but had a different D_Grp2, the fourth line would update it's D_Grp3 to match. The idea is I would loop this process of ordering by the ID and Name over and over until there are no more changes to make. I'll make some sort of column or variable act as a flag - after each loop, if there is no tick for the flag I'll assume everything is consolidated. I'll apply a distinct and whack it back in the original table.
Name |ID1 |ID2 |ID3 |Duplicate_Group
------|--------|--------|--------|--------
Org1 |1 |<null> |2 |A
Org2 |<null> |1 |<null> |A
Org3 |2 |<null> |<null> |A
Scope-wise I have about 60 000 organizations, so looping isn't out of the question, but it grosses me out, will take a long time and seems like a failure to design a better process. I am also not quite sure if I have missed any edge case where sorting by Name and ID might never consolidate records.
Inconclusion
So StackOverflow, is there a better way to identify the duplicates in this table? I'm acceptable to any answers including SQL. Please understand that SAP Data Services's logic is basically SQL, but I'm not directly writing SQL of my own - so I cannot feasibly provide an SQL version of my process so far.