Problem description
I have the following tables:
Business
(id uniqueidentifier, other columns)Address
(id uniqueidentifier, businessId uniqueidentifier, type nvarchar, validFrom datetime, validTo datetime, other columns)AddressComponent
(id uniqueidentifier, addressId uniqueidentifier, value nvarchar, type nvarchar, other columns)
An address is unique if there are no other addresses which are currently valid attached to the same business with the same address type, and and the same number of components with the same value
s and type
s.
In production we have managed to get several duplicate entries over the years. I'm fairly confident that it's possible to find all the duplicate values (and thereby remove them) with a tsql-query, but the exact details of the solution keep eluding me.
What I've tried so far
I've found a lot of great answers that come close (such as the examples found here Finding duplicate values in a SQL table ). What I've managed to do myself is along the lines of:
select
b.id, ac.Text, ac.type, COUNT(ac.value) as amount, count(b.id) as numBiz
from
Businesse b
inner join
Address adr on adr.businessId = b.id
inner join
AddressComponent ac on ac.addressId = adr.id
where
adr.validTo > GETDATE()
group by
ac.text, ac.type, adr.type, b.id
having
COUNT(ac.text) > 1
It comes close, but what I struggle with is how to make it look for duplicated combinations of components instead of duplicated components. A business may for an instance have several addresses with the same postal code that I don't care about (but which my query finds). But I want to find duplicated combinations of components attached to different addresses attached to the same business.
sample data
- Business(id uniqueidentifier, other fields)
1 'asdf'
2 'qwer'
3 'zxcv'
- Address(id uniqueidentifier, businessId uniqueidentifier, type nvarchar, validFrom datetime, validTo datetime, other fields)
1 1 'electronic' 2016-01-01 2099-12-31 'hjkl'
2 1 'electronic' 2016-01-01 2099-12-31 'qwer'
3 1 'postal' 2016-01-01 2099-12-31 'xcvb'
4 1 'visiting' 2016-01-01 2099-12-31 'qwr'
5 1 'visiting' 2016-01-01 2099-12-31 'qwr'
6 2 'electronic' 2016-01-01 2099-12-31 'zxcv'
7 3 'electronic' 2016-01-01 2099-12-31 'fghj'
- AddressComponent(id uniqueidentifier, addressId uniqueidentifier, value nvarchar, type nvarchar, other fields)
1 1 'example@example.com' 'email' 'asdf'
2 2 'example@example.com' 'email' 'qwer'
3 3 'road number 1' 'street' 'sdfg'
4 3 '1234' 'postal code' 'dgfh'
5 3 'place' 'postal area' 'cbvn'
6 4 'road number 2' 'street' 'sdfg'
7 4 '1234' 'postal code' 'dgfh'
8 4 'place' 'postal area' 'cbvn'
9 6 'example@example.com' 'email' 'xcvb'
10 7 'another-email@example.tld' 'email' 'dsvv'
11 5 'road number 2' 'street' 'sdfg'
12 5 '1234' 'postal code' 'dgfh'
13 5 'place' 'postal area' 'cbvn'
sample output
number of duplicated addresses: 4
The duplicates:
businessId duplicatedAddressIds
1 1,2
1 4,5