0

Problem description

I have the following tables:

  • Business (id uniqueidentifier, other columns)
  • Address (id uniqueidentifier, businessId uniqueidentifier, type nvarchar, validFrom datetime, validTo datetime, other columns)
  • AddressComponent (id uniqueidentifier, addressId uniqueidentifier, value nvarchar, type nvarchar, other columns)

An address is unique if there are no other addresses which are currently valid attached to the same business with the same address type, and and the same number of components with the same values and types.

In production we have managed to get several duplicate entries over the years. I'm fairly confident that it's possible to find all the duplicate values (and thereby remove them) with a tsql-query, but the exact details of the solution keep eluding me.

What I've tried so far

I've found a lot of great answers that come close (such as the examples found here Finding duplicate values in a SQL table ). What I've managed to do myself is along the lines of:

select
    b.id, ac.Text, ac.type, COUNT(ac.value) as amount, count(b.id) as numBiz
from  
    Businesse b
inner join 
    Address adr on adr.businessId = b.id
inner join 
    AddressComponent ac on ac.addressId = adr.id
where 
    adr.validTo > GETDATE()
group by
    ac.text, ac.type, adr.type, b.id
having 
    COUNT(ac.text) > 1

It comes close, but what I struggle with is how to make it look for duplicated combinations of components instead of duplicated components. A business may for an instance have several addresses with the same postal code that I don't care about (but which my query finds). But I want to find duplicated combinations of components attached to different addresses attached to the same business.


sample data

  • Business(id uniqueidentifier, other fields)
1   'asdf'
2   'qwer'
3   'zxcv'
  • Address(id uniqueidentifier, businessId uniqueidentifier, type nvarchar, validFrom datetime, validTo datetime, other fields)
1   1   'electronic'    2016-01-01  2099-12-31  'hjkl'
2   1   'electronic'    2016-01-01  2099-12-31  'qwer'
3   1   'postal'    2016-01-01  2099-12-31  'xcvb'
4   1   'visiting'  2016-01-01  2099-12-31  'qwr'
5   1   'visiting'  2016-01-01  2099-12-31  'qwr'
6   2   'electronic'    2016-01-01  2099-12-31  'zxcv'
7   3   'electronic'    2016-01-01  2099-12-31  'fghj'
  • AddressComponent(id uniqueidentifier, addressId uniqueidentifier, value nvarchar, type nvarchar, other fields)
1   1   'example@example.com'   'email' 'asdf'
2   2   'example@example.com'   'email' 'qwer'
3   3   'road number 1' 'street'    'sdfg'
4   3   '1234'  'postal code'   'dgfh'
5   3   'place' 'postal area'   'cbvn'
6   4   'road number 2' 'street'    'sdfg'
7   4   '1234'  'postal code'   'dgfh'
8   4   'place' 'postal area'   'cbvn'
9   6   'example@example.com'   'email' 'xcvb'
10  7   'another-email@example.tld' 'email' 'dsvv'
11  5   'road number 2' 'street'    'sdfg'
12  5   '1234'  'postal code'   'dgfh'
13  5   'place' 'postal area'   'cbvn'

sample output

number of duplicated addresses: 4

The duplicates:

businessId  duplicatedAddressIds 
1           1,2
1           4,5
Community
  • 1
  • 1
simonra
  • 245
  • 2
  • 8
  • 2
    some sample data may help along with expected output – ughai Nov 08 '16 at 09:31
  • `But I want to find duplicated combinations of components attached to different addresses attached to the same business.` Do you mean: **all the** component-values are the same between the two adresses?, or: the adressess share *one or more* component value(s)? – joop Nov 08 '16 at 10:38
  • @joop All the component values **and** the types are the same between the two. If it was only _one or more_ the sample output above would have read `1 3,4,5` on the second line due to the duplicated postal code/area components of those addresses. I'm sorry if it wasn't clear from the post. – simonra Nov 08 '16 at 10:44
  • The joys of EAV: you want to detect duplicate addresses, based on the values of their address-components. (EXISTS is your friend here) – joop Nov 08 '16 at 11:08

1 Answers1

0

This is the selection you want.


SELECT b.id
        , a.id AS aid , a.type AS atype
        , ac.id AS acid , ac.type AS actype
        , ac.Text AS actext
FROM  Business b
JOIN Address a ON a.businessId = b.id

JOIN AddressComponent ac ON ac.addressId = a.id
WHERE a.validTo > CURRENT_DATE
AND EXISTS ( SELECT *
    FROM Address ax
    JOIN AddressComponent acx ON acx.addressId = ax.id
    WHERE ax.businessId = a.businessId -- same business
    AND ax.validTo > CURRENT_DATE     -- same selection
    AND ax.type = a.type              -- same address type
    AND acx.type = ac.type  -- same component
    AND acx.text = ac.text  -- same value
    AND ax.id <> a.id       -- but a *different* addres!!!
    )
ORDER BY b.id, a.id, ac.id
    ;

Result:


 id | aid |   atype    | acid |   actype    |       actext        
----+-----+------------+------+-------------+---------------------
  1 |   1 | electronic |    1 | email       | example@example.com
  1 |   2 | electronic |    2 | email       | example@example.com
  1 |   4 | visiting   |    6 | street      | road number 2
  1 |   4 | visiting   |    7 | postal code | 1234
  1 |   4 | visiting   |    8 | postal area | place
  1 |   5 | visiting   |   11 | street      | road number 2
  1 |   5 | visiting   |   12 | postal code | 1234
  1 |   5 | visiting   |   13 | postal area | place
(8 rows)

The rest is a matter of aggregation (basically to lose the AddressComponents, which are 1:N related to Address)

NOTE: CURRENT_DATE and CURRENT_TIMSTAMP are supposed to be ANSI SQL. now() and getdate() are implemention specific extentions.

joop
  • 4,330
  • 1
  • 15
  • 26
  • `NOW` is MySQL AFAICT, maybe you meant `GETDATE()`? – TT. Nov 08 '16 at 11:30
  • I had to change the getdate() function to NOW(), presuming they are basically the same. CURRENT_DATE would probably be portable. – joop Nov 08 '16 at 11:38
  • SQL Server supports: `GETDATE()` or `CURRENT_TIMESTAMP` to get the current datetime. `NOW` is not a supported function by SQL Server. Some alternatives you can find [here](https://msdn.microsoft.com/en-us/library/ms188751.aspx). – TT. Nov 08 '16 at 11:41
  • It worked great, just two things; `CURRENT_DATE` didn't work for me locally, but no biggie, I just changed it to `GETDATE()`. Also it found addresses where one was a superset of the other in terms of components. But I didn't really specify what I thought should happen in that case anyways, so going to mark resolved. (I guess for deletion I will try to make a query that deletes all duplicates and keeps the one with most information in case of overlaps.) – simonra Nov 09 '16 at 07:52
  • IMO current_date and current_timestamp are ANSI standard. now() is postgres/mysql/oracle specific, and getdate() is a sybase/microsoft invention. – joop Nov 10 '16 at 10:12