-1

I have a database where several hundred records have been duplicated. However the duplicated information is not the same across all fields. For any two lines, the first line will contain information in some fields while the duplicate line's fields are blank; but then for other fields, the duplicate (second) line will contain information while the first line's fields are blank. For example, it looks like this:

ID    Deleted    Reference    Name    Case_Date    Outcome    Outcome_Date
100   False      A123         Chris   2000-01-01   Yes
101   False      A123         Chris                           2000-03-31

The ID column is a unique primary key for the record. The Reference column is the one by which I can identify the duplicates. However as you can see, the first record (100) contains information in Case_Date and Outcome, but the second record (101) contains an Outcome_Date.

What I want to do is to copy the most amount of information into just one of each pair of records, and then mark the duplicate as deleted (I use a soft-delete, not actually removing records from the table but just flagging the Duplicate column as True). With the above example, I want it to look like this:

ID    Deleted    Reference    Name    Case_Date      Outcome    Outcome_Date
100   False      A123         Chris   2000-01-01     Yes        2000-03-31
101   True       A123         Chris   (2000-01-01)*  (Yes)*     2000-03-31
  • Technically it will not be necessary to also copy information into the blank fields of the record which will be marked as deleted, but I figure it's easier to just copy everything and then mark the "second" record as the duplicate, rather than trying to work out which one contains more information and which one contains less.

I am also aware that it will be easier to run a separate SQL command for each column than to try to do them all at once. The columns shown above are a simplified example, and the information which may or may not be present across each column differs.

My select query for the record set of duplicates is:

SELECT *
FROM [Cohorts]
WHERE [Deleted] = False
 AND ([CaseType] = "Female" OR [CaseType] = "Family")
 AND [Reference] Is Not Null
 And [Reference] In (SELECT [Reference] FROM [Cohorts] As Tmp
                     WHERE [Deleted] = False
                       AND ([CaseType] = "Female" OR [CaseType]="Family")
                     GROUP BY [Reference]
                     HAVING Count(*) > 1)
ORDER BY [Reference];

This will return all (Female/Family) records in the table [Cohorts] where there exists more than one record with the same Reference (and where the records have not been marked as deleted).

I'm running my queries from VBA via ADO, so can execute UPDATE statements. My database is an Access-compatible .mdb using the JET engine.

Grateful if anyone could suggest a suitable SQL command which I can run per column in order to populate the NULL fields with the values of the non-NULL fields from the relevant duplicate records. It's a bit beyond my SQL understanding at present! Thanks.

jarlh
  • 42,561
  • 8
  • 45
  • 63
Chris Melville
  • 1,476
  • 1
  • 14
  • 30
  • Haven't added as an answer as it's not a complete answer - using your example table minus the ID & Deleted fields this will return a single record containing all the information from the other 5 fields: `SELECT DISTINCT * FROM (SELECT DISTINCT NZ(T1.Reference, T2.Reference) AS F1, NZ(T1.sName,T2.sName) AS F2, NZ(T1.Case_Date,T2.Case_Date) AS F3, NZ(T1.Outcome,T2.Outcome) AS F4, NZ(T1.Outcome_Date,T2.Outcome_Date) AS F5 FROM Cohorts T1 INNER JOIN Cohorts T2 ON T1.Reference = T2.Reference) WHERE NOT F1 IS NULL AND NOT F2 IS NULL AND NOT F3 IS NULL AND NOT F4 IS NULL AND NOT F5 IS NULL` – Darren Bartrup-Cook Jun 21 '16 at 12:35

1 Answers1

0

My first UPDATE JOIN ever, hope it works (untested):

update t1
  set t1.name = coalesce(t1.name, t2.name),
      t1.Case_Date = coalesce(t1.Case_Date, t2.Case_Date),
      t1.Outcome = coalesce(t1.Outcome, t2.Outcome),
      t1.Outcome_Date = coalesce(t1.Outcome_Date, t2.Outcome_Date),
      t1.deleted = case when t1.id < t2.id then FALSE else TRUE end
from Cohorts t1
  join Cohorts t2 on t1.Reference = t2.Reference and t1.id <> t2.id 

Edit: Alternative solution:

Create a copy table, do insert select:

insert into CohortsCopy (Deleted, Reference, Name, Case_Date, Outcome, Outcome_Date)
select case when t1.id < t2.id or t2.id is null then FALSE else TRUE end,
       coalesce(t1.Reference, t2.Reference),
       coalesce(t1.name, t2.name),
       coalesce(t1.Case_Date, t2.Case_Date),
       coalesce(t1.Outcome, t2.Outcome),
       coalesce(t1.Outcome_Date, t2.Outcome_Date)
from Cohorts t1
  left join Cohorts t2 on t1.Reference = t2.Reference and t1.id <> t2.id

Then either rename, or copy back to original table.

jarlh
  • 42,561
  • 8
  • 45
  • 63
  • Thanks, but there's a syntax error in this. You have UPDATE t1 SET (various) FROM. I think you need a WHERE, not a FROM - although I'm not sure how it should be. Can you (or anyone else) take another look? :) – Chris Melville Jun 21 '16 at 13:37
  • @ChrisMelville, see http://stackoverflow.com/questions/982919/sql-update-query-using-joins#982947, similar isn't it? – jarlh Jun 21 '16 at 13:48
  • I have been doing some further poking around, and I think your solution might be correct - however it is not working for me because I am not using MySQL Server, I am using an Access-compatible .mdb file upon which I'm using ADO to execute SQL commands. Just found an answer here (http://stackoverflow.com/questions/27977627/using-an-adodb-record-set-to-perform-a-joined-update-query) which says ADO doesn't allow it. Unless you can shed any more light? – Chris Melville Jun 21 '16 at 14:07