I have a database where several hundred records have been duplicated. However the duplicated information is not the same across all fields. For any two lines, the first line will contain information in some fields while the duplicate line's fields are blank; but then for other fields, the duplicate (second) line will contain information while the first line's fields are blank. For example, it looks like this:
ID Deleted Reference Name Case_Date Outcome Outcome_Date
100 False A123 Chris 2000-01-01 Yes
101 False A123 Chris 2000-03-31
The ID column is a unique primary key for the record. The Reference column is the one by which I can identify the duplicates. However as you can see, the first record (100) contains information in Case_Date and Outcome, but the second record (101) contains an Outcome_Date.
What I want to do is to copy the most amount of information into just one of each pair of records, and then mark the duplicate as deleted (I use a soft-delete, not actually removing records from the table but just flagging the Duplicate column as True). With the above example, I want it to look like this:
ID Deleted Reference Name Case_Date Outcome Outcome_Date
100 False A123 Chris 2000-01-01 Yes 2000-03-31
101 True A123 Chris (2000-01-01)* (Yes)* 2000-03-31
- Technically it will not be necessary to also copy information into the blank fields of the record which will be marked as deleted, but I figure it's easier to just copy everything and then mark the "second" record as the duplicate, rather than trying to work out which one contains more information and which one contains less.
I am also aware that it will be easier to run a separate SQL command for each column than to try to do them all at once. The columns shown above are a simplified example, and the information which may or may not be present across each column differs.
My select query for the record set of duplicates is:
SELECT *
FROM [Cohorts]
WHERE [Deleted] = False
AND ([CaseType] = "Female" OR [CaseType] = "Family")
AND [Reference] Is Not Null
And [Reference] In (SELECT [Reference] FROM [Cohorts] As Tmp
WHERE [Deleted] = False
AND ([CaseType] = "Female" OR [CaseType]="Family")
GROUP BY [Reference]
HAVING Count(*) > 1)
ORDER BY [Reference];
This will return all (Female/Family) records in the table [Cohorts] where there exists more than one record with the same Reference (and where the records have not been marked as deleted).
I'm running my queries from VBA via ADO, so can execute UPDATE statements. My database is an Access-compatible .mdb using the JET engine.
Grateful if anyone could suggest a suitable SQL command which I can run per column in order to populate the NULL fields with the values of the non-NULL fields from the relevant duplicate records. It's a bit beyond my SQL understanding at present! Thanks.