I have a table for an historical census database in which four fields collectively identify each individual. The four fields represent different levels of governmental areas, making the table look like this:
KOMMNR KRETSNR BOSTNR PERSNR FORNVN ETTNVN
[mncpl] [area] [rsdnc] [prsn] [firstn] [lastn] ← english
0101 001 0001 001 John Doe
0101 001 0001 002 Richard Doe
0101 001 0001 003 Johnny Doe
0101 001 0002 001 Jane Doe
As can be seen, individuals are identified by combining the four numbers (short text format) for municipality, area, residence and person. Personal ID numbers have been added to other databases, but are currently pending for implementation in this one.
The database I am currently working with holds roughly 900k posts, and I have ended up with ~12k duplicates, which I need to remove. Example from table [T3 3 clean]:
KOMMNR KRETSNR BOSTNR PERSNR FORNVN ETTNVN
[mncpl] [area] [rsdnc] [prsn] [firstn] [lastn] ← english
0101 001a 0003 5 Ole Christian Elingsen
0101 001a 0003 5 Ole Christian Elingsen
1101 001a 0003 6 Kristian Johannesen
1101 001a 0003 6 Kristian Johannesen
2101 001a 0004 14 Jens Tøger Jensen
2101 001a 0004 14 Jens Tøger Jensen
I have read and reviewed the answer provided here, but couldn’t get my head around how to apply it for the structure I am working with, for two reasons: Our database combines multiple fields to create a quasi unique ID for each individual; and being that there is no single incremental number, I believe I could not apply the method provided in that excellent answer.
In summary
What I want to accomplish is to remove every duplicate. I am working locally in MS Access 2013.
Update: As the data is publicly available, hosted on our Oracle server, it is important that what I am doing locally is reproducible by others later. I do not, however, believe it would matter that the solution also works directly on Oracle, as anyone wanting to reproduce my findings would download the data and work locally as well.
Note
I currently don’t know where to start writing the code to select only one of each duplicate, thereafter deleting them, so I cannot provide a sample. I have tried running SELECT DISTINCT *
, but I end up with Access not responding; apparently there’s too much data to handle such a query, even when running it on the reduced sample set of just 12k.
Here is how I selected my duplicates:
SELECT
KOMMNR, KRETSNR, BOSTNR, PERSNR,
NYHUSH, FORNVN, ETTNVN,
BOSTAT, SEDVBO, ANTOPP, BYGNING,
KJONN, FAMST, SIVST, YRKE,
FAAR, FSTED, FSTED_KODE,
STATSB, TROSSMF, SYKDOM, SYKVAR,
BOSTNVN, FORNVNS, ETTNVNS,
PID
FROM [T3 3 FAAR rensket]
WHERE (
(([T3 3 FAAR rensket].KOMMNR) In (
SELECT KOMMNR FROM [T3 3 FAAR rensket] As Tmp
GROUP BY KOMMNR, KRETSNR, BOSTNR, PERSNR HAVING Count(*)>1
and KRETSNR = [T3 3 FAAR rensket].KRETSNR
and BOSTNR = [T3 3 FAAR rensket].BOSTNR
and PERSNR = [T3 3 FAAR rensket].PERSNR
)
)
)
ORDER BY KOMMNR, KRETSNR, BOSTNR, PERSNR;