Remove duplicate rows when multiple fields collectively provide unique ID

Question

I have a table for an historical census database in which four fields collectively identify each individual. The four fields represent different levels of governmental areas, making the table look like this:

KOMMNR  KRETSNR BOSTNR  PERSNR  FORNVN      ETTNVN
[mncpl] [area]  [rsdnc] [prsn]  [firstn]    [lastn] ← english
0101    001     0001    001     John        Doe
0101    001     0001    002     Richard     Doe
0101    001     0001    003     Johnny      Doe
0101    001     0002    001     Jane        Doe

As can be seen, individuals are identified by combining the four numbers (short text format) for municipality, area, residence and person. Personal ID numbers have been added to other databases, but are currently pending for implementation in this one.

The database I am currently working with holds roughly 900k posts, and I have ended up with ~12k duplicates, which I need to remove. Example from table [T3 3 clean]:

KOMMNR  KRETSNR BOSTNR  PERSNR  FORNVN          ETTNVN
[mncpl] [area]  [rsdnc] [prsn]  [firstn]        [lastn] ← english
0101    001a    0003    5       Ole Christian   Elingsen
0101    001a    0003    5       Ole Christian   Elingsen
1101    001a    0003    6       Kristian        Johannesen
1101    001a    0003    6       Kristian        Johannesen
2101    001a    0004    14      Jens Tøger      Jensen
2101    001a    0004    14      Jens Tøger      Jensen

I have read and reviewed the answer provided here, but couldn’t get my head around how to apply it for the structure I am working with, for two reasons: Our database combines multiple fields to create a quasi unique ID for each individual; and being that there is no single incremental number, I believe I could not apply the method provided in that excellent answer.

In summary

What I want to accomplish is to remove every duplicate. I am working locally in MS Access 2013.

Update: As the data is publicly available, hosted on our Oracle server, it is important that what I am doing locally is reproducible by others later. I do not, however, believe it would matter that the solution also works directly on Oracle, as anyone wanting to reproduce my findings would download the data and work locally as well.

Note

I currently don’t know where to start writing the code to select only one of each duplicate, thereafter deleting them, so I cannot provide a sample. I have tried running SELECT DISTINCT *, but I end up with Access not responding; apparently there’s too much data to handle such a query, even when running it on the reduced sample set of just 12k.

Here is how I selected my duplicates:

SELECT
KOMMNR, KRETSNR, BOSTNR, PERSNR,
NYHUSH, FORNVN, ETTNVN,
BOSTAT, SEDVBO, ANTOPP, BYGNING,
KJONN, FAMST, SIVST, YRKE,
FAAR, FSTED, FSTED_KODE,
STATSB, TROSSMF, SYKDOM, SYKVAR,
BOSTNVN, FORNVNS, ETTNVNS,
PID

FROM [T3 3 FAAR rensket]

WHERE (
    (([T3 3 FAAR rensket].KOMMNR) In (
        SELECT KOMMNR FROM [T3 3 FAAR rensket] As Tmp
      GROUP BY KOMMNR, KRETSNR, BOSTNR, PERSNR HAVING Count(*)>1
           and KRETSNR = [T3 3 FAAR rensket].KRETSNR
           and BOSTNR  = [T3 3 FAAR rensket].BOSTNR
           and PERSNR  = [T3 3 FAAR rensket].PERSNR
        )
    )
)

ORDER BY KOMMNR, KRETSNR, BOSTNR, PERSNR;

I removed the Oracle tag because your question explicitly mentions MS Access. — Gordon Linoff, Nov 08 '16 at 11:22
Gordon made the educated guess that you are looking for a VBA/MS Access solution, and not an Oracle SQL solution - correct? If that is in fact incorrect, and you are looking for an Oracle answer (unlikely but possible), you can edit your post again, add the `oracle` tag back, and state explicitly in your post that you are also interested in an Oracle answer. — , Nov 08 '16 at 11:25
You will need an actual primary key (a Autonumber column will be best). The database needs a way to identify the one record to delete (or to keep), and your combined unique id isn't actually unique. — Andre, Nov 08 '16 at 11:26
As you noted, your table doesn't have a separate id you can use to adapt the solution you saw elsewhere. Well, that's not necessarily so. In Oracle you have something called ROWID which you could use for that purpose. Or you can use analytic functions to create such unique identifiers on the fly, and use them for your purposes. — , Nov 08 '16 at 11:28
A general note towards the ongoing Oracle vs. Access confusion that your questions generate: The SQL dialects and features of Oracle and Access are different enough that you shouldn't ask for a solution that works on both. Concentrate on the DB engine you currently use, and if necessary ask a new question for the other system later (unless the solution can be easily adapted). — Andre, Nov 08 '16 at 11:32
Couldn't you create a copy of the table structure, turn the first four fields to a Primary Key and append the data to the new table? Duplicates will be removed due to key constraints. — Darren Bartrup-Cook, Nov 08 '16 at 11:43
As commented by several of you, the information about Oracle was irrelevant to the question; I have updated the question. — Canned Man, Nov 08 '16 at 12:28
@DarrenBartrup-Cook That could quite possibly be the easiest and fastest solution to the problem. — Canned Man, Nov 08 '16 at 12:28

score 0 · Answer 1 · answered Nov 08 '16 at 11:26

This is essentially your query:

SELECT r.*
FROM [T3 3 FAAR rensket] as r
WHERE r.KOMMNR In (SELECT KOMMNR
                   FROM [T3 3 FAAR rensket] As r2
                   GROUP BY KOMMNR, KRETSNR, BOSTNR, PERSNR
                   HAVING Count(*) > 1 AND
                          r2.KRETSNR = r.KRETSNR AND
                          r2.BOSTNR  = r.BOSTNR AND
                          r2.PERSNR  = r.PERSNR
                 )
ORDER BY KOMMNR, KRETSNR, BOSTNR, PERSNR;

First advice: Moving the correlation clauses to the WHERE clause:

SELECT r.*
FROM [T3 3 FAAR rensket] as r
WHERE r.KOMMNR In (SELECT r2.KOMMNR
                   FROM [T3 3 FAAR rensket] As r2
                   WHERE r2.KRETSNR = r.KRETSNR AND
                         r2.BOSTNR  = r.BOSTNR AND
                         r2.PERSNR  = r.PERSNR
                   GROUP BY KOMMNR, KRETSNR, BOSTNR, PERSNR
                   HAVING Count(*) > 1
                 )
ORDER BY KOMMNR, KRETSNR, BOSTNR, PERSNR;

Second, add an index on [T3 3 FAAR rensket](KRETSNR, BOSTNR, PERSNR, KOMMNR).

See if these help performance.

I’ve tried this, but it simply takes too long. It might just be that I end up copying the entire thing to Excel, sort it out there (VBA?) and then reintroduce it to the DB. Downer, though … I thought that there could be an SQL solution for the problem. — Canned Man, Nov 09 '16 at 12:03

Canned Man · Accepted Answer · 2016-11-09T15:00:31.520

As suggested by @DarrenBartrup-Cook in the comments, the easiest way was to do the following:

(Create a copy of the structure of the table)[http://www.techrepublic.com/article/copy-an-existing-table-structure-into-a-new-access-database/];
make the four keys KOMMNR, KRETSNR, BOSTNR and PERSNR primary keys (open the structure table in design view, select the four rows for these keys, and then click the primary key button);
and then create an append query to insert the data into the new table.

After performing these steps, I got, as hoped for, 5926 less posts than the original table, ominously reported through an error message.

It would seem that the reason I got duplicate entries – despite having created the table with numerous appends – was that I did not have any actual primary keys. Without the primary keys, Access had no way of knowing whether two fields were duplicates of each other

I ran a count to confirm the result (provided courtesy of Karl-Erlend Mikalsen (quite similar to the solution suggested by @GordonLinoff)):

SELECT x.KOMMNR,
       x.KRETSNR,
       x.BOSTNR,
       x.PERSNR,
       x.antall
FROM ( SELECT A.KOMMNR,
              A.KRETSNR,
              A.BOSTNR,
              A.PERSNR,
              COUNT (*) AS antall
         FROM [T3 3 FAAR rensket] A
     GROUP BY A.KOMMNR,
              A.KRETSNR,
              A.BOSTNR,
              A.PERSNR) AS X
WHERE x.antall > 1;

(Note: This requires a dummy table named ‘x’ to work.

As expected (and dearly hoped for) this yielded 0 posts, whereas the same query but with the final line changed to WHERE x.antall = 1; yielded the expected 891724 posts (versus 894262 posts in the original original table, which is as expected).

Remove duplicate rows when multiple fields collectively provide unique ID

In summary

Note

2 Answers2