6

I have a PostgreSQL database which has a table with the primary key applied to three columns. According to the database, there is an index on the key:

Indexes:
    "full_log_pkey" PRIMARY KEY, btree (server_name, line_number, log_generation)

Yet some simple tests reveal that I have duplicate keys:

select count(*) from full_log;
  count
----------
 60644405

select count(*) from 
    (select distinct server_name, 
                     line_number, 
                     log_generation 
            from     full_log) as foo;
  count
----------
 60636564

Clearly there are fewer distinct rows (based on the primary key) than there are rows. My question is, how is this possible?

Edit: The full table definition is such:

                 Table "public.full_log"
     Column     |            Type             | Modifiers
----------------+-----------------------------+-----------
 activity       | character(1)                |
 archivaldate   | timestamp without time zone |
 media_type     | character varying(5)        |
 vsn            | text                        |
 archive_set    | character varying(20)       |
 copy           | smallint                    |
 file_start     | integer                     |
 file_offset    | integer                     |
 fs_name        | character varying(20)       |
 inode          | double precision            |
 file_length    | bigint                      |
 file_type      | character(1)                |
 overflow       | integer                     |
 device_number  | integer                     |
 server_name    | text                        | not null
 path           | text                        |
 line_number    | integer                     | not null
 log_generation | integer                     | not null
Indexes:
    "full_log_pkey" PRIMARY KEY, btree (server_name, line_number, log_generation)
Foreign-key constraints:
    "full_log_server_name_fkey" FOREIGN KEY (server_name) REFERENCES servers(server_name)
Rules:
    insert_update_full_log AS
    ON INSERT TO full_log
   WHERE (EXISTS ( SELECT full_log.activity, full_log.archivaldate, full_log.media_type, full_log.vsn, full_log.archive_set, full_log.copy, full_log.file_start, full_log.file_offset, full_log.fs_name, full_log.inode, full_log.file_length, full_log.file_type, full_log.overflow, full_log.device_number, full_log.server_name, full_log.path, full_log.line_number, full_log.log_generation
           FROM full_log
          WHERE full_log.server_name = new.server_name AND full_log.line_number = new.line_number AND full_log.log_generation = new.log_generation)) DO INSTEAD  UPDATE full_log SET activity = new.activity, archivaldate = new.archivaldate, media_type = new.media_type, vsn = new.vsn, archive_set = new.archive_set, copy = new.copy, file_start = new.file_start, file_offset = new.file_offset, fs_name = new.fs_name, inode = new.inode, file_length = new.file_length, file_type = new.file_type, overflow = new.overflow, device_number = new.device_number, path = new.path
  WHERE full_log.server_name = new.server_name AND full_log.line_number = new.line_number AND full_log.log_generation = new.log_generation

For an example of duplicate rows:

 select * from full_log where line_number = 6332986;
 activity |    archivaldate     | media_type |  vsn   | archive_set | copy | file_start | file_offset | fs_name |   inode    | file_length | file_type | overflow | device_number | server_name |                                           path                                            | line_number | log_generation
----------+---------------------+------------+--------+-------------+------+------------+-------------+---------+------------+-------------+-----------+----------+---------------+-------------+-------------------------------------------------------------------------------------------+-------------+----------------
 A        | 2010-10-13 10:49:49 | ti         | Z00711 | lcbp_rel    |    1 |     226237 |      779099 | lcbp    | 21798068.3 |    31198108 | f         |        0 |          8511 | redact      | wdl/delivery/irishparis_2010_09/MSE2_Histoire des rois d'Angleterre/MSE2_239.TIF          |     6332986 |              1
 A        | 2010-10-13 10:49:49 | ti         | Z00711 | lcbp_rel    |    1 |     226237 |      779099 | lcbp    | 21798068.3 |    31198108 | f         |        0 |          8511 | redact      | wdl/delivery/irishparis_2010_09/MSE2_Histoire des rois d'Angleterre/MSE2_239.TIF          |     6332986 |              1
(2 rows)
astine
  • 276
  • 2
  • 9
  • PS. It's worth noting that I discovered this issue when attempting to alter the type of one of the table columns. This column is not part of the primary key. Postgres drops and recreates the index when this happens and it's triggering and error when I do so. – astine Apr 22 '11 at 20:27
  • 1
    Any chance of digging out some of these duplicated rows? What happens if you use a GROUP BY rather than DISTINCT? What does the full table schema look like? – mu is too short Apr 22 '11 at 20:32
  • @mu I've added the full table schema. The insert rule replaces inserts where the primary key already exists with nearly identical updates. – astine Apr 22 '11 at 20:40
  • Complete guess but is there any chance trailing spaces in strings might get treated differently in the 2 contexts? – Martin Smith Apr 22 '11 at 20:42
  • some weired unicode characters/combinations which are considered the equal, but a different on the byte level??? – Jens Schauder Apr 22 '11 at 20:54
  • @Martin @ Jens I don't think so. `line_number` and `log_generation` are both integers and `server_name` only has one value throughout the whole table. It doesn't have spaces or any special characters beyond ASCII. – astine Apr 22 '11 at 21:03
  • If `server_name` has only one value, is it not true that {line_number, log_generation} must be unique for a primary key constraint on {server_name, line_number, log_generation} to hold? – Mike Sherrill 'Cat Recall' Apr 22 '11 at 21:28
  • @Catcall: Yes, if `server_name` only has one value then it adds nothing to the uniqueness of the triples. – mu is too short Apr 22 '11 at 21:38
  • Then `UNIQUE (line_number, log_generation)` should fail, too, right? Worth a try . . . – Mike Sherrill 'Cat Recall' Apr 22 '11 at 22:00
  • It's version 8.4, I don't remember the revision number. It has undergone at least one upgrade, but only a minor version (8.4 both before and after.) – astine Apr 23 '11 at 01:40
  • Not directly related to your question, I would use an auto-serial as primary key too, because it is often more handy to specify a single value, than to use 3 comparisions. Maybe this would have prevented your problem, but of course such problems shouldn't happen at all, using a combined primary key. But now it can help you to delete some of the rows `(t1.a == t2.a AND t1.b==t2.b AND t1.c==t2.c AND t1.s < t2.s)` – user unknown May 08 '11 at 16:00

4 Answers4

4

What does this query return?

select server_name, line_number, log_generation 
from full_log
group by server_name, line_number, log_generation
having count(*) > 1

It might help to compare that to

select line_number, log_generation 
from full_log
group by line_number, log_generation
having count(*) > 1

but it might not. I think this clause

WHERE (EXISTS ( SELECT full_log.activity, 
                       full_log.archivaldate, 
                       full_log.media_type, 
                       full_log.vsn, 
                       full_log.archive_set, 
                       full_log.copy, 
                       full_log.file_start, 
                       full_log.file_offset, 
                       full_log.fs_name, 
                       full_log.inode, 
                       full_log.file_length, 
                       full_log.file_type, 
                       full_log.overflow, 
                       full_log.device_number, 
                       full_log.server_name, 
                       full_log.path, 
                       full_log.line_number, 
                       full_log.log_generation
               FROM full_log
               WHERE full_log.server_name = new.server_name 
                 AND full_log.line_number = new.line_number 
                 AND full_log.log_generation = new.log_generation)) 

can be simplified to this clause. (Although I don't think this contributes to the problem.)

WHERE (EXISTS ( SELECT full_log.server_name, 
                       full_log.line_number, 
                       full_log.log_generation
                FROM full_log
                WHERE full_log.server_name = new.server_name 
                  AND full_log.line_number = new.line_number 
                  AND full_log.log_generation = new.log_generation)) 

You said PostgreSQL drops and recreates the index when when you alter the data type of a non-key column. I don't see that happening here, and I'm not sure I've ever seen that happen. I might not have noticed if the change succeeded, and I don't routinely change the data type of a column. (Now that I've said that, I couldn't begin to tell you the last time I did that.) I now have PostgreSQL 9.0.2 here.

Mike Sherrill 'Cat Recall'
  • 91,602
  • 17
  • 122
  • 185
  • I've run the command but it will be a bit; the database is rather large. – astine Apr 22 '11 at 21:30
  • I've run the command and it returned 7841 rows. I can't see anything special about them at first glance. – astine Apr 22 '11 at 21:36
  • @astine - Can you pick one and locate the corresponding matching rows for it and examine those carefully (perhaps by casting to binary to spot any subtle differences) – Martin Smith Apr 22 '11 at 21:39
  • @Martin They appear to be equal non the less. (That is the keys appear equal, they are totally different rows.) The only value which differentiates the rows with duplicate keys is line_number which is an integer. I'm not sure what alternate representation could be at fault with them. – astine Apr 22 '11 at 21:47
  • @astine: Could you include a sample of the duplicates in your question? This is a very interesting problem BTW. – mu is too short Apr 22 '11 at 21:58
  • 2
    60636564 + 7841 = 60644405. At least your database can still do arithmetic correctly. (Small victory.) – Mike Sherrill 'Cat Recall' Apr 22 '11 at 22:03
  • @mu I can but it will have to wait until monday. @Catcall Yes, at least that. – astine Apr 23 '11 at 01:34
1

Are you using table inheritance? If so, the PK is not enforced across children (at least, not in 8.2).

Andrew Lazarus
  • 18,205
  • 3
  • 35
  • 53
0

i would suspect NULL values. maybe query to see if there are any.

Randy
  • 16,480
  • 1
  • 37
  • 55
0

I've seen this occur when the ctid of a target row gets changed as a result of a before trigger. Can't recall the exact steps, but that was basically the problem.

If you've triggers in there, my answer on this related topic might be what you're running into:

What are PostgreSQL RULEs good for?

The takeaway is: if, when you issue insert/update/delete statements, postgres doesn't return the correct number of affected rows, you may run into oddities. A rule (or using an after trigger instead) can then help to get rid of them.

Community
  • 1
  • 1
Denis de Bernardy
  • 75,850
  • 13
  • 131
  • 154