I'm maintaining a database that stores data transfer information between different networks. Each data transfer is logged, and at the end of each month I run a Perl script that loads the log files into a table in the database. I did not design the Perl script or the database schema; both were done before I joined the project.
I used this link to retrieve the primary keys of the table (the table is named usage_detail), and it gave me nothing, so the table apparently has no primary key. Since there are so many records in the table, it's not easy to keep track of duplicates. We've had cases where a lot of duplicates were loaded (because of bugs in the script that does the logging for each transfer, but that's a topic for another question), and we ended up having to drop the latest load and reload everything after fixing the log files. As you may have guessed, this is tedious and error-prone.
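(In case it matters, the kind of lookup I mean is roughly the standard information_schema query below; I'm not sure it is identical to whatever the link used, but this is the idea.)

    -- A standard way to list the primary-key columns of usage_detail;
    -- not necessarily the exact query from the link above.
    SELECT kcu.column_name
    FROM information_schema.table_constraints tc
    JOIN information_schema.key_column_usage kcu
      ON kcu.constraint_name = tc.constraint_name
     AND kcu.table_name = tc.table_name
    WHERE tc.constraint_type = 'PRIMARY KEY'
      AND tc.table_name = 'usage_detail';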
To fix this duplicate problem, I would like to add a primary key to the table. For several reasons, we don't want to add a whole new column just to hold a surrogate key. After looking at the fields, I've come up with a multi-column (composite) primary key consisting of: the transfer start timestamp, the transfer end timestamp, and the name of the file transferred (which includes the full path). It seems highly unlikely that two records would ever share all three of those values.
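To sanity-check that assumption, this is roughly the duplicate check I had in mind; the column names (transfer_start, transfer_end, file_path) are my guesses, since I don't have the schema in front of me:

    -- Hypothetical column names. Groups rows by the proposed key and lists
    -- any combination that occurs more than once, i.e. the rows that would
    -- prevent the primary key from being created.
    SELECT transfer_start, transfer_end, file_path, count(*) AS dup_count
    FROM usage_detail
    GROUP BY transfer_start, transfer_end, file_path
    HAVING count(*) > 1;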
Here are my questions:

1) If I add this primary key to the table, what will happen to any duplicates that are already present?
2) How would I actually add this primary key to the table? (We are using PostgreSQL 8.1.22.) My rough guess at the statement is in the sketch after this list.
3) After the primary key is added, suppose the load script tries to insert a duplicate. What sort of error would PostgreSQL throw, and would I be able to catch it in the script?
4) I know you don't have much information about the load script, but given what I have described, do you foresee anything in it that might need to be changed?
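For reference, here is my rough guess at the statement for question 2, using the same hypothetical column names as above. My understanding is that it will refuse to run if duplicates (or NULLs) already exist in those columns, which is partly why I'm asking question 1; and I assume that once it's in place, a duplicate insert from the load script would fail with some kind of unique/duplicate-key violation, which is what question 3 is about.

    -- Hypothetical column names again; the real ones are whatever the
    -- original schema defines.
    ALTER TABLE usage_detail
        ADD CONSTRAINT usage_detail_pkey
        PRIMARY KEY (transfer_start, transfer_end, file_path);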
Any help is greatly appreciated. Thanks.