
I have a problem with copying filtered data from one table to another table with a similar structure. Basically, I have one big table with about 11 billion rows, and I'd like to copy into another table only the records that are unique on one of the columns (called "fingerprint"). I tried something like this:

INSERT INTO table2 SELECT DISTINCT ON (fingerprint) * FROM table1; 

It generally works, but it's slow: I tested it on a smaller table of 800,000 rows and it took nearly an hour. I also tried replacing the INSERT with a pair of COPY statements (with the binary option), but that was somehow even worse...
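
For reference, the COPY version was roughly like the following (the file path here is only an illustration, not the exact one I used):

-- Dump the de-duplicated rows to a binary file, then load that file into table2.
COPY (SELECT DISTINCT ON (fingerprint) * FROM table1) TO '/tmp/table1_dedup.bin' WITH (FORMAT binary);
COPY table2 FROM '/tmp/table1_dedup.bin' WITH (FORMAT binary);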

I run the PostgreSQL server on Linux Mint Debian Edition (LMDE), on an i5-2410M with 6 GB of RAM. I don't know whether it should take this long, or whether there's a way to improve it. I'd appreciate any hints.

    Have you seen [Whats the fastest way to do a bulk insert into Postgres?](http://stackoverflow.com/q/758945/74757)? – Cᴏʀʏ Dec 01 '12 at 19:14
  • You've omitted your most important stat for your server: The disk subsystem. Is it on a good-quality SSD or array of SSDs? A RAID controller with write-back battery backed cache? What RAID level? Also, PostgreSQL version? – Craig Ringer Dec 02 '12 at 08:04
  • `DISTINCT ON` selects an arbitrary row - don't you want to specify which of the rows gets chosen? Do you have any index on `fingerprint`? What's the query plan (see `EXPLAIN ANALYZE`) of the `SELECT DISTINCT ...` ? `EXPLAIN` your query and paste to http://explain.depesz.com/ ; also do an `EXPLAIN ANALYZE` on a version with less data so it completes and gives you a result. – Craig Ringer Dec 02 '12 at 08:07
