0

I am trying to write a method that removes duplicates from tables, without having to know the details of the table for generality (i.e., it should run on any table).

I am using the following method from here (last method) through psycopg2:

CREATE TABLE tempTable (LIKE "{table}");
INSERT INTO tempTable(*)
  SELECT
    DISTINCT ON ("{column}") *
    FROM "{table}";
DROP TABLE "{table}";
ALTER TABLE tempTable
  RENAME TO "{table}";
DROP table tempTable

The problem is that you can't use * to get all columns inside the INSERT INTO command. We are expected to specify a list of all the columns to insert, which is probably good for control, but is bad for my purpose here.

I know I can get the list of column names for the table with

SELECT column_name FROM information_schema.columns WHERE table_name = "{table}"

but I can't just plug that as a subquery in place of the *.

Answers to this 8 year old question state that it's just not possible in SQL. That might still be true, or it might not be true.

Can this be done now?

Aaron Bramson
  • 1,176
  • 3
  • 20
  • 34
  • Why don't you just delete the duplicates? You can use VACUUM FULL or CLUSTER afterwards to shrink the table, if needed. – Frank Heikens Apr 06 '23 at 12:05
  • Yes, I ran this by creating a temporary uniqueID column, remove duplicates with that as an index, then removing that column. It was plenty fast but less elegant. If there is a an easier way with these commands you mention, please post it as an answer. – Aaron Bramson Apr 07 '23 at 00:57
  • Valid point @FrankHeikens, the CTAS approach is meaninfull only with a large number of duplicates or if also a table reorganisation is required. For in-place delete see [here](https://stackoverflow.com/q/6583916/4808122) – Marmite Bomber Apr 07 '23 at 06:10

1 Answers1

2

Your link points to MySQLdatabase, in PostgreSQL is perfectly valid to use insert without the column list.

INSERT INTO tempTable 
  SELECT
    DISTINCT ON (a) *
    FROM t;

It is not the best practice (as this may cause problems if the column order is sudenly not as expected), but for your use case where the target table is created with like from the source table will work.

Two additional points

  • You may want to add order by in the select distinct to enable a deterministic outcome (which duplicated row(s) will be removed and which preserve).

  • the final DROP table tempTable is misplaced as this table does not exists anymore as beeing renamed in the previous statement...

Marmite Bomber
  • 19,886
  • 4
  • 26
  • 53