Find rows with all columns duplicated and no unique field in PostgreSQL

Question

Say I have a table like this, where no column or combination of columns is guaranteed to be unique:

GAME_EVENT	USERNAME	ITEM	QUANTITY
sell	poringLUVR	sword	1
sell	poringLUVR	sword	1
kill	daenerys	civilians	200000
kill	daenerys	civilians	200000
invoke	sylvanas	undead	1000000

And I want to retrieve the list of all rows that exist more than once (where the combination of ALL their columns appears more than once).

(In this case I would expect to get a list with the "sell/poringLUVR" and "kill/daenerys" rows)

What would be a good way of approaching this? Would a combined index be of any help? Suggestions for non-Postgres approaches are also welcome.

Your actual table definition is always instrumental for any such question (would tell us about `NOT NULL` constraints, among other things). And always your version of Postgres. — Erwin Brandstetter, May 16 '21 at 23:44

Erwin Brandstetter · Accepted Answer · 2021-05-17T00:07:30.680

Assuming all columns NOT NULL, this will do:

SELECT *
FROM   tbl t1
WHERE  EXISTS (
   SELECT FROM tbl t2
   WHERE  (t1.*) = (t2.*)
   AND    t1.ctid <> t2.ctid
   );

ctid is a system column, the "tuple identifier" / "item pointer" that can serve as poor-man's PK in the absence of an actual PK (which you obviously don't have), and only within the scope of a single query. Related:

If columns can be NULL, (more expensively) operate with IS NOT DISTINCT FROM instead of =. See:

How do I (or can I) SELECT DISTINCT on multiple columns?

(t1.*) = (t2.*) is comparing ROW values. This shorter syntax is equivalent: t1 = t2 unless a column of the same name exists in the underlying tables, in which case the second form fails while the first won't. See:

SQL syntax term for 'WHERE (col1, col2) < (val1, val2)'

Index?

If any of the columns has a particularly high cardinality (many distinct values, few duplicates), let's call it hi_cardi_column for the purpose of this answer, a plain btree index on just that column can be efficient for your task. A combination of a few, small columns with a multicolumn index can work, too. The point is to have a small, fast index or the overhead won't pay.

SELECT *
FROM   tbl t1
WHERE  EXISTS (
   SELECT FROM tbl t2
   WHERE  t1.hi_cardi_column = t2.hi_cardi_column -- logically redundant
   AND    (t1.*) = (t2.*)
   AND    t1.ctid <> t2.ctid
   );

The added condition t1.hi_cardi_column = t2.hi_cardi_column is logically redundant, but helps to utilize said index.

Other than that I don't see much potential for index support as all rows of the table have to be visited anyway, and all columns have to be checked.

Find rows with all columns duplicated and no unique field in PostgreSQL

1 Answers1

Index?

Linked