0

Say I have a table like this, where no column or combination of columns is guaranteed to be unique:

GAME_EVENT USERNAME ITEM QUANTITY
sell poringLUVR sword 1
sell poringLUVR sword 1
kill daenerys civilians 200000
kill daenerys civilians 200000
invoke sylvanas undead 1000000

And I want to retrieve the list of all rows that exist more than once (where the combination of ALL their columns appears more than once).

(In this case I would expect to get a list with the "sell/poringLUVR" and "kill/daenerys" rows)

What would be a good way of approaching this? Would a combined index be of any help? Suggestions for non-Postgres approaches are also welcome.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
Z. M.
  • 329
  • 5
  • 13
  • Your actual table definition is always instrumental for any such question (would tell us about `NOT NULL` constraints, among other things). And always your version of Postgres. – Erwin Brandstetter May 16 '21 at 23:44

1 Answers1

2

Assuming all columns NOT NULL, this will do:

SELECT *
FROM   tbl t1
WHERE  EXISTS (
   SELECT FROM tbl t2
   WHERE  (t1.*) = (t2.*)
   AND    t1.ctid <> t2.ctid
   );

ctid is a system column, the "tuple identifier" / "item pointer" that can serve as poor-man's PK in the absence of an actual PK (which you obviously don't have), and only within the scope of a single query. Related:

If columns can be NULL, (more expensively) operate with IS NOT DISTINCT FROM instead of =. See:

(t1.*) = (t2.*) is comparing ROW values. This shorter syntax is equivalent: t1 = t2 unless a column of the same name exists in the underlying tables, in which case the second form fails while the first won't. See:

Index?

If any of the columns has a particularly high cardinality (many distinct values, few duplicates), let's call it hi_cardi_column for the purpose of this answer, a plain btree index on just that column can be efficient for your task. A combination of a few, small columns with a multicolumn index can work, too. The point is to have a small, fast index or the overhead won't pay.

SELECT *
FROM   tbl t1
WHERE  EXISTS (
   SELECT FROM tbl t2
   WHERE  t1.hi_cardi_column = t2.hi_cardi_column -- logically redundant
   AND    (t1.*) = (t2.*)
   AND    t1.ctid <> t2.ctid
   );

The added condition t1.hi_cardi_column = t2.hi_cardi_column is logically redundant, but helps to utilize said index.

Other than that I don't see much potential for index support as all rows of the table have to be visited anyway, and all columns have to be checked.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228