20

I have two similar tables in Postgres with just one 32-byte latin field (simple md5 hash). Both tables have ~30,000,000 rows. Tables have little difference (10-1000 rows are different)

Is it possible with Postgres to find a difference between these tables, the result should be 10-1000 rows I described above.

This is not a real task, I just want to know about how PostgreSQL deals with JOIN-like logic.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
odiszapc
  • 4,089
  • 2
  • 27
  • 42
  • look on this [How to compare two tables in postgres](http://stackoverflow.com/questions/4814597/how-to-compare-two-tables-in-postgres) and this to speed up the diff [How can I speed up a diff between tables?](http://stackoverflow.com/questions/6337871/how-can-i-speed-up-a-diff-between-tables) – static Mar 11 '13 at 02:49

4 Answers4

37

EXISTS seems like the best option.

tbl1 is the table with surplus rows in this example:

SELECT *
FROM   tbl1
WHERE  NOT EXISTS (SELECT FROM tbl2 WHERE tbl2.col = tbl1.col);

If you don't know which table has surplus rows or both have, you can either repeat the above query after switching table names, or:

SELECT *
FROM   tbl1
FULL   OUTER JOIN tbl2 USING (col)
WHERE  tbl2 col IS NULL OR
       tbl1.col IS NULL;

Overview over basic techniques in a later post:

Aside: The data type uuid is efficient for md5 hashes:

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
5

To augment existing answers I use the row() function for the join condition. This allows you to compare entire rows. E.g. my typical query to see the symmetric difference looks like this

select *
from tbl1
full outer join tbl2 
    on row(tbl1) = row(tbl2)
where tbl1.col is null
or    tbl2.col is null
ThomasH
  • 22,276
  • 13
  • 61
  • 62
1

If you want to find the difference without knowing which table has more rows than other, you can try this option that get all rows present in either tables:

SELECT * FROM A
WHERE NOT EXISTS (SELECT * FROM B) 
  UNION
SELECT * FROM B
WHERE NOT EXISTS (SELECT * FROM A)
testing_22
  • 2,340
  • 1
  • 12
  • 28
-2

In my experience, NOT IN with a subquery takes a very long time. I'd do it with an inclusive join:

DELETE FROM table1 where ID IN (
SELECT id FROM table1
LEFT OUTER JOIN table2 on table1.hashfield = table2.hashfield
WHERE table2.hashfield IS NULL)

And then do the same the other way around for the other table.

0xCAFEBABE
  • 5,576
  • 5
  • 34
  • 59