Find difference between two big tables in PostgreSQL

Question

I have two similar tables in Postgres with just one 32-byte latin field (simple md5 hash). Both tables have ~30,000,000 rows. Tables have little difference (10-1000 rows are different)

Is it possible with Postgres to find a difference between these tables, the result should be 10-1000 rows I described above.

This is not a real task, I just want to know about how PostgreSQL deals with JOIN-like logic.

look on this [How to compare two tables in postgres](http://stackoverflow.com/questions/4814597/how-to-compare-two-tables-in-postgres) and this to speed up the diff [How can I speed up a diff between tables?](http://stackoverflow.com/questions/6337871/how-can-i-speed-up-a-diff-between-tables) — static, Mar 11 '13 at 02:49

Erwin Brandstetter · Answer 1 · 2020-08-25T22:59:07.430

EXISTS seems like the best option.

tbl1 is the table with surplus rows in this example:

SELECT *
FROM   tbl1
WHERE  NOT EXISTS (SELECT FROM tbl2 WHERE tbl2.col = tbl1.col);

If you don't know which table has surplus rows or both have, you can either repeat the above query after switching table names, or:

SELECT *
FROM   tbl1
FULL   OUTER JOIN tbl2 USING (col)
WHERE  tbl2 col IS NULL OR
       tbl1.col IS NULL;

Overview over basic techniques in a later post:

Select rows which are not present in other table

Aside: The data type uuid is efficient for md5 hashes:

score 5 · Answer 2 · answered Mar 20 '18 at 10:20

To augment existing answers I use the row() function for the join condition. This allows you to compare entire rows. E.g. my typical query to see the symmetric difference looks like this

select *
from tbl1
full outer join tbl2 
    on row(tbl1) = row(tbl2)
where tbl1.col is null
or    tbl2.col is null

score 1 · Answer 3 · answered Nov 10 '21 at 22:26

If you want to find the difference without knowing which table has more rows than other, you can try this option that get all rows present in either tables:

SELECT * FROM A
WHERE NOT EXISTS (SELECT * FROM B) 
  UNION
SELECT * FROM B
WHERE NOT EXISTS (SELECT * FROM A)

score -2 · Answer 4 · answered Mar 11 '13 at 07:45

-2

In my experience, NOT IN with a subquery takes a very long time. I'd do it with an inclusive join:

DELETE FROM table1 where ID IN (
SELECT id FROM table1
LEFT OUTER JOIN table2 on table1.hashfield = table2.hashfield
WHERE table2.hashfield IS NULL)

And then do the same the other way around for the other table.

answered Mar 11 '13 at 07:45

0xCAFEBABE

5,576
5
34
59

Note that `NOT IN` is different in principal from `NOT EXISTS`. `NULL` handling is different, which makes `NOT IN` more expensive. – Erwin Brandstetter Mar 11 '13 at 16:59

Find difference between two big tables in PostgreSQL

4 Answers4