There are two tables linked by an id:
item_tbl (id)
link_tbl (item_id)
There are some records in item_tbl
that don't have matching rows in link_tbl
. A select which would count their amount would be:
SELECT COUNT(*)
FROM link_tbl lnk LEFT JOIN item_tbl itm ON lnk.item_id=itm.id
WHERE itm.id IS NULL
I would like to delete those orphan records (those which don't have match in the other table) from link_tbl
but the only way I could think of was:
DELETE FROM link_tbl lnk
WHERE lnk.item_id NOT IN (SELECT itm.id FROM item_tbl itm)
There are
262,086,253 records in link_tbl
3,033,811 in item_tbl
16,844,347 orphan records in link_tbl
.
The server has 4GB RAM and 8 core CPU.
EXPLAIN DELETE FROM link_tbl lnk
WHERE lnk.item_id NOT IN (SELECT itm.id FROM item_tbl itm)
Returns:
Delete on link lnk (cost=0.00..11395249378057.98 rows=131045918 width=6)
-> Seq Scan on link lnk (cost=0.00..11395249378057.98 rows=131045918 width=6)
Filter: (NOT (SubPlan 1))
SubPlan 1
-> Materialize (cost=0.00..79298.10 rows=3063207 width=4)
-> Seq Scan on item itm (cost=0.00..52016.07 rows=3063207 width=4)
The questions are:
- Is there any better way how to delete orphan records from
link_tbl
? How accurate is the explain above, or how long it could take to delete those records?
- Edit: fixed according to Erwin Brandstetter comment.
- Edit: PostgreSql version is 9.1
- Edit: some parts of postgresql.config
- shared_buffers = 368MB
- temp_buffers = 32MB
- work_mem = 32MB
- maintenance_work_mem = 64MB
- max_stack_depth = 6MB
- fsync = off
- synchronous_commit = off
- full_page_writes = off
- wal_buffers = 16MB
- wal_writer_delay = 5000ms
- commit_delay = 10
- commit_siblings = 10
- effective_cache_size = 1600MB
- Edit: fixed according to Erwin Brandstetter comment.
Resolution:
Thank you all for your advices, it was very helpful. I finally used the delete advised by Erwin Brandstetter https://stackoverflow.com/a/15959896/1331340 but I tweaked it a little:
DELETE FROM link_tbl lnk
WHERE lnk.item_id BETWEEN 0 AND 10000
AND lnk.item_id NOT IN (SELECT itm.id FROM item itm
WHERE itm.id BETWEEN 0 AND 10000)
I compared results for NOT IN and NOT EXISTS and the output is below, although I used COUNT instead of DELETE which I think should be the same (I mean in sake of relative comparison):
EXPLAIN ANALYZE SELECT COUNT(*)
FROM link_tbl lnk
WHERE lnk.item_id BETWEEN 0 AND 20000
AND lnk.item_id NOT IN (SELECT itm.id
FROM item_tbl itm
WHERE itm.id BETWEEN 0 AND 20000);
QUERY PLAN
Aggregate (cost=6002667.56..6002667.57 rows=1 width=0) (actual time=226817.086..226817.088 rows=1 loops=1)
-> Seq Scan on link_tbl lnk (cost=1592.50..5747898.65 rows=101907564 width=0) (actual time=206.029..225289.570 rows=566625 loops=1)
Filter: ((item_id >= 0) AND (item_id <= 20000) AND (NOT (hashed SubPlan 1)))
SubPlan 1
-> Index Scan using item_tbl_pkey on item_tbl itm (cost=0.00..1501.95 rows=36221 width=4) (actual time=0.056..99.266 rows=17560 loops=1)
Index Cond: ((id >= 0) AND (id <= 20000))
Total runtime: 226817.211 ms
EXPLAIN ANALYZE SELECT COUNT(*)
FROM link_tbl lnk WHERE lnk.item_id>0 AND lnk.item_id<20000
AND NOT EXISTS (SELECT 1 FROM item_tbl itm WHERE itm.id=lnk.item_id);
QUERY PLAN
Aggregate (cost=8835772.00..8835772.01 rows=1 width=0)
(actual time=1209235.133..1209235.135 rows=1 loops=1)
-> Hash Anti Join (cost=102272.16..8835771.99 rows=1 width=0)
(actual time=19315.170..1207900.612 rows=566534 loops=1)
Hash Cond: (lnk.item_id = itm.id)
-> Seq Scan on link_tbl lnk (cost=0.00..5091076.55 rows=203815128 width=4) (actual time=0.016..599147.604 rows=200301872 loops=1)
Filter: ((item_id > 0) AND (item_id < 20000))
-> Hash (cost=52016.07..52016.07 rows=3063207 width=4) (actual time=19313.976..19313.976 rows=3033811 loops=1)
Buckets: 131072 Batches: 4 Memory Usage: 26672kB
-> Seq Scan on item_tbl itm (cost=0.00..52016.07 rows=3063207 width=4) (actual time=0.013..9274.158 rows=3033811 loops=1)
Total runtime: 1209260.228 ms
NOT EXISTS was 5 times slower.
The actual delete of the data didn't take so long as I was worried, I was able to delete it in 5 batches (10000-20000,20000-100000,100000-200000,200000-1000000 and 1000000-1755441). At first I found out max item_id and I only had to went through half of the table.
When I tried NOT IN or EXISTS without the range (with select count) it didn't even finish, I let it run during the night and it was still running in the morning.
I think I was looking for DELETE with USING from wildplasser's answer https://stackoverflow.com/a/15988033/1331340 but it came too late.
DELETE FROM one o
USING (
SELECT o2.id
FROM one o2
LEFT JOIN two t ON t.one_id = o2.id
WHERE t.one_id IS NULL
) sq
WHERE sq.id = o.id
;