
I want to delete people that aren't present in events or photos or email subscribers. Maybe they were, but the only photo they're tagged in gets deleted, or the event they were at gets purged from the database.

Two obvious options:

1)

```sql
DELETE FROM people
WHERE personPK NOT IN (
    SELECT personFK FROM attendees
    UNION
    SELECT personFK FROM photo_tags
    UNION
    SELECT personFK FROM email_subscriptions
);
```

2)

```sql
DELETE people FROM people
LEFT JOIN attendees A ON A.personFK = personPK
LEFT JOIN photo_tags P ON P.personFK = personPK
LEFT JOIN email_subscriptions E ON E.personFK = personPK
WHERE attendeePK IS NULL
  AND photoTagPK IS NULL
  AND emailSubPK IS NULL;
```

Both A & P are about a million rows apiece, and E a few thousand.

The first option works fine, taking 10 seconds or so.

The second option times out.

Is there a cleverer, better, faster third option?

Codemonkey
  • This may be a stupid question, but if the 1st option works, could you use it? Is it just not fast enough? If so, what speed are you looking for? – raphael75 Oct 04 '16 at 17:39
  • are you sure you want to do this? What if a delete happens but then you want to re-hang a new photo? – Drew Oct 04 '16 at 17:40
  • I'm sure Drew. This is a half-fictitious example, but my use case is solid. – Codemonkey Oct 04 '16 at 17:41
  • how about if you load up a boat load of data and see if option 1 performs poorly. At cursory glance I see that one as being awful. You could do it with a select count(*) and not an actual delete to get an idea. I keep around huge tables to test my crazy ideas out on. – Drew Oct 04 '16 at 17:42
  • what about people size? Performance questions should include `EXPLAIN ANALYZE` and some information about table size, index, current time performance, desire time, etc. `Slow` is a relative term and we need a real value to compare. [**MySQL**](http://dba.stackexchange.com/questions/15371/how-do-i-get-the-execution-plan-for-a-view) – Juan Carlos Oropeza Oct 04 '16 at 17:53
  • I doubt it would be any faster but you could try a `NOT EXISTS` version of your query; also, have you tried `AND`ing three `NOT IN` conditions instead of `NOT IN`ing the `UNION` of three tables' personFK values? – Uueerdo Oct 04 '16 at 18:13
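A sketch of the `NOT EXISTS` variant Uueerdo suggests, using the same schema as the question (untested against the real tables):

```sql
-- Keep a person only while some child row still references them.
-- MySQL can often turn correlated NOT EXISTS probes into cheap
-- index lookups, one per child table.
DELETE FROM people
WHERE NOT EXISTS (SELECT 1 FROM attendees           A WHERE A.personFK = people.personPK)
  AND NOT EXISTS (SELECT 1 FROM photo_tags          P WHERE P.personFK = people.personPK)
  AND NOT EXISTS (SELECT 1 FROM email_subscriptions E WHERE E.personFK = people.personPK);
```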

2 Answers


This is what I would do with, say, a multi-million row half-fictitious schema like above.

For the people table, I would add one count column per child table, plus a datetime. Such as:

```sql
photoCount INT NOT NULL,
...
lastUpdt DATETIME NOT NULL,
```

When it comes time for an INSERT/UPDATE on child tables (main focus naturally being insert), I would

  1. begin a transaction
  2. perform a "select for update" which renders an Intention Lock on the parent (people) row
  3. perform the child insert, such as a new picture or email
  4. increment the parent relevant count variable and set lastUpdt=now()
  5. commit the tran (which releases the intention lock)

A delete against a child row is like above but with a decrement.

Whether these are done client-side / Stored Procs/ Trigger is your choice.
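The insert path in steps 1–5 might look like the following sketch (the `photo_tags` column names and the literal keys are assumed for illustration):

```sql
START TRANSACTION;

-- Step 2: lock the parent row so concurrent writers queue behind us
SELECT photoCount FROM people WHERE personPK = 42 FOR UPDATE;

-- Step 3: insert the child row (column names assumed)
INSERT INTO photo_tags (photoFK, personFK) VALUES (7, 42);

-- Step 4: bump the counter and touch the timestamp
UPDATE people
SET photoCount = photoCount + 1,
    lastUpdt   = NOW()
WHERE personPK = 42;

-- Step 5: commit, releasing the row lock
COMMIT;
```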

Have an Event that fires off once a week (you choose how often) and deletes people rows whose lastUpdt is more than a week old and whose count columns are all zero.

I realize the Intention Lock is not an exact analogy but the point about timeouts and row-level locking and the need for speed are relevant.

As always carefully craft your indexes considering frequency of use, real benefit, and potential drags on the system.

As for any periodic cleanup Events, schedule them to run in low peak hours with the scheduler.
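A minimal sketch of such a weekly cleanup Event (the extra count-column names are assumed, and the event scheduler must be enabled, e.g. `SET GLOBAL event_scheduler = ON;`):

```sql
CREATE EVENT purge_orphaned_people
ON SCHEDULE EVERY 1 WEEK
STARTS '2016-10-09 03:00:00'  -- pick a low-peak hour
DO
  DELETE FROM people
  WHERE lastUpdt < NOW() - INTERVAL 1 WEEK
    AND attendeeCount = 0   -- column names assumed to match
    AND photoCount    = 0   -- the counters described above
    AND emailSubCount = 0;
```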

There are some natural downsides to all of this. But if those summary numbers are useful for other profile pages, and fetching them on the fly is too costly, you benefit from them. You also avoid what I see as the expensive calls in your two proposed solutions.

Drew

I tried to duplicate your scenario here using PostgreSQL, but I think there is something else you didn't tell us.

> Both A & P are about a million rows apiece, and E a few thousand.

  • table people = 10k records
  • I selected 9,500 records at random and inserted them into email_subscriptions
  • Then I duplicated those 9,500 records 100 times each for attendees and photo_tags, for a total of 950k rows in each table

SQL FIDDLE DEMO
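A rough reconstruction of that test setup in PostgreSQL (the exact fiddle code isn't shown here, so the details are assumed):

```sql
-- 10k people
INSERT INTO people (personPK)
SELECT g FROM generate_series(1, 10000) AS g;

-- 9,500 of them, picked at random, get a subscription
INSERT INTO email_subscriptions (personFK)
SELECT personPK FROM people ORDER BY random() LIMIT 9500;

-- duplicate those 9,500 rows 100 times into each big child table
INSERT INTO attendees (personFK)
SELECT e.personFK
FROM email_subscriptions e
CROSS JOIN generate_series(1, 100);

INSERT INTO photo_tags (personFK)
SELECT e.personFK
FROM email_subscriptions e
CROSS JOIN generate_series(1, 100);
```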

The first query needs 5 seconds.
The second one needs 11 milliseconds.

Juan Carlos Oropeza