0

Problem: I have 2 rather large tables: message (+20 million records ~ 11GB) and message_subject (14 million records ~ 3GB) that need to be 'cleaned up'.

I need to delete all messages older than 2 weeks along with the corresponding message_subjects if the message has a subject, but it's important to note that not all messages have a subject and messages older than 2 weeks without a subject need to be removed too.

These 2 tables do NOT have a relation specified through InnoDB, even though message_subject has the field message_id.

If a relation was specified between the 2 tables with ON DELETE CASCADE, I could just delete the record from message and the corresponding message_subject would automatically be removed, however this isn't the case.

I have tried:

DELETE message, message_subject
FROM message
INNER JOIN faq 
WHERE message.id = message_subject.message_id 
AND message.add_date < DATE_SUB(CURDATE(),INTERVAL 2 WEEK)

Which deletes messages that have a subject; messages older than 2 weeks without a subject are not deleted.

DELETE message, message_subject
FROM message
INNER JOIN message_subject
      ON message.id = message_subject.message_id 
WHERE message.add_date < DATE_SUB(CURDATE(),INTERVAL 2 WEEK)

does exactly the same as the first query; it doesn't delete the old messages without a subject.

Can anybody help me out here?

Brian Tompsett - 汤莱恩
  • 5,753
  • 72
  • 57
  • 129
deacs
  • 4,259
  • 2
  • 27
  • 37
  • 1
    Are you looking for an `outer join` instead? – sgeddes Dec 31 '14 at 14:18
  • @sgeddes lemme give that a shoot and see what happens. – deacs Dec 31 '14 at 14:22
  • generally being in a position where you have so much data to delete is a pain because you could run into performance issues. For example it could be really slow because you have many delete operations or your redo log may get too big. Consider deleting in batches, i.e. committing every 10K records (you should performance test the threshold to see what works best for you), and I would even suggest that you either start deleting data more frequently or redesign your database to use partitions because they are quick to dispose of. – benji Dec 31 '14 at 14:24
  • @benji right you are, it really is a pain to delete so many records on a production server which can't be slowed down too much, so I will indeed be runnning a cron that does the deleting in batches, that's the reason I want to do it with 1 query. – deacs Dec 31 '14 at 15:01
  • 1
    @deacs what's the problem with doing it in 2? i.e. first delete all the subjects and then all the messages.. – benji Dec 31 '14 at 15:04
  • @sgeddes looks like the outer join worked, do you want to post an answer so I can award you your correct answer? – deacs Dec 31 '14 at 15:04
  • @benji if i use 2 queries, i will have to join in the second query anyway (to get subjects without a message), so may as well use a single query with a join to do both at the same time. so not really a problem, just more optimal imo. – deacs Dec 31 '14 at 15:08

1 Answers1

1

Thanks to sgeddes' comment, I was able to solve the problem using the following query:

DELETE message, message_subject
FROM message
LEFT OUTER JOIN message_subject
      ON message.id = message_subject.message_id 
WHERE message.add_date < DATE_SUB(CURDATE(),INTERVAL 2 WEEK)

Which apparently is the same as

DELETE message, message_subject
FROM message
LEFT JOIN message_subject
      ON message.id = message_subject.message_id 
WHERE message.add_date < DATE_SUB(CURDATE(),INTERVAL 2 WEEK)
deacs
  • 4,259
  • 2
  • 27
  • 37
  • Feel free to accept this as the answer - although I wonder if creating new tables from the good data wouldnt be faster – Strawberry Dec 31 '14 at 15:50
  • @Strawberry I'll only be able to accept my own answer after 2 days. And I agree, a whole new table would be a faster sollution, that unfortunately isn't an option in my case. – deacs Dec 31 '14 at 16:19