
I need to clean out a very bloated SQL database by deleting records that are older than two years from a number of tables. What is the most efficient way of doing this?

Nuwan
Garrett Dumas
  • What RDBMS? How many records older than 2 years? How many records younger than 2 years? – Martin Smith Jun 03 '10 at 20:59
  • Is there a common set of "created" or "last updated" fields on each table? – Mike Mooney Jun 03 '10 at 21:01
  • Adding the SQL product you're using is important (MSSQL, mySQL, etc). In MSSQL for example, if you have hundreds of thousands of rows, you're going to want to watch out for transaction log growth. – Jeff Schumacher Jun 03 '10 at 21:05
  • I am using MSSQL. The biggest table has 114,000 rows while the other two have about half that amount. The data goes back to 2005, so I would assume that there are roughly 50,000-60,000 rows that will stay. Also, there is a datetime field in the table. – Garrett Dumas Jun 03 '10 at 21:16
  • @Garrett - That amount of rows shouldn't cause many problems. Do you have to worry at all about concurrent users? – Martin Smith Jun 03 '10 at 21:19
  • We are migrating our application, so this is a test environment. Concurrent users shouldn't be a problem. – Garrett Dumas Jun 03 '10 at 21:21
  • Platform? Schema? Constraints, Foreign Key Constraints? Any pattern to the names and/or data types of the columns which will identify the rows to be deleted? – Cade Roux Jun 03 '10 at 21:00

3 Answers


Do you have any way to determine how "old" a record is? (i.e., is there a column in the table that represents either the age of the row or a date that can be used to calculate the age?) If so, it should be as simple as

DELETE FROM Table WHERE Age > 2

For example, if you have a DateTime column called CreateDate, you could do this:

DELETE FROM Table WHERE DATEADD(year, 2, CreateDate) < getdate()
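An essentially equivalent filter, assuming an index exists on CreateDate (the question doesn't say), keeps the bare column on one side of the comparison so that index can be used for a seek rather than a scan:

-- same two-year cutoff, but the comparison is on the raw column, so an index seek is possible
DELETE FROM Table WHERE CreateDate < DATEADD(year, -2, getdate())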
Adam Robinson
  • Could be quite nasty in terms of locks though. – Martin Smith Jun 03 '10 at 21:03
  • @Martin: I'm not sure that's avoidable, or even if you'd *want* it to be. I don't see how this is more "nasty" in terms of locks than any other `DELETE` or `UPDATE`, other than the fact that it involves a table scan. – Adam Robinson Jun 03 '10 at 21:04
  • Breaking deletes into batches could be better for concurrency. – Martin Smith Jun 03 '10 at 21:05
  • I do have a last updated datetime column. I am going to try that second one out and see if it is what I need. – Garrett Dumas Jun 03 '10 at 21:21
  • Martin is correct, I run into this all the time. In fact, in very large tables I've blown up the DB by filling up the transaction log. An alternate approach here might be a simple loop that deletes things a month at a time (from, say, 10 years old toward 2 years old, is 8 years or 96 chunks), executing each chunk as its own transaction. – Chris Wuestefeld Jun 03 '10 at 21:24
  • @Chris I've filled a disc up that way as well! Kind of counter intuitive the amount of space a delete can use. But @Garrett deleting 50,000-60,000 rows you shouldn't have any problems assuming you're not critically low on disc space already and haven't restricted the transaction log to a small size. – Martin Smith Jun 03 '10 at 21:32
  • If we do not maintain columns that can be used, is there any other way, perhaps some sort of meta data maintained by SQL? – Shamim Hafiz - MSFT Mar 29 '19 at 07:10
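A rough sketch of the chunked approach Chris describes, assuming a DateTime column named CreateDate, data going back to 2005 (as mentioned in the question), and a placeholder table name. Each month is deleted in its own transaction so the log does not have to hold the whole delete at once:

DECLARE @cutoff datetime, @from datetime, @to datetime
SET @cutoff = DATEADD(year, -2, GETDATE())
SET @from = '20050101'   -- oldest data mentioned in the question

WHILE @from < @cutoff
BEGIN
    SET @to = DATEADD(month, 1, @from)
    IF @to > @cutoff SET @to = @cutoff

    -- each month is its own transaction; in the SIMPLE recovery model the log space can be reused between chunks
    BEGIN TRANSACTION
    DELETE FROM Table WHERE CreateDate >= @from AND CreateDate < @to
    COMMIT TRANSACTION

    SET @from = @to
END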

In addition to Adam Robinson's good answer, when performing this type of operation:

  1. Run a SELECT query with the DELETE's WHERE clause first to make sure you're getting "the right data" (see the sketch after this list)
  2. Do a full backup
  3. Run the thing in "off" hours so as not to affect users too much
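For example, borrowing the CreateDate filter from Adam's answer (substitute whatever WHERE clause you actually intend to use, and your real table name):

-- preview what the DELETE would remove before running it
SELECT COUNT(*) FROM Table WHERE DATEADD(year, 2, CreateDate) < getdate()
SELECT * FROM Table WHERE DATEADD(year, 2, CreateDate) < getdate()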
Paul Sasik

I've seen DBAs do this in a few different companies, and it always seems to follow this format (sketched in code below):

  1. Backup the table
  2. Drop any indexes
  3. Select the rows you want to keep into a temp table
  4. Truncate the original table
  5. Insert (into your source table) from your temp table
  6. Recreate the indexes
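A minimal T-SQL sketch of those steps, with placeholder table, column, and index names, and assuming no identity columns or foreign keys that would complicate the truncate and re-insert:

-- 1. take a backup first (BACKUP DATABASE, or SELECT INTO a copy of the table)

-- 2. drop nonclustered indexes (names here are hypothetical)
DROP INDEX IX_MyTable_CreateDate ON MyTable

-- 3. copy the rows you want to keep into a temp table
SELECT * INTO #keep FROM MyTable WHERE CreateDate >= DATEADD(year, -2, GETDATE())

-- 4. empty the original table (minimally logged)
TRUNCATE TABLE MyTable

-- 5. put the kept rows back
INSERT INTO MyTable SELECT * FROM #keep
DROP TABLE #keep

-- 6. recreate the indexes
CREATE INDEX IX_MyTable_CreateDate ON MyTable (CreateDate)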

The benefit of this approach is that the operation doesn't write to the logs, so they don't get blown up by thousands of delete entries. It's also faster.

The drawback is that, because the operation doesn't write to the logs, your only recovery option if something goes wrong is to restore your backup.

You should think about putting housekeeping in place. If the above is too scary, you could also use that housekeeping to winnow the database down over a period of time.

In MSSQL, you could create a job that runs daily and deletes the first 1000 rows matched by your query. To steal Adam's query:

DELETE TOP (1000) FROM Table WHERE DATEADD(year, 2, CreateDate) < getdate()

This would be very safe, would get rid of your old data over three months or so, and would then also keep the size of the db down in the future.

Your database will get to reuse this space in the future, but if you want to recover the space now you will need to shrink the database. Read around if you are interested - whether it is worth it depends on the amount of space to recover versus the total size of the db.
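If you do decide to reclaim the space, that is done with DBCC SHRINKDATABASE (or DBCC SHRINKFILE for a single data or log file); the database name below is a placeholder:

-- reclaims unused space; shrinking fragments indexes, so rebuild them afterwards if you do this
DBCC SHRINKDATABASE (MyDatabase)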

Chanoch
  • I think this should be the accepted answer. I myself am having to do this for two reasons... remove the last year of data from a table with 60+ million rows and remove duplicates. The first will be covered by the above idea. I'm curious whether the same SELECT I use to trim the date could be efficiently used to remove the duplicates. Not certain - yet. – D-Klotz Sep 20 '21 at 17:03