We have a giant SQL Server 2005 database (75GB) which is basically just one table with sales values (per day, store and article). We want to free up space by summing the sales values per week for every record older than a year (still grouped per store and article). In theory, for data older than a year we can then remove 6 out of 7 records.

Writing a procedure to do this is not really a problem, but it runs forever. So I was looking for a strategy that could make this run in a reasonable amount of time.
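Roughly, this is what the procedure has to do (simplified, with made-up column names - the real one has extra rules, e.g. for weeks that straddle two financial booking years):

    DECLARE @cutoff datetime;
    SET @cutoff = DATEADD(YEAR, -1, GETDATE());

    -- 1. roll everything older than a year up to one row per week, store and article
    SELECT DATEADD(WEEK, DATEDIFF(WEEK, 0, SaleDate), 0) AS WeekStart,   -- placeholder week definition
           StoreId, ArticleId, SUM(SalesValue) AS SalesValue
    INTO   #Weekly
    FROM   dbo.Sales
    WHERE  SaleDate < @cutoff
    GROUP BY DATEADD(WEEK, DATEDIFF(WEEK, 0, SaleDate), 0), StoreId, ArticleId;

    -- 2. replace the old daily rows with the weekly rows
    BEGIN TRAN;
        DELETE FROM dbo.Sales WHERE SaleDate < @cutoff;

        INSERT INTO dbo.Sales (SaleDate, StoreId, ArticleId, SalesValue)
        SELECT WeekStart, StoreId, ArticleId, SalesValue FROM #Weekly;
    COMMIT;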

To give you an idea: a simple SELECT COUNT(*) ran for over 4 minutes.

We do have a few indexes (a clustered one on date, and one on the store, article and date combination). Adding any more indexes also takes forever.

Does anyone have a good strategy for performing this task? Any suggestions for T-SQL methods that perform better than the basic DML statements?

Koen
  • Have you considered sharding? – moodywoody Nov 07 '11 at 09:45
  • Not really an option. Data is growing, numerous functions depend on it for statistical info. The schema should not be touched... – Koen Nov 07 '11 at 09:52
  • Do you want to boost performance or reduce disk space used? Do you need to maintain granularity of data? COUNT will always take time eg http://stackoverflow.com/questions/6069237/fastest-way-to-count-exact-number-of-rows-in-a-very-large-table/6069288#6069288 this table was 200GB+ at the time – gbn Nov 07 '11 at 10:28
  • The purpose of the procedure is both. Obviously I'm asking about how to solve performance issues in the procedure that will reduce the data (which will also run weekly to do the same for new data). We don't keep detailed records. The sales value is already a sum of incoming data which we now want to group per week instead of per day for older data. – Koen Nov 07 '11 at 11:38

3 Answers

1

If you use SQL Server 2005 Enterprise Edition, you should consider using the partitioning feature. Benefits:

  • by partitioning your data on the date column, queries will run much faster because SQL Server will only access the relevant partitions; this way you can run your day->week procedure on a date range and it will run much faster (and you can run multiple procedures on different date ranges at the same time)
  • if you want to keep your daily data, just move old partitions to slower storage (hard disks)
  • your procedure should prepare the weekly data in a new table and then switch partitions - that is much faster than deleting daily data and inserting weekly data (see the sketch after this list)
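A minimal sketch of the partition setup and the switch (all names and boundary dates are hypothetical; the staging tables must have the same structure as the main table, live on the right filegroup and, for switching in, carry a CHECK constraint matching the partition range):

    -- Requires Enterprise Edition.
    CREATE PARTITION FUNCTION pfSalesDate (datetime)
    AS RANGE RIGHT FOR VALUES ('20100101', '20100201', '20100301');

    CREATE PARTITION SCHEME psSalesDate
    AS PARTITION pfSalesDate ALL TO ([PRIMARY]);

    -- The table (or its clustered index) has to be created or rebuilt on the scheme, e.g.:
    -- CREATE CLUSTERED INDEX IX_Sales_Date ON dbo.Sales (SaleDate) ON psSalesDate (SaleDate);

    -- Switching is a metadata-only operation, so it is nearly instant:
    ALTER TABLE dbo.Sales SWITCH PARTITION 2 TO dbo.DailyOut;     -- move old daily rows out
    ALTER TABLE dbo.WeeklyIn SWITCH TO dbo.Sales PARTITION 2;     -- move prepared weekly rows in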

If you don't use Enterprise Edition, look into partitioning (sharding or horizontal partitioning) techniques that are not based on the SQL Server 2005 partitioning feature.

For stored procedure optimization:

  • re-evaluate the current indexes for your SP
  • run the daily->weekly procedure on date ranges, for instance year by year or month by month - running the procedure over the whole history at once will be a lot of work for SQL Server and the underlying hardware
  • probably the best way, following the previous item about date ranges: create a new table containing the old data aggregated per week plus the recent daily data, create its indexes, and then in one transaction drop the original table and use sp_rename to give the new table the original name - the rename is almost instant, so nobody will notice the switch (see the sketch after this list)
  • consider dropping the indexes on the target table, because inserts will be much slower with them in place - this only applies if you are working on the original table (delete + insert)
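A rough sketch of that build-and-swap approach (table and column names are hypothetical, and the week-start expression is only a placeholder for your real week definition):

    DECLARE @cutoff datetime;
    SET @cutoff = DATEADD(YEAR, -1, GETDATE());

    -- 1. Aggregate everything older than a year into weekly rows and copy the
    --    recent daily rows as-is into a brand new table.
    SELECT DATEADD(WEEK, DATEDIFF(WEEK, 0, SaleDate), 0) AS SaleDate,   -- "week start"
           StoreId, ArticleId, SUM(SalesValue) AS SalesValue
    INTO   dbo.Sales_New
    FROM   dbo.Sales
    WHERE  SaleDate < @cutoff
    GROUP BY DATEADD(WEEK, DATEDIFF(WEEK, 0, SaleDate), 0), StoreId, ArticleId;

    INSERT INTO dbo.Sales_New (SaleDate, StoreId, ArticleId, SalesValue)
    SELECT SaleDate, StoreId, ArticleId, SalesValue
    FROM   dbo.Sales
    WHERE  SaleDate >= @cutoff;

    -- 2. Build the indexes on the new table while the old one still serves queries.
    CREATE CLUSTERED INDEX IX_Sales_New_Date ON dbo.Sales_New (SaleDate);

    -- 3. Swap the tables; sp_rename itself is practically instant.
    BEGIN TRAN;
        DROP TABLE dbo.Sales;
        EXEC sp_rename 'dbo.Sales_New', 'Sales';
    COMMIT;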

Off-topic hint: if you are using Enterprise Edition, consider compressing your table (the vardecimal storage format added in SQL Server 2005 SP2), since fact tables usually compress well - you will probably gain both performance and disk space if you have enough CPU power.

Filip Popović
  • It's Standard Edition. Sharding is not an option. We can't change the schema and there are overlapping scenarios, e.g. one week can fall in two financial booking years, in which case we need to split up the week statistics. Your suggested solution is what we already had up till now, which is a pain in the ass. It takes 50 minutes for 1 month of data (way too much). Any idea which statements are designed for this? We use cursors now, which are typically slow, but it's the only way I got it right up till now... – Koen Nov 07 '11 at 11:42
  • Do you use cursors to iterate over calendar periods, or to iterate over each record in this table? You should avoid cursors and implement the bold item in the answer using set logic (instead of cursor row-by-row logic). Post the code you use (the whole procedure, with comments) so we can review it. P.S.: there is no specific DML for this; probably the first thing to do is to use set logic instead of cursors – Filip Popović Nov 07 '11 at 13:31
  • That is where the overlapping scenarios come in. There are a few exceptions where you can't just grab and process a whole week because it's just not correct statistically. What do you mean by set logic? – Koen Nov 07 '11 at 13:41
  • You can process data row by row (one select statement for each row, or iterating through a recordset with a cursor) or process data in sets (multiple rows at once). I am sure you know what set-based logic is, it's just a terminology problem. Check: http://www.codeproject.com/KB/database/SetAndProceduralSQL.aspx You didn't post your code, so we can't help you much, because there is no general-purpose solution to performance problems. – Filip Popović Nov 07 '11 at 13:56
0

Can you please share the schema?

Did you try using WITH (NOLOCK) or setting ISOLATION LEVEL to READ UNCOMMITTED?
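For example (a sketch with a hypothetical table name - note that dirty reads mean the result can include uncommitted data):

    SELECT COUNT(*) FROM dbo.Sales WITH (NOLOCK);

    -- or for the whole session:
    SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
    SELECT COUNT(*) FROM dbo.Sales;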

Sometimes we get fixated on the idea that we can't make any schema change and have to find a solution without touching anything significant. You can always change the underlying table and then expose views to the consuming clients. If you have stored procs, the table schema is free to change, since the stored procs encapsulate access to the table. If you say you can't change the stored procs either and you can't create any views, I would question why you are under such a rigid policy and how long you think you can really survive with it. What if the database grows to 200GB in a year? Will you then take a drastic approach that costs significantly more time and money to fix, or shall we do it now while the problem is still small?

My suggestions would be:

  • Partition the table.
  • Let clients read data from a view that does not change (see the sketch after this list).
  • Let all database operations go through stored proc.
  • Do all optimizations inside stored proc.
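A minimal sketch of that indirection, with hypothetical names: clients and reports query dbo.SalesData and call dbo.AddSale, so the physical table behind them can be reorganized without anyone noticing.

    -- The "stable" view that consuming clients use:
    CREATE VIEW dbo.SalesData
    AS
    SELECT SaleDate, StoreId, ArticleId, SalesValue
    FROM   dbo.Sales;            -- only this body changes when the table is reorganized
    GO

    -- All writes go through a proc, so the table layout stays an implementation detail:
    CREATE PROCEDURE dbo.AddSale
        @SaleDate   datetime,
        @StoreId    int,
        @ArticleId  int,
        @SalesValue decimal(18, 2)
    AS
    BEGIN
        INSERT INTO dbo.Sales (SaleDate, StoreId, ArticleId, SalesValue)
        VALUES (@SaleDate, @StoreId, @ArticleId, @SalesValue);
    END
    GO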

For a short term "fix" to alleviate some pain for now you can try:

  • If you have SATA drives, convert them to SAS. That will give a drastic IO boost.
  • Use RAID 5, which is better for reads.
  • Make sure the MDF and LDF are on completely different physical drives. If you can afford it, put them on separate RAID 5 controllers. Otherwise put the LDF on RAID 1 and the MDF on RAID 5.
  • Add another drive and add another data file to it. Inserts, updates and deletes will then be spread over multiple disks, and reads will be served from multiple disks as well, which might give you better throughput.
  • Rebuild the clustered index (see the example after this list).
  • Use Windows Server disk defragmentation software to defragment the disk.
  • Upgrade to a better processor with more L2 cache.
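For the index rebuild, a minimal example (the table and index names are hypothetical):

    ALTER INDEX ALL ON dbo.Sales REBUILD;            -- every index on the table
    ALTER INDEX IX_Sales_Date ON dbo.Sales REBUILD;  -- or just the clustered one by name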
oazabir
0

Can you tell us a bit more about your server's hardware? Basically, when data gets big, put it on lots of fast disks.

Also, on Standard Edition you can still create sub-tables and a view over them to sort of get partitioning (see the sketch below). Usually the older data does not get queried as often as the newer data; you can take advantage of that by putting the data that gets queried most often on faster disks than the older data.
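A sketch of that approach, with hypothetical names: one table per year with a CHECK constraint on the date, plus a UNION ALL view (a partitioned view), so the optimizer only touches the tables whose date range matches the query.

    CREATE TABLE dbo.Sales_2010 (
        SaleDate   datetime       NOT NULL CHECK (SaleDate >= '20100101' AND SaleDate < '20110101'),
        StoreId    int            NOT NULL,
        ArticleId  int            NOT NULL,
        SalesValue decimal(18, 2) NOT NULL,
        CONSTRAINT PK_Sales_2010 PRIMARY KEY (SaleDate, StoreId, ArticleId)
    );

    CREATE TABLE dbo.Sales_2011 (
        SaleDate   datetime       NOT NULL CHECK (SaleDate >= '20110101' AND SaleDate < '20120101'),
        StoreId    int            NOT NULL,
        ArticleId  int            NOT NULL,
        SalesValue decimal(18, 2) NOT NULL,
        CONSTRAINT PK_Sales_2011 PRIMARY KEY (SaleDate, StoreId, ArticleId)
    );
    GO

    CREATE VIEW dbo.SalesAll
    AS
    SELECT SaleDate, StoreId, ArticleId, SalesValue FROM dbo.Sales_2010
    UNION ALL
    SELECT SaleDate, StoreId, ArticleId, SalesValue FROM dbo.Sales_2011;
    GO

The per-year tables for older data can then live on filegroups backed by the slower disks.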

Not sure what the data access patterns are, but have you looked into Analysis Services? You already paid for it, and it can show dramatic speedups for analytic queries because it makes heavy use of aggregates. Also, with Excel as a front end, a savvy user can create a lot of reports themselves, leaving your time for the interesting ones.

Just a few thoughts from me,

Rgds Gert-Jan

gjvdkamp
  • We were just hired to do this job. We can't come up with any alternative solutions since that's not in the scope of the job. It's not my own company, no say in hardware, no changing the schema because there's just too much that depends on it. Otherwise this question would be on the DBA site... – Koen Nov 08 '11 at 09:15
  • ok I see. What can slow things down with large DML operations are indexes. You can disable them for the operation and rebuild them afterwards (but not the one you need to find the records you want to update). That could help a lot. Otherwise I'd suggest doing small batches at a time, like a couple of thousand rows, not all at once, because it needs to make sure it can roll back the operation. – gjvdkamp Nov 08 '11 at 14:11
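A minimal sketch of that batching pattern (hypothetical names; the real WHERE clause would also have to exclude daily rows that have not been rolled up into weekly rows yet):

    -- Delete a few thousand rows per transaction so the log only has to hold
    -- enough to roll back one small batch at a time.
    DECLARE @cutoff datetime;
    SET @cutoff = DATEADD(YEAR, -1, GETDATE());

    WHILE 1 = 1
    BEGIN
        DELETE TOP (5000)
        FROM dbo.Sales
        WHERE SaleDate < @cutoff;

        IF @@ROWCOUNT = 0 BREAK;
    END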