2

I have a SQL Server 2008 database with 30,000,000,000 records in one of its major tables. Now we are looking at the performance of our queries. We have already created all the indexes. I found that we can split our database tables into multiple partitions, so that the data is spread over multiple files and query performance increases.

But unfortunately this functionality is only available in the SQL Server Enterprise edition, which is unaffordable for us.

Is there any other way to optimize query performance? For example, the query

select * from mymajortable where date between '2000/10/10' and '2010/10/10'

takes around 15 min to retrieve around 10000 records.

peterh
JSJ
  • may be duplicate of http://stackoverflow.com/questions/2794736/best-data-store-for-billions-of-rows – Thunder Feb 09 '11 at 09:07
  • You can always split the database into multiple partitions. You don't have to buy anything. You just have to query the right table yourself, instead of the server figuring out which table. – Dan Grossman Feb 09 '11 at 09:09
  • do you know what clustered indexes are ?? http://msdn.microsoft.com/en-us/library/aa933131%28v=sql.80%29.aspx – Jon Black Feb 09 '11 at 09:11
  • @Dan whenever I fire a partition function query it says "Partition function can only be created in Enterprise edition of SQL Server. Only Enterprise edition of SQL Server supports partitioning." – JSJ Feb 09 '11 at 09:25
  • @Martin I have already created indexes on the fields. Anything else? – JSJ Feb 09 '11 at 09:26
  • @J S Jodha You don't use the partition functionality. You create the tables yourself, put part of the rows into each one yourself, and determine which tables to query yourself. Partitioning has existed since long before "Enterprise Edition" anything existed. – Dan Grossman Feb 09 '11 at 09:31
  • @J S Jodha - Well in that case you need to investigate what on earth is causing the delay then. 15 minutes to return 10,000 records even through bookmark lookups sounds way over what should be expected which is why I was confident there was no index on `date`. What does `*` include? Any BLOB columns? – Martin Smith Feb 09 '11 at 09:42
  • 2
    Also if a 10 year date range only encompasses 10,000 records what range do the other 29,999,990,000 cover? Is this scenario entirely fictitious/hypothetical? I see you've already commented to Mitch's post that you aren't using `*` What else have you left out? – Martin Smith Feb 09 '11 at 10:23
  • 1
    Do not just create indexes on fields but on field combinations and use included columns to for data you retrieve often. – David Mårtensson Feb 09 '11 at 10:57
  • 1
    Upgrade to Enterprise edition. Do it properly. – gbn Feb 09 '11 at 12:26

5 Answers

3

A SELECT * will obviously be less efficiently served than a query that uses a covering index.

First step: examine the query plan and look for table scans and the steps taking the most effort (%).

If you don’t already have an index on your ‘date’ column, you certainly need one (assuming sufficient selectivity). Try to reduce the columns in the select list, and if ‘sufficiently’ few, add these to the index as included columns (this can eliminate bookmark lookups into the clustered index and boost performance).
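For illustration, a covering index on the query from the question might look something like this (a sketch only; col1, col2 and col3 are placeholders for whatever columns your SELECT list actually returns):

-- Index keyed on [date], with the selected columns carried as included columns
CREATE NONCLUSTERED INDEX IX_mymajortable_date
ON dbo.mymajortable ([date])
INCLUDE (col1, col2, col3);
GO

-- The range query can then be answered from the index alone, without bookmark lookups:
SELECT col1, col2, col3
FROM dbo.mymajortable
WHERE [date] BETWEEN '2000-10-10' AND '2010-10-10';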

You could break your data up into separate tables (say by a date range) and combine via a view.
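A rough sketch of that approach (table and column names here are assumptions; each yearly table needs a CHECK constraint on the date column so the optimizer can skip tables outside the requested range):

CREATE TABLE dbo.mymajortable_2009 (
    id     BIGINT   NOT NULL PRIMARY KEY,
    [date] DATETIME NOT NULL
        CONSTRAINT CK_mymajortable_2009 CHECK ([date] >= '2009-01-01' AND [date] < '2010-01-01'),
    col1   INT      NULL
);

CREATE TABLE dbo.mymajortable_2010 (
    id     BIGINT   NOT NULL PRIMARY KEY,
    [date] DATETIME NOT NULL
        CONSTRAINT CK_mymajortable_2010 CHECK ([date] >= '2010-01-01' AND [date] < '2011-01-01'),
    col1   INT      NULL
);
GO

CREATE VIEW dbo.mymajortable_all
AS
SELECT id, [date], col1 FROM dbo.mymajortable_2009
UNION ALL
SELECT id, [date], col1 FROM dbo.mymajortable_2010;
GO

Queries against dbo.mymajortable_all that filter on [date] should then only touch the underlying tables whose CHECK constraints overlap the requested range.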

Performance is also very dependent on your hardware (# cores, RAM, I/O subsystem speed, network bandwidth).

Suggest you post your table and index definitions.

Mitch Wheat
  • 1
    Thanks. i am avoiding the * functionality this is only for example. do you any other idea that can increase performance. – JSJ Feb 09 '11 at 10:10
1

First, always avoid SELECT *, as it forces the select to fetch all columns; if there is an index containing just the columns you need, you are fetching a lot of unnecessary data. Using only the exact columns you need to retrieve lets the server make better use of indexes.

Secondly, have a look at included columns for your indexes; that way, often-requested data can be included in the index to avoid having to fetch rows.

Third, you might try to use an int column for the date and convert the date into an int. Ints are usually more effective in range searches than dates, especially if the column also carries time information; if you can skip the time information, the index will be smaller.
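A minimal sketch of that idea, assuming the table and date column from the question (the select-list column is a placeholder):

-- Persisted computed column holding the date as an int in yyyymmdd form, e.g. 20101010
ALTER TABLE dbo.mymajortable
    ADD date_key AS CONVERT(INT, CONVERT(CHAR(8), [date], 112)) PERSISTED;
GO

CREATE NONCLUSTERED INDEX IX_mymajortable_date_key
    ON dbo.mymajortable (date_key);
GO

-- Range searches then work on plain ints:
SELECT col1
FROM dbo.mymajortable
WHERE date_key BETWEEN 20001010 AND 20101010;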

One more thing to check is the execution plan the server uses; you can see this in Management Studio if you enable "Include Actual Execution Plan" in the menu. It can indicate where the problem lies: you can see which indexes it tries to use, and sometimes it will suggest new indexes to add.

It can also indicate other problems: a Table Scan or Index Scan is bad, as it means the server has to scan through the whole table or index, while an Index Seek is good.

It is a good source to understand how the server works.
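If you prefer doing this from a query window, something like the following standard SET options (the column name is a placeholder) prints the I/O and timing statistics, and together with "Include Actual Execution Plan" (Ctrl+M) makes scans versus seeks easy to spot:

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT col1
FROM dbo.mymajortable
WHERE [date] BETWEEN '2000-10-10' AND '2010-10-10';

-- The Messages tab then shows scan count, logical reads and CPU/elapsed time
-- for the statement.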

David Mårtensson
0

What type of queries are we talking about?

Is this a production table? If yes, look into whether you can go a bit further normalizing the DB.

If this is for reports, including a lot of Ad Hoc report queries, this screams data warehouse.

I would create a DW with separate pre-processed reports which include all the calculations and aggregations you could expect.

I am a bit worried about a business model which involves dealing with BIG data but does not generate enough revenue or even attract enough venture investment to upgrade to enterprise.

NicVerAZ
0

If you add an index on date, you will probably speed up your query due to an index seek + key lookup instead of a clustered index scan. But if your filter on date returns too many records, the index will not help you at all, because the key lookup is executed for each result of the index seek; SQL Server will then switch back to a clustered index scan.

To get the best performance you need to create a covering index; that is, include all the columns you need in the "included columns" part of your index. But that will not help you if you use SELECT *.

Another issue with the SELECT * approach is that you can't use the cache or the execution plans in an efficient way. If you really need all columns, make sure you specify all of them instead of using *.

You should also fully qualify the object name to make sure your plan is reusable.
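For example (schema and column names here are assumptions):

SELECT m.col1, m.col2
FROM dbo.mymajortable AS m          -- two-part, schema-qualified name helps plan reuse
WHERE m.[date] BETWEEN '2000-10-10' AND '2010-10-10';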

Hakan Winther
0

You might consider creating an archive database and moving anything older than, say, 10-20 years into it. This should drastically speed up your primary production database while retaining all of your historical data for reporting needs.
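A rough sketch of what that could look like (the archive database, table and column names are assumptions; in practice you would probably move the rows in smaller batches to keep transactions and log growth manageable):

BEGIN TRANSACTION;

-- Copy everything older than the cutoff into the archive database...
INSERT INTO ArchiveDB.dbo.mymajortable_archive (id, [date], col1)
SELECT id, [date], col1
FROM dbo.mymajortable
WHERE [date] < '2001-01-01';

-- ...and then remove it from the production table.
DELETE FROM dbo.mymajortable
WHERE [date] < '2001-01-01';

COMMIT TRANSACTION;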

DForck42