Background
I'm migrating an application away from a schema with a single event table holding 0.5-1.3 billion records to partitioned views over 30-180 identical underlying tables, each holding a fraction of that number of records. This considerably eases a number of problems around managing and querying that many records. I'm using partitioned views instead of table partitioning because not all my users have access to the Awesome Edition of SQL Server 2012.
The partition function is based on the Cartesian product of the day of the event and one attribute of the event. In other words, all events for 2015 Jun 01 with attribute "foo" go into a table like "Event20150601_foo", all events with attribute "bar" go into "Event20150601_bar", and for the following day events go into "Event20150602_foo" and "Event20150602_bar", and so on. There are typically 2-3 values of the attribute and 15-60 days of events, which maps to the typical range of 30-180 distinct tables.
The basic structure of each table is a composite clustered primary key consisting of EventId (bigint) and PartitionKey (int), followed by a handful of other unindexed columns. The EventId is unique and monotonically increasing across all tables, implemented with a sequence. The PartitionKey is constant within each partition table, so the check constraint on each table is simply "CHECK (PartitionKey = x)", where x is defined per partition table. This allows me to search on EventId via a clustered index seek on every table even when I cannot provide the partition key to narrow the scope of the search. When I can provide both EventId and PartitionKey, the query is very efficient, since the optimizer performs an index seek on only the single table that matches the partition key, which is a big performance win.
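For concreteness, here is a minimal sketch of the layout with just two partition tables for one day; the table names, the Payload column, and the sequence name are illustrative stand-ins, not the real schema:

```sql
-- Sequence that backs EventId across every partition table.
CREATE SEQUENCE dbo.EventIdSequence AS bigint START WITH 1;
GO

-- One partition table per (day, attribute); PartitionKey is constant per table,
-- which the CHECK constraint tells the optimizer about.
CREATE TABLE dbo.Event20150601_foo
(
    EventId      bigint NOT NULL DEFAULT (NEXT VALUE FOR dbo.EventIdSequence),
    PartitionKey int    NOT NULL,
    Payload      varchar(100) NULL,  -- stand-in for the handful of unindexed columns
    CONSTRAINT PK_Event20150601_foo PRIMARY KEY CLUSTERED (EventId, PartitionKey),
    CONSTRAINT CK_Event20150601_foo CHECK (PartitionKey = 1)
);

CREATE TABLE dbo.Event20150601_bar
(
    EventId      bigint NOT NULL DEFAULT (NEXT VALUE FOR dbo.EventIdSequence),
    PartitionKey int    NOT NULL,
    Payload      varchar(100) NULL,
    CONSTRAINT PK_Event20150601_bar PRIMARY KEY CLUSTERED (EventId, PartitionKey),
    CONSTRAINT CK_Event20150601_bar CHECK (PartitionKey = 2)
);
GO

-- The partitioned view is a plain UNION ALL over all partition tables.
CREATE VIEW dbo.PartitionedView
AS
SELECT EventId, PartitionKey, Payload FROM dbo.Event20150601_foo
UNION ALL
SELECT EventId, PartitionKey, Payload FROM dbo.Event20150601_bar;
GO
```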
Experiment
It would make my life even easier if I could go even more granular and partition based on the day of the event and a broader combination of attributes (500-5,000 instead of the aforementioned 2-3), which would then require somewhere in the range of 10,000-50,000 different partitions. I tested this out on a database with 20,000 different tables and found, unfortunately, that the running time for different operations goes up at a greater-than-linear rate with the number of tables. Per the query statistics, most of the time was spent parsing and compiling the query rather than executing it.
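The test setup was scripted with dynamic SQL along these lines; the @TableCount value, the EventPart naming, and the column list here are placeholders rather than the actual generator:

```sql
DECLARE @TableCount int = 2000;          -- vary this: 100, 1000, 2000, 4000, 16000...
DECLARE @i int = 1,
        @sql nvarchar(max),
        @union nvarchar(max) = N'';

WHILE @i <= @TableCount
BEGIN
    -- Create one empty partition table with its CHECK constraint.
    SET @sql = N'CREATE TABLE dbo.EventPart' + CAST(@i AS nvarchar(10)) + N' (
        EventId      bigint NOT NULL,
        PartitionKey int    NOT NULL CHECK (PartitionKey = ' + CAST(@i AS nvarchar(10)) + N'),
        Payload      varchar(100) NULL,
        PRIMARY KEY CLUSTERED (EventId, PartitionKey));';
    EXEC sys.sp_executesql @sql;

    -- Accumulate the UNION ALL branch for the view definition.
    SET @union += CASE WHEN @i = 1 THEN N'' ELSE N' UNION ALL ' END
                + N'SELECT EventId, PartitionKey, Payload FROM dbo.EventPart' + CAST(@i AS nvarchar(10));
    SET @i += 1;
END;

-- The "Create the partitioned view" timing below is essentially this single statement.
SET @sql = N'CREATE VIEW dbo.PartitionedView AS ' + @union;
EXEC sys.sp_executesql @sql;
```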
Here are the running times I found for different operations (pardon the rudimentary table):
- Create the partitioned view
  - 100 tables: 50 ms
  - 1,000 tables: 800 ms
  - 2,000 tables: 2,660 ms
  - 4,000 tables: 10,000 ms
  - 16,000 tables: 225,000 ms
- select * from PartitionedView where EventId = x (scans all tables)
  - 100 tables: 78 ms parse/compile; 25 ms exec
  - 1,000 tables: 3,500 ms parse/compile; 160 ms exec
  - 2,000 tables: 15,000 ms parse/compile; 500 ms exec
  - 4,000 tables: 68,000 ms parse/compile; 2,000 ms exec
  - 16,000 tables: cancelled after more than 10 minutes of parse/compile time!
- select * from PartitionedView where (EventId = x) and (PartitionKey = y) (scans only one table)
  - 100 tables: 74 ms parse/compile; 1 ms exec
  - 1,000 tables: 2,500 ms parse/compile; 15 ms exec
  - 2,000 tables: 11,000 ms parse/compile; 10 ms exec
  - 4,000 tables: 50,000 ms parse/compile; 16 ms exec
  - 16,000 tables: cancelled after more than 10 minutes of parse/compile time!
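The parse/compile versus execution split above comes from the session query statistics, captured along these lines (the literal EventId and PartitionKey values are placeholders):

```sql
-- Only on a test instance: flush cached plans so compile time is measured every run.
DBCC FREEPROCCACHE;

SET STATISTICS TIME ON;

SELECT * FROM dbo.PartitionedView WHERE EventId = 123456789;                        -- touches every table
SELECT * FROM dbo.PartitionedView WHERE EventId = 123456789 AND PartitionKey = 42;  -- single table

SET STATISTICS TIME OFF;
-- The Messages tab reports "SQL Server parse and compile time" and
-- "SQL Server Execution Times" for each statement.
```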
Question
Should I be scolded for even considering using that many tables? If not, is there a way I can reduce the query parse & compile times when there are large numbers of tables present?