Performance tuning on reading huge table

Question

I have a huge table with more than one hundred million of rows and I have to query this table to return a set of data in a minimum of time.

So I have created a test environment with this table definition:

CREATE TABLE [dbo].[Test](
[Dim1ID] [nvarchar](20) NOT NULL,
[Dim2ID] [nvarchar](20) NOT NULL,
[Dim3ID] [nvarchar](4) NOT NULL,
[Dim4ID] [smalldatetime] NOT NULL,
[Dim5ID] [nvarchar](20) NOT NULL,
[Dim6ID] [nvarchar](4) NOT NULL,
[Dim7ID] [nvarchar](4) NOT NULL,
[Dim8ID] [nvarchar](4) NOT NULL,
[Dim9ID] [nvarchar](4) NOT NULL,
[Dim10ID] [nvarchar](4) NOT NULL,
[Dim11ID] [nvarchar](20) NOT NULL,
[Value] [decimal](21, 6) NOT NULL,
 CONSTRAINT [PK_Test] PRIMARY KEY CLUSTERED 
(
[Dim1ID] ASC,
[Dim2ID] ASC,
[Dim3ID] ASC,
[Dim4ID] ASC,
[Dim5ID] ASC,
[Dim6ID] ASC,
[Dim7ID] ASC,
[Dim8ID] ASC,
[Dim9ID] ASC,
[Dim10ID] ASC,
[Dim11ID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

This table is the fact table of Star schema architecture (fact/dimensions). As you can see I have a clustered index on all the columns except for the “Value” column.

I have filled this data with approx. 10,000,000 rows for testing purpose. The fragmentation is currently at 0.01%.

I would like to improve the performance when reading a set of rows from this table using this query:

DECLARE @Dim1ID nvarchar(20) = 'C1'
DECLARE @Dim9ID nvarchar(4) = 'VRT1'
DECLARE @Dim10ID nvarchar(4) = 'S1'
DECLARE @Dim6ID nvarchar(4) = 'FRA'
DECLARE @Dim7ID nvarchar(4) =  '' -- empty = all
DECLARE @Dim8ID nvarchar(4) = '' -- empty = all

DECLARE @Dim2 TABLE ( Dim2ID nvarchar(20) NOT NULL )
INSERT INTO @Dim2 VALUES ('A1'), ('A2'), ('A3'), ('A4');

DECLARE @Dim3 TABLE ( Dim3ID nvarchar(4) NOT NULL )
INSERT INTO @Dim3 VALUES ('P1');

DECLARE @Dim4ID TABLE ( Dim4ID smalldatetime NOT NULL )
INSERT INTO @Dim4ID VALUES ('2009-01-01'), ('2009-01-02'), ('2009-01-03');

DECLARE @Dim11 TABLE ( Dim11ID nvarchar(20) NOT NULL )
INSERT INTO @Dim11 VALUES ('Var0001'), ('Var0040'), ('Var0060'), ('Var0099')

SELECT   RD.Dim2ID,
         RD.Dim3ID,
         RD.Dim4ID,
         RD.Dim5ID,
         RD.Dim6ID,
         RD.Dim7ID,
         RD.Dim8ID,
         RD.Dim9ID,
         RD.Dim10ID,
         RD.Dim11ID,
         RD.Value
FROM     dbo.Test RD
         INNER JOIN @Dim2 R
           ON RD.Dim2ID = R.Dim2ID
         INNER JOIN @Dim3 C
           ON RD.Dim3ID = C.Dim3ID
         INNER JOIN @Dim4ID P
           ON RD.Dim4ID = P.Dim4ID
         INNER JOIN @Dim11 V
          ON RD.Dim11ID = V.Dim11ID
WHERE    RD.Dim1ID = @Dim1ID
         AND RD.Dim9ID = @Dim9ID
         AND ((@Dim6ID <> '' AND RD.Dim6ID = @Dim6ID) OR @Dim6ID = '')
         AND ((@Dim7ID <> '' AND RD.Dim7ID = @Dim7ID) OR @Dim7ID = '')
         AND ((@Dim8ID <>'' AND RD.Dim8ID = @Dim8ID) OR @Dim8ID = '')

I have tested this query and that’s returned 180 rows with these times: 1st execution: 1 min 32 sec; 2nd execution: 1 min.

I would like to return the data in a few seconds if it’s possible.

I think I can add the non-clustered indexes but I am not sure what the best way is to set the non-clustered indexes! If having sorted order data in this table could improve the performances? Or are there other solutions than indexes?

Thanks.

For confidentiality I preferred not put the true names. But I reckon I could write dummy names. — Dan, Mar 28 '11 at 16:45

score 3 · Accepted Answer · edited May 23 '17 at 09:58

Consider your datatypes as one problem. Do you need nvarchar? It's measurably slower

Second problem: the PK is wrong for your query, It should be Dim1ID, Dim9ID first (or vice versa based on selectivity). or some flavour with the JOIN columns in.

Third problem: use of OR. This construct usually works despite what nay-sayers who don't try it will post.

RD.Dim7ID = ISNULL(@Dim7ID, RD.Dim7ID)

This assumes @Dim7ID is NULL though. The optimiser will short circuit it in most cases.

score 2 · Answer 2 · answered Mar 28 '11 at 16:47

I'm with gbn on this. Typically in star schema data warehouses, the dimension IDs are int, which is 4 bytes. Not only are all your dimensions larger than that, the nvarchar are both varying and using wide characters.

As far as indexing, just one clustering index may be fine since in the case of your fact table, you really don't have many facts. As gbn says, with your particular example, your index needs to be in the order of the columns which you are going to be providing so that the index can actually be used.

In a real-world case of a fact table with a number of facts, your clustered index is simply for data organization - you'll probably be expecting some non-clustered indexes for specific usages.

But I'm worried that your query specifies an ID parameter. Typically in a DW environment, you don't know the IDs, for selective queries, you select based on the dimensions, and the ids are meaningless surrogates:

SELECT *
FROM fact
INNER JOIN dim1
    ON fact.dim1id = dim1.id
WHERE dim1.attribute = ''

Have you looked at Kimball's books on dimensional modeling? I think if you are going to a star schema, you should probably be familiar with his design techniques, as well as the various pitfalls he discusses with the too many and too few dimensions.

RacerX · Answer 3 · 2011-03-28T16:34:36.493

see this: Dynamic Search Conditions in T-SQL Version for SQL 2008 (SP1 CU5 and later)

quick answer, if you are on the right service pack of SQL Server 2008, is to try adding that to the end of the query:

OPTION(RECOMPILE)

when on the proper service pack of SQL Server 2008, the OPTION(RECOMPILE) will build the execution plan based on the runtime value of the local variables.

For people still using SQl Server 2008 without the proper service packs or still on 2005 see: Dynamic Search Conditions in T-SQLVersion for SQL 2005 and Earlier

score 0 · Answer 4 · answered Mar 28 '11 at 16:39

I'd be a little concerned about having all the non-value columns in your clustered index. That will make for a large index in the non-leaf levels. And, that key will be used in the nonclustered indexes. And, it will only provide any benefit when [Dim1ID] is included in the query. So, even if you're only optimizing this query, you're probably getting a full scan.

I would consider a clustered index on the most-commonly used key, and if you have a lot of date-related queries (e.g., date between a and b), go with the date key. Then, create non clustered indexes on the other key values.

We have similar tables with even more columns in the key and billion rows+. This is normal for OLAP type tables. — gbn, Mar 28 '11 at 16:44

Performance tuning on reading huge table

4 Answers4