1

This is basically the same question asked in (Solution for speeding up a slow SELECT DISTINCT query in Postgres )

It is pretty the same situation, a huge deployed database and it is not possible to normalise it due to old legacy applications. New rows are constantly added and old ones are removed fulfiling certain criterias. I have tried the suggestion by using CTE and I see no performance gain here, pretty the same execution time as for the original

select distinct [somecolumn] 
from bigtable

Same applies for the suggestion to use Group by.

What seems to work best is the suggestion to create a view and query the view instead. (the cache has been reseted between queries)

I need some advice here on this because I do not really understand why this results in better performance.

create view [dbo].[vwDistinct]
with schemabinding
as 
   select 
      [somecolumn], count_big(*) as TableCount
   from 
      dbo.BigTable 
   where
      somecolumn IS NOT NULL
   group by 
      somecolumn;

select distinct somecolumn 
from vwDistinct

The application uses a stored procedure for the call. The database is on SQL Server 2008 R2 but if there is good reason, it can be moved to SQL Server 2014.

Thank you

This is the execution plan for select somecolumn from vwDistinct

<?xml version="1.0" encoding="utf-16"?>
<ShowPlanXML xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" Version="1.2" Build="12.0.2000.8" xmlns="http://schemas.microsoft.com/sqlserver/2004/07/showplan">
  <BatchSequence>
    <Batch>
      <Statements>
        <StmtSimple StatementCompId="1" StatementEstRows="41138.3" StatementId="1" StatementOptmLevel="FULL" CardinalityEstimationModelVersion="70" StatementSubTreeCost="5.10782" StatementText="select somecolumn from vwDistinct" StatementType="SELECT" QueryHash="0x23700E4CF62A8E4E" QueryPlanHash="0x79D8240601D270CB" RetrievedFromCache="true">
          <StatementSetOptions ANSI_NULLS="true" ANSI_PADDING="true" ANSI_WARNINGS="true" ARITHABORT="true" CONCAT_NULL_YIELDS_NULL="true" NUMERIC_ROUNDABORT="false" QUOTED_IDENTIFIER="true" />
          <QueryPlan NonParallelPlanReason="EstimatedDOPIsOne" CachedPlanSize="16" CompileTime="59" CompileCPU="18" CompileMemory="336">
            <MemoryGrantInfo SerialRequiredMemory="0" SerialDesiredMemory="0" />
            <OptimizerHardwareDependentProperties EstimatedAvailableMemoryGrant="1239807" EstimatedPagesCached="77487" EstimatedAvailableDegreeOfParallelism="1" />
            <RelOp AvgRowSize="37" EstimateCPU="1.03877" EstimateIO="4.06905" EstimateRebinds="0" EstimateRewinds="0" EstimatedExecutionMode="Row" EstimateRows="944197" LogicalOp="Clustered Index Scan" NodeId="1" Parallel="false" PhysicalOp="Clustered Index Scan" EstimatedTotalSubtreeCost="5.10782" TableCardinality="944197">
              <OutputList>
                <ColumnReference Database="[BigData]" Schema="[dbo]" Table="[vwDistinct]" Column="somecolumn" />
              </OutputList>
              <IndexScan Ordered="false" ForcedIndex="false" ForceScan="false" NoExpandHint="true" Storage="RowStore">
                <DefinedValues>
                  <DefinedValue>
                    <ColumnReference Database="[BigData]" Schema="[dbo]" Table="[vwDistinct]" Column="somecolumn" />
                  </DefinedValue>
                </DefinedValues>
                <Object Database="[BigData]" Schema="[dbo]" Table="[vwDistinct]" Index="[cdxDistinct]" IndexKind="ViewClustered" Storage="RowStore" />
                <IndexedViewInfo>
                  <Object Database="[BigData]" Schema="[dbo]" Table="[BigTable]" />
                </IndexedViewInfo>
              </IndexScan>
            </RelOp>
          </QueryPlan>
        </StmtSimple>
      </Statements>
    </Batch>
  </BatchSequence>
</ShowPlanXML>
Community
  • 1
  • 1
Hans
  • 269
  • 4
  • 14
  • 1
    Since the view does group by somecolumn, it will only return distinct somecolumn's. So you can do just "select somecolumn from vwDistinct" and still get distinct values! – jarlh Feb 10 '15 at 10:04
  • You can add index on that column – Giorgi Nakeuri Feb 10 '15 at 10:05
  • 1
    How many rows is there in the table and how many distinct values is there in `somecolumn`? – Mikael Eriksson Feb 10 '15 at 10:07
  • Is it faster without the `GROUP BY` and `COUNT_BIG`? So `select DISTINCT [somecolumn] from dbo.BigTable`. Just a wild guess, maybe you need to refactor the schema and create a new table which contains those distinct `somecolumn`-values and link both tables via foreign-keys. – Tim Schmelter Feb 10 '15 at 10:10
  • are there very many `NULL` values in `[somecolumn]`? I notice the second query filters out `NULL` values. How does `SELECT DISTINCT [somecolumn] FROM [dbo].[bigtable] WHERE [somecolumn] IS NOT NULL;` perform? – Jodrell Feb 10 '15 at 10:44
  • What indices including `[somecolumn]` do you have on `[dbo].[bigtable]`? – Jodrell Feb 10 '15 at 10:50
  • Around 2.5-3 milion rows and there are about 100 distinct values. – Hans Feb 10 '15 at 10:52
  • I cannot refactor the schema since old legacy applications access the tables directly with client sql. – Hans Feb 10 '15 at 10:53
  • the bigtable contains about 25 columns and there are already an index on the column CREATE NONCLUSTERED INDEX [IX_BigTable_somecolumn] ON [dbo].[BigTable] ( [SomeColumn] ASC )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) – Hans Feb 10 '15 at 10:55
  • Your query has estimated 944197 distinct values and you say you have 100. There must be something wrong with the statistics for that to happen. – Mikael Eriksson Feb 10 '15 at 11:22
  • You are right, I picked the execution plan from another test db but with the same scheme. – Hans Feb 10 '15 at 11:59

1 Answers1

3
<Object Database="[BigData]" Schema="[dbo]" Table="[vwDistinct]" Index="[cdxDistinct]" IndexKind="ViewClustered" Storage="RowStore" /> 

shows that your view is indexed. Based on your comment where you state

Around 2.5-3 milion rows and there are about 100 distinct values.

The query

select distinct [somecolumn] 
from bigtable

without the view will scan 2.5+ million rows in the table's index to find all the distinct values.

The view, however, will only contain 100 rows. So when that exists it can perform a scan on the view's clustered index to find all the distinct values.

The cost is that all inserts and any updates that modify somecolumn will be more expensive.

Andy Nichols
  • 2,952
  • 2
  • 20
  • 36