
I have a set of tables with large row sets (think billions of rows), and I need to get the count of rows between two parameter values and assign two variables from the result; I am using COUNT(*) at present.

This will be executed several thousand times against the same table with different comparison values passed in, as part of a larger process in which the input parameters change and the overall table grows by several hundred thousand or even a million rows.

There may be several (4-5) of these queries running against the same table at the same time, but with different parameter values. The compared column is indexed; it is the single-column primary key in some tables, while in others it is part of a compound primary key. The returned count will in most cases range from 500,000 to 1,000,000.

Is there any faster way to get and use count values from a changing range?

Important: this is NOT for the entire table but for a range within the table, so approaches like sp_spaceused are not applicable.

SELECT
    @newOUT        = COUNT(*),  -- both outputs receive the same range count
    @RightValueOUT = COUNT(*)
FROM mydatabase.myschema.mytable -- the table containing myidcolumn
WHERE
    myidcolumn >= @MinRowValue
    AND myidcolumn <= @MaxRowValue
Mark Schultheiss
  • Possible duplicate of [Fastest way to count exact number of rows in a very large table?](https://stackoverflow.com/questions/6069237/fastest-way-to-count-exact-number-of-rows-in-a-very-large-table) – cetver May 18 '19 at 14:28
  • Does the count need to be 100% exact or is an estimation good enough? Is the `myidcolumn` an identity column? Do a lot of rollbacks happen for inserts (often, rare, extremely rare, ...)? Would also be interesting to know what the use-case is for these counts, as the software seems to be doing a lot of them. – TT. May 18 '19 at 14:31
  • You could consider making a "calculated" [indexed view](https://learn.microsoft.com/en-us/sql/relational-databases/views/create-indexed-views?view=sql-server-2017) (with the `WITH SCHEMABINDING` option) which automatically updates the totals in the view when the data changes, to service queries like these. But I'm pretty sure you will need extra disk space to handle this. – Raymond Nijland May 18 '19 at 14:39
  • Is combining the separate queries an option, so it can calculate multiple different counts at the same time? – Martin Smith May 18 '19 at 14:51
  • @cetver It's not a duplicate of that question, as this question is about counting the number of rows in a range as opposed to the whole table. – TT. May 18 '19 at 15:27
  • Count has to be exact. Some of the tables have an identity column, but not all. No rollbacks. Cannot use views or other constructs; the number of tables with the requirement is too high. Cannot really combine - the rows in each range will be changing over time, sometimes fast. Range IS important. – Mark Schultheiss May 18 '19 at 15:59
  • I do NOT consider this to be a duplicate due to the range, and I do know about sp_spaceused, but that covers the entire table, which is not what is desired. – Mark Schultheiss May 18 '19 at 16:09
  • Note also, to clarify, that there may be other writes before and after the range. – Mark Schultheiss May 18 '19 at 16:17
  • @TT. The use case here is copying a huge number of rows from one set of tables to another. One reason for the count is to determine whether the count in a given range exactly matches the count for the same range in another table; if so, we can skip the extra work of making it so with a subsequent copy query - basically we can then skip a prior copy of a set within a range. The copy is being done in range batches for various reasons: some are errors caused by external events, and others are to choose a range of rows to "precopy" with early create/update times before the mass of recent rows is processed. – Mark Schultheiss May 18 '19 at 16:50

2 Answers


I can think of two options for such a situation.

First, when the id being compared is part of a composite primary key, be sure that it is the first column in the key. If you have to do comparisons on multiple columns, then use a secondary index for each one, as sketched below.
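
For instance, a minimal sketch, assuming a hypothetical table dbo.mytable whose compound primary key is (othercol, myidcolumn), so myidcolumn is not the leading column (all names here are illustrative, not from the question):

-- The compound PK (othercol, myidcolumn) cannot be seeked into with a
-- range predicate on myidcolumn alone, so a narrow secondary index
-- restores seek access for the range count.
CREATE INDEX IX_mytable_myidcolumn
    ON dbo.mytable (myidcolumn);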

Second, if these are insert-only tables with increasing primary keys, then you can store a cumulative count on each row as it is inserted. This requires a trigger or other logic on the insert.
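
A rough sketch of such a trigger, again with hypothetical names (dbo.mytable, myidcolumn, and a new running_count column); note that concurrent inserters would need extra serialization, which is omitted here:

-- Hypothetical column holding each row's 1-based insertion ordinal.
ALTER TABLE dbo.mytable ADD running_count BIGINT NULL;
GO

CREATE TRIGGER trg_mytable_running_count
ON dbo.mytable
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- Ordinal of the last row counted so far (0 for an empty table).
    -- With increasing keys, a backward ordered read on the key finds
    -- it quickly, since only the newest rows are still NULL.
    DECLARE @base BIGINT =
        ISNULL((SELECT TOP (1) running_count
                FROM dbo.mytable
                WHERE running_count IS NOT NULL
                ORDER BY myidcolumn DESC), 0);

    -- Number the new rows in key order and extend the running count.
    UPDATE t
    SET    t.running_count = @base + x.rn
    FROM   dbo.mytable AS t
    JOIN  (SELECT myidcolumn,
                  ROW_NUMBER() OVER (ORDER BY myidcolumn) AS rn
           FROM   inserted) AS x
        ON x.myidcolumn = t.myidcolumn;
END;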

Then you could retrieve the count by looking up the extreme values. That is two lookups rather than an index scan.
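
A minimal sketch of that retrieval, with the same hypothetical names, assuming running_count is kept current and the range is non-empty; each TOP (1) is a single seek on the key:

DECLARE @MinRowValue BIGINT = 1000,      -- placeholder bounds
        @MaxRowValue BIGINT = 2000000;
DECLARE @lo BIGINT, @hi BIGINT, @newOUT BIGINT;

-- Ordinal of the first row at or above the lower bound: one seek.
SELECT TOP (1) @lo = running_count
FROM dbo.mytable
WHERE myidcolumn >= @MinRowValue
ORDER BY myidcolumn ASC;

-- Ordinal of the last row at or below the upper bound: one seek.
SELECT TOP (1) @hi = running_count
FROM dbo.mytable
WHERE myidcolumn <= @MaxRowValue
ORDER BY myidcolumn DESC;

-- Exact number of rows in the range, without scanning it.
SET @newOUT = @hi - @lo + 1;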

Gordon Linoff
  • Thanks. I cannot really change the indexes; the inserts may occur out of order, but the given range remains. There are possible gaps within the range. Note the values are copied, and in the source table at least one value will exist but might not (yet) in the target table. – Mark Schultheiss May 18 '19 at 16:02
  • @MarkSchultheiss . . . If you have cumulative counts, you can still optimize the query even if the end points are not in the query. But there is overhead to keeping the cumulative counts up-to-date. – Gordon Linoff May 18 '19 at 16:09
  • Difficult to use a cumulative method here; there may be multiple queries running at one time, and my tempdb is already under pressure for space and memory - there is a LOT going on, with heavy writes to multiple tables and from peer queries to the same table as well. – Mark Schultheiss May 18 '19 at 16:12
  • Good note on the composite primary key - I may consider that in some cases. One catch is that the range values are derived from the densest (most populous) ID among the columns, in an attempt to minimize the number of attempts to insert into a given range where the other columns do not match a given row (thus reducing blocking), and also to create a predictable relative number of rows in each range. – Mark Schultheiss May 18 '19 at 16:25

Count queries can be tough to optimize, because the COUNT(*) operation involves touching every record in scope for the count. But we can try indexing myidcolumn, which appears in the WHERE clause, in the hope that the range predicate is restrictive:

CREATE INDEX idx ON yourTable (myidcolumn);

If SQL Server uses the index, it might be able to generate the count from a single seek over the narrow index range instead of scanning the table.
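
One quick way to sanity-check this (a sketch with placeholder bounds; the table and index names follow the answer above):

-- Placeholder parameter values for illustration.
DECLARE @MinRowValue BIGINT = 1000, @MaxRowValue BIGINT = 2000000;

SET STATISTICS IO ON;   -- report logical reads in the Messages tab

SELECT COUNT(*)
FROM yourTable
WHERE myidcolumn >= @MinRowValue
  AND myidcolumn <= @MaxRowValue;

-- With the index in place, the actual execution plan should show an
-- Index Seek on idx rather than a full scan, along with far fewer
-- logical reads.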

Tim Biegeleisen