5

I have a large table with 100,000,000 rows. I'd like to select every n'th row from the table. My first instinct is to use something like this:

SELECT id,name FROM table WHERE id%125000=0

to retrieve a even spread of 800 rows (id is a clustered index)

This technique works fine on smaller data sets but with my larger table the query takes 2.5 minutes. I assume this is because the modulus operation is applied to every row. Is there a more optimal method of row skipping ?

Brian
  • 83
  • 1
  • 7

4 Answers4

2

The time is not going into the modulus operation itself, but rather into just reading 124,999 unnecessary rows for every row that you actually want (i.e., the Table Scan or Clustered Index Scan).

Just about the only way to speed up a query like this is something that seems at first illogical: Add an extra non-Clustered index on just that column ([ID]). Additionally, you may have to add an Index Hint to force it to use that index. And finally, it may not actually make it faster, though for a modulus of 125,000+, it should be (though it'll never be truly fast).


If your IDs are not necessarily contiguous (any deleted rows will pretty much cause this) and you really do need exactly every modulo rows, by ID order, then you can still use the approach above, but you will have to resequence the IDs for the Modulo operation using ROW_NUMBER() OVER(ORDER BY ID) in the query.

RBarryYoung
  • 55,398
  • 14
  • 96
  • 137
  • It would be great if filtered indexes supported the modulo operator, or a computed column that calculated the modulo for you could be used. – Aaron Bertrand Sep 03 '13 at 19:16
  • @AaronBertrand Yeah, of course you could also set an additional column manually to facilitate the same effect. – RBarryYoung Sep 03 '13 at 19:18
2

Your query assumes that the IDs are contiguous (and probably they aren't without you realizing this...). Anyway, you should generate the IDs yourself:

select *
from T
where ID in (0, 250000*1, 250000*2, ...)

Maybe you need a TVP to send all IDs because there are so many. Or, you produce the IDs on the server in T-SQL or a SQLCLR function or a numbers table.

This technique allows you to perform index seeks and will be the fastest you can possibly produce. It reads the minimal amount of data possible.

Modulo is not SARGable. SQL Server could support this if Microsoft wanted it, but this is an exotic use case. They will never make modulo SARGable and they shouldn't.

usr
  • 168,620
  • 35
  • 240
  • 369
  • I don't know about the performance implications, but you could inner join with this CTE instead of using IN (...) `with NumberRange(Number) as ( select 125000 as Number union all select Number + 125000 from NumberRange where Number < (125000 * 800) ) select * from NumberRange option (maxrecursion 800)` – Jeremy Cook Sep 03 '13 at 19:49
  • I was wondering if someone was going to mention the non-Contiguous ID problem ... :-) – RBarryYoung Sep 03 '13 at 19:55
  • The problem is that it could possibly miss many of those rows because not all ID values are populated. – Aaron Bertrand Sep 03 '13 at 19:59
  • @AaronBertrand valid concern. Remedy: drive the query off the IDs and cross apply the top 1 row that is greater than the given ID. Still not perfect, but much better. Still perfect index seeks. – usr Sep 03 '13 at 20:07
1

If id is in an index, then I am thinking of something along these lines:

with ids as (
      select 1 as id
      union all
      select id + 125000
      from ids
      where id <= 100000000
  )
select ids.id,
       (select name from table t where t.id = ids.id) as name
from ids
option (MAXRECURSION 1000);

I think this formulation will use the index on table.

EDIT:

As I think about this approach, you can actually use it to get actual random ids in the table, rather than just evenly spaced ones:

with ids as (
      select 1 as cnt,
             ABS(CONVERT(BIGINT,CONVERT(BINARY(8), NEWID()))) % 100000000 as id
      union all
      select cnt + 1, ABS(CONVERT(BIGINT,CONVERT(BINARY(8), NEWID()))) % 100000000
      from ids
      where cnt < 800
  )

select ids.id,
       (select name from table t where t.id = ids.id) as name
from ids
option (MAXRECURSION 1000);

The code for the actual random number generator came from here.

EDIT:

Due to quirks in SQL Server, you can still get non-contiguous ids, even in your scenario. This accepted answer explains the cause. In short, identity values are not allocated one at a time, but rather in groups. The server can fail and even unused values get skipped.

One reason I wanted to do the random sampling was to help avoid this problem. Presumably, the above situation is rather rare on most systems. You can use the random sampling to generate say 900 ids. From these, you should be able to find 800 that are actually available for your sample.

Community
  • 1
  • 1
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • I'm marking this answer as correct, but to be fair to the other most excellent contributors I should have added more detail of my use case to the question. The table is generated from a data logger. The id value is guaranteed to be contiguous as the table will remain static. The reason why Gordon's answer suits my application is that I'm trying to create an overview graph of the data, so using a random sample will be good enough. Incidentally I tried TABLESAMPLE which should in theory produce the same result, but on SQL SERVER 2012 at least I don't see a truly random selection. – Brian Sep 04 '13 at 09:11
0
 DECLARE @i int, @max int, @query VARCHAR(1000)
 SET @i = 0
 SET @max = (SELECT max(id)/125000 FROM Table1)
 SET @query = 'SELECT id, name FROM Table1 WHERE id in ('
 WHILE @i <= @max
 BEGIN
     IF @i > 0 SET @query = @query + ','
     SET @query = @query + CAST(@i*125000 as varchar(12))
     SET @i = @i + 1
 END
 SET @query = @query + ')'
 EXEC(@query)

EDIT :

To avoid any "holes" in a non-Contiguous ID situation, you can try something like this :

DECLARE @i int, @start int, @id int, @max int, @query VARCHAR(1000)
SET @i = 0
SET @max = (SELECT max(id)/125000 FROM Table1)
SET @query = 'SELECT id, name FROM Table1 WHERE id in ('
WHILE @i <= @max
BEGIN
    SET @start = @i*125000
    SET @id = (SELECT TOP 1 id FROM Table1 WHERE id >= @start ORDER BY id ASC)
    IF @i > 0 SET @query = @query + ','
    SET @query = @query + CAST(@id as VARCHAR(12))
    SET @i = @i + 1
END
SET @query = @query + ')'
EXEC(@query)
Fabien TheSolution
  • 5,055
  • 1
  • 18
  • 30