7

Using SQL Server, I have a table with around 5.5 million rows and I want to randomly select a set of maybe 120 rows that meet some criteria.

That's some what related to Select n random rows from SQL Server table and https://msdn.microsoft.com/en-us/library/cc441928.aspx, but my problem is that I want to be able to seed this so I can randomly pick the same 120 rows consistently and then get a different, random set of rows if I use a different seed.

I could do something like this in my application:

var rand = new Random(seed);
var allExamples = db.myTable.Where(/*some condition*/).ToList();
var subSet = db.myTable.Select(x => new { x, r = rand.NextDouble())
    .OrderBy(x => x.r)
    .Take(120)
    .Select(x => x.x).ToList();

Which works, but, as you might guess, with 5.5 million rows is glacially slow. So I'm really looking for a way to make this work on the SQL server side so I don't have to retrieve and process all the rows.

Community
  • 1
  • 1
Matt Burland
  • 44,552
  • 18
  • 99
  • 171
  • Pre-calculate and store the 'seed' (order) on the table – Allan S. Hansen Dec 23 '15 at 16:53
  • @AllanS.Hansen: a) I'd prefer not to have to alter the table if at all possible (but can if necessary). b) When I change the seed, I'd have to precalculate the entire table again. – Matt Burland Dec 23 '15 at 16:55
  • is space an issue? if not have a one to many table linked to it with the primary key of the table & the seed as the primary key and the value as a 3th column? – Fredou Dec 23 '15 at 17:08
  • Seems that you have a bit of contradiction here. You want random rows but you also want to make it predictable and repeatable. That goes against the concept of random rows. You could maybe store the primary keys from a given random set in another table with your seed value. Then you could retrieve the same rows again and generate a new set of random values when you want a new seed. – Sean Lange Dec 23 '15 at 17:12
  • 1
    You say you want this to run on the server-side, but are you looking for a LINQ query, SQL, or does it matter? As you point out, this is best solved by a seedable random so as long as you don't mind adding another problem, SQL CLR might be a way around it. – M.Babcock Dec 23 '15 at 17:15
  • since the table is 5 million rows you can just modulo the ids by some number less than 5 Million/120 and then just use another seed to off set things or another seed to skip a certain amount of rows – johnny 5 Dec 23 '15 at 17:18
  • Out of the 5.5 million rows how many match the `WHERE` and would be eligible for being picked as one of the 120? – Martin Smith Dec 23 '15 at 17:28
  • @MartinSmith: Right now it's about 11k rows. – Matt Burland Dec 23 '15 at 17:40
  • @M.Babcock *"but are you looking for a LINQ query, SQL, or does it matter?"* Ultimately, what matters is that it isn't too slow. – Matt Burland Dec 23 '15 at 17:41
  • 1
    @MattBurland - So the time is probably spent in the `WHERE` not the `ORDER BY` anyway. (RE: glacially slow). If you can use an index to find the matching rows efficiently probably the time spent sorting 11K rows will not be noticeable. – Martin Smith Dec 23 '15 at 17:42

2 Answers2

2

If you want something that looks random then mix your [PrimaryKey] with some other data...

SELECT *
FROM [your table]
ORDER BY
    CHECKSUM([primarykey]) ^ CHECKSUM('your seed') 

... this will still be a table scan but it should have better performance then pulling the entire set of data do your client just to throw away everything except 120 rows.

Matthew Whited
  • 22,160
  • 4
  • 52
  • 69
0

Assign a unique guid to each row as an indexed column (but not a clustered one). Choose three alpha-numeric characters randomly, and then select all rows where the guid starts with those characters.

36*36*36=46,656
5,500,000/46,656 ~= 117.88

Since newid() doesn't follow a pattern, that provides your random grouping, evenly distributed, and if you use the same three characters you'll always get the same data back, which covers the seeding.

If that isn't performant enough, create another column to specifically index the first three characters.

(Sorry for the brevity - on my phone)

Bobson
  • 13,498
  • 5
  • 55
  • 80