3

Scenario

I have a database consisting of a table called Problems which in turn consists of 2 columns:

  • ID
  • Problem

I populated the tables with millions of records.

In my application, I want to retrive a set of N distinct records taken at random to make a problem sheet. A problem sheet consist of N problems. The randomness should be good enough to make sure I don't produce an problem sheet that is similar to the previous examination.

What is the simplest way to do so using Linq?

Second Person Shooter
  • 14,188
  • 21
  • 90
  • 165
  • 2
    *The randomness should be good enough to make sure I don't produce an problem sheet that is similar to the previous examination.* - If the randomness is good then repeats *will* happen. If you exclude some possibilities the result will be *less* random. – Mark Byers Jan 18 '11 at 06:50
  • 2
    Define `unique`. Is ID a PK (hence unique)? – leppie Jan 18 '11 at 06:50
  • @ivo - looping speed is irrelevant if you use LINQ to push the work to the DB, and IIRC the idea here is to **not** touch many of the records... – Marc Gravell Jan 18 '11 at 07:12
  • @ivo - looping speed is also pretty irrelevant if you *are* fetching data, as it will be virtually nothing compared to the IO. To summarise: in most cases, looping speed is irrelevant – Marc Gravell Jan 18 '11 at 07:15

3 Answers3

3

Something like:

IQueryable<Problem> problems = db.Problems;
int count = problems.Count();
List<Problem> sheet = new List<Problem>(n);
Random rand = new Random();
while(sheet.Count < n) {
    var next = problems.OrderBy(p => p.ID).Skip(rand.Next(count))
        .FirstOrDefault();
    if(next != null && !sheet.Any(p => p.ID == next.ID) {
        sheet.Add(next);
    }
}

?

Marc Gravell
  • 1,026,079
  • 266
  • 2,566
  • 2,900
  • 1
    +1 and I think a `HashSet` is better? No `if` condition, just `Add` it. – Cheng Chen Jan 18 '11 at 07:14
  • 1
    @TheSmartest you could hash the ids, but you'd still need something like `if(uniqueIDs.Add(next.ID)) sheet.Add(next)` - but indeed, probably faster – Marc Gravell Jan 18 '11 at 07:16
  • 1
    @Marc: Is skipping records expensive because I think (I might be wrong) skipping itterates over a huge of records? – Second Person Shooter Jan 18 '11 at 07:21
  • 1
    @xport - no, that can be done at the db - run it through a SQL trace - it should map to something involving `ROW_NUMBER() OVER(ORDER BY ID)` – Marc Gravell Jan 18 '11 at 07:25
  • 2
    For example (TSQL, not LINQ): `select x.* from ( select *, ROW_NUMBER() OVER (ORDER BY Id) as [__row] from Customers) x where x.__row BETWEEN 1000 AND 1010` – Marc Gravell Jan 18 '11 at 07:26
  • Can you do this without N trips to the DB? – Gabe Jan 18 '11 at 07:35
  • @Gabe trickier; not with pure LINQ you can't; however, you could do it with a table-variable; I'll add another version (as a separate answer, as IMO it is)... – Marc Gravell Jan 18 '11 at 07:42
  • Store the ordered list outside of the loop, instead of repeatedly sorting it. Lets save some cycles :) – Mahesh Velaga Jan 18 '11 at 10:00
  • @Mahesh there **is no** ordered list in the example; the `OrderBy` in the example is an instruction to LINQ and is necessary to get predictable results. – Marc Gravell Jan 18 '11 at 11:55
  • @Marc I was looking to split the `problems.OrderBy(x => x)` and move it out of the `while` loop. As long as its IEnumerable this probably wouldn't matter, but if we convert this to list outside the loop, wouldn't it be better? – Mahesh Velaga Jan 18 '11 at 18:41
  • @Manesh - no, strangely enough loading a million records won't make this better. It **isn't** (just) IEnumerable; it is IQueryable; the OrderBy is being "composed" into a DB query that **only** loads n records (one per iteration). So if n=50, we might load 51 records instead of 1 million. I know which I prefer. – Marc Gravell Jan 18 '11 at 19:21
2

Generating a random permutation of Problems and selecting first N elements is what you need.

Problems.OrderBy(n => Guid.NewGuid()).Take(N)

But this query is not a best way to do this job, as you don't want to generate million Guid's just to take N elements.

There is no simple linq query to do this job, but you can first generate N random numbers in [0,Problems.Count()) and then use that to pick problems.

    IEnumerable<int>  GenrateIDs(int max)
    {
        Random rand = new Random();
        HashSet<int> IDs = new HashSet<int>();
        while (IDs.Count < max)
        {
            IDs.Add(rand.Next(max));
        }
        return IDs.AsEnumerable();
    }
Rozuur
  • 4,115
  • 25
  • 31
  • Yes, I've [used that before](http://stackoverflow.com/questions/648196/random-row-from-linq-to-sql), but this can get expensive en-masse – Marc Gravell Jan 18 '11 at 07:10
0

Here's a TSQL version to reduce the round-trips, using LINQ-to-SQL's ExecuteQuery<T> method:

List<int> ids = db.ExecuteQuery<int>(@"
declare @n int = {0}, @count int, @offset int, @id int
declare @ids table (pos int identity(1,1) not null, id int not null)
set @count = (select COUNT(1) from Problems)
if @count < @n set @n = @count
while @n > 0
begin
    set @offset = 1 + CAST(FLOOR(RAND() * @count) as int)
    select @id = (
                    select Id from (select Id,
                    ROW_NUMBER() over (order by Id) as __row
        from Problems) x where x.__row = @offset)
    if not exists (select 1 from @ids where id = @id)
    begin
        set @n = @n - 1
        insert @ids (id) values (@id)
    end
end
select id from @ids order by pos", n).ToList();
Marc Gravell
  • 1,026,079
  • 266
  • 2,566
  • 2,900
  • I think it would be much simpler to generate the unique random numbers in C# and just write a single SQL query with something like `WHERE x.__row IN (" + string.Join(",", randomNumbers) + ")` – Gabe Jan 18 '11 at 08:12
  • @Gabe - you'd need two round trips for that - one to get the count so you know how big the numbers can be – Marc Gravell Jan 18 '11 at 08:21
  • Yes, that's true. I was just thinking that 2 is still better than N. – Gabe Jan 18 '11 at 08:35