I have a table with a few million records in it. I want to get a sampling, where I return every nth record in the table. My current solution looks like this:

myQuery.Where((rec, index) => index % interval == 0);

However Linq to Entities does not support this and throws an exception. I also tried this:

myQuery.Select((rec, index) => new { Index = index, Record = rec })
       .Where(x => x.Index % interval == 0);

But Linq to Entities does not support that either.

Even if it did work that way, it doesn't seem very elegant. Is there another way of doing this?

Christofer Eliasson
Phil
  • pull all PK ids via myQuery.Select(c=>c.id).ToList() and then for loop with interval in the increment? – Aaron Anodide Dec 19 '11 at 20:56
  • 1
    @Gabriel. Your going to fetch a few million records and loop though it. That would be way to inefficient. – John Hartsock Dec 19 '11 at 20:58
  • I was thinking (but am not sure) that Entity Framework isn't going to generate SQL to do that, and maybe a few million ints isn't that much on a modern computer. – Aaron Anodide Dec 19 '11 at 20:59
  • @Gabriel...let me know how that works out for you. – John Hartsock Dec 19 '11 at 21:01
  • 1
    similar to http://stackoverflow.com/questions/4208366/linq-to-sql-every-nth-row-from-table – hatchet - done with SOverflow Dec 19 '11 at 21:04
  • @JohnHartsock What if you loop over the records directly without doing `ToList`? Then at least you don't need to keep all the entities in memory. Wouldn't that be like opening a stream to the table? – Magnus Dec 19 '11 at 21:06
  • @Magnus Sure, but with millions of records it would still take forever. If I can get the database to find every nth row, it would speed the query up quite a lot. – Phil Dec 19 '11 at 21:09
  • @Phil You're probably going to have to write a SQL statement using `row_number` in that case; I don't think there is a LINQ solution. Although `row_number` is going to need a temp table internally, so it will probably take some time as well. – Magnus Dec 19 '11 at 21:13
  • @Phil I have a case where I need to fetch 2.3 million rows from the DB, I used a DataReader and it only took ~6 sec. But it will depend on what kind of data it is. – Magnus Dec 19 '11 at 21:15

5 Answers

There is a similar LINQ to SQL discussion here: LINQ to SQL Every Nth Row From Table; check it out.

I think SQL is really the way to go. In EF, you can put your SQL in a DefiningQuery.
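For reference, a raw SQL sketch of the every-nth-row query using `ROW_NUMBER`, along the lines discussed in the comments. The table name `MyTable`, key column `Id`, and the interval value are placeholders, not taken from the question:

```sql
-- Every nth row, numbering the rows by Id.
-- MyTable and Id are placeholder names; 50 is the sampling interval.
SELECT *
FROM (
    SELECT *, ROW_NUMBER() OVER (ORDER BY Id) AS rn
    FROM MyTable
) AS numbered
WHERE numbered.rn % 50 = 0;
```

Because `ROW_NUMBER` assigns positions against an explicit `ORDER BY`, this gives a well-defined nth record, which the LINQ indexed overloads cannot express in LINQ to Entities.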

Maxim Zabolotskikh

It doesn't seem like you've defined an order by which n would be the index of a record. Since data in a table is unordered, without specifying an order the record number is meaningless, and so the nth record is undefined. Does that mean you just want a certain random percentage of the records? If so, taking a one-percent sample of the data would look like this:

Customer[] onePercentSample = db.Customers.Take(db.Customer.Count() / 100).ToArray();

This is not actually guaranteed to be a random sample. It simply returns an undefined sample of all the records in the database, which may or may not be the same set across several queries. Again, this is due to the nature of a table being an unordered record set.

Allon Guralnek

Try ordering by `index % interval`. If that works, you can take the first interval/total records as your sample.

zmbq
  • Problem is, I can't get the index. Linq to Entities doesn't like `Where` or `Select` methods that include the index as a parameter. – Phil Dec 19 '11 at 21:00
  • Oh, I thought Index was part of your entity. Is there another entity field you can use for this, then? – zmbq Dec 19 '11 at 21:02
  • Nothing that would produce a nice, uniform sample. A random sample, sure. But not uniform. – Phil Dec 19 '11 at 21:05
  • In that case, you can just return all the entities and then do the iteration yourself. – zmbq Dec 19 '11 at 21:06

The direct SQL way is probably how I would go, but if you have sequential Ids you can leverage Min and Max:

Note: I tested this in LINQPad, which uses LINQ to SQL, but I think the result will be the same in EF.

var a = ReceivedPayments.Select(c => c.Id);
int interval = 50;
var ids = new List<int>();
int min = a.Min();
int max = a.Max();
// Collect every multiple of the interval within the id range (inclusive of max).
for (int i = min; i <= max; i++)
{
    if ((i % interval) == 0)
    {
        ids.Add(i);
    }
}

var b = ReceivedPayments.Where(c => ids.Contains(c.Id));
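The loop above only ever keeps multiples of the interval, so the id list can be built directly rather than by testing every integer in the range. This is a sketch under the same assumption of sequential ids; it assumes at least one multiple of the interval falls between `min` and `max`:

```csharp
// First multiple of `interval` at or above min, then step by `interval`.
int first = (min % interval == 0) ? min : min + (interval - min % interval);
var ids = Enumerable.Range(0, (max - first) / interval + 1)
                    .Select(i => first + i * interval)
                    .ToList();
```

Either way, note that `Contains` translates to a SQL `IN` clause with one parameter per id, which is what produces the long parameter list shown below and can hit provider limits for large samples.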

SQL generated:

SELECT MIN([t0].[Id]) AS [value]
FROM [ReceivedPayments] AS [t0]
GO

SELECT MAX([t0].[Id]) AS [value]
FROM [ReceivedPayments] AS [t0]
GO

-- Region Parameters
DECLARE @p0 Int = 50
DECLARE @p1 Int = 100
DECLARE @p2 Int = 150
-- ... @p3 through @p122 elided (one parameter per id, 50 apart) ...
DECLARE @p123 Int = 6200
-- EndRegion
SELECT [t0].[Id], [t0].[TxnID], [t0].[TxnDate], [t0].[TotalAmount], [t0].[Memo], [t0].[AppliedToTxnTxnID], [t0].[AppliedToTxnTxnType], [t0].[AppliedToTxnAmount], [t0].[FQPrimaryKey], [t0].[RefNumber], [t0].[ARAccount_Id], [t0].[Customer_Id]
FROM [ReceivedPayments] AS [t0]
WHERE [t0].[Id] IN (@p0, @p1, @p2, /* ... @p3 through @p122 elided ... */ @p123)
Aaron Anodide

You could always do it with Skip and Take. I bet it's horribly inefficient, though.

int n = 10;
int currentIndex = 0;
yourEntity current;
List<yourEntity> lstSampling = new List<yourEntity>();

// LINQ to Entities requires an explicit ordering before Skip;
// ordering by a key column is assumed here.
while ((current = context.yourEntities.OrderBy(e => e.Id).Skip(currentIndex).Take(1).FirstOrDefault()) != null)
{
    lstSampling.Add(current);
    currentIndex += n;
}
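Along the lines of the streaming idea from the comments, an alternative is to issue a single ordered query and apply the modulo filter client-side as the rows stream in. A sketch (still reads every row from the database, but in one round trip instead of one query per sample; `Id` as the ordering key is an assumption):

```csharp
int n = 10;
// AsEnumerable() switches evaluation to LINQ to Objects, so the
// indexed Where overload is allowed; rows stream through the
// enumerator rather than being materialized all at once by Skip/Take loops.
var sample = context.yourEntities
                    .OrderBy(e => e.Id)          // ordering makes "nth" well-defined
                    .AsEnumerable()
                    .Where((e, i) => i % n == 0) // keep every nth row client-side
                    .ToList();
```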
Malcolm O'Hare