1

I'm using EntityFramework 6 and I make Linq queries from Asp.NET server to a azure sql database.

I need to retrieve the latest 20 rows that satisfy a certain condition

Here's a rough example of my query

using (PostHubDbContext postHubDbContext = new PostHubDbContext())
{
    DbGeography location = DbGeography.FromText(string.Format("POINT({1} {0})", latitude, longitude));

    IQueryable<Post> postQueryable =
        from postDbEntry in postHubDbContext.PostDbEntries
        orderby postDbEntry.Id descending
        where postDbEntry.OriginDbGeography.Distance(location) < (DistanceConstant)
        select new Post(postDbEntry);

    postQueryable = postQueryable.Take(20);
    IOrderedQueryable<Post> postOrderedQueryable = postQueryable.OrderBy(Post => Post.DatePosted);

    return postOrderedQueryable.ToList();
}

The question is, what if I literally have a billion rows in my database. Will that query brutally select millions of rows which meet the condition then get 20 of them ? Or will it be smart and realise that I only want 20 rows hence it will only select 20 rows ?

Basically how do I make this query work efficiently with a database that has a billion rows ?

Erik Philips
  • 53,428
  • 11
  • 128
  • 150
Mohamed Heiba
  • 1,813
  • 4
  • 38
  • 67
  • Simple: If you have a proper index, EF should be wise enough to use it. – Kenan Zahirovic May 31 '14 at 21:42
  • I do have an Id column which is the primary key and is an index with seed 1, increment 1. – Mohamed Heiba May 31 '14 at 21:46
  • that said, is the query gonna realise the .Take(20) in the query or will it brutally select all rows that match query then take 20 – Mohamed Heiba May 31 '14 at 21:47
  • 1
    Create another index on DatePosted field, descending order. – Kenan Zahirovic May 31 '14 at 21:51
  • 1
    1) Take test DB 2) Enter query 3) Check execution plan. Those are trivial steps that you didn't perform. Tuning the query can be tricky but you should do at least basic things yourself. – Andrey May 31 '14 at 21:52
  • hmm that's an idea to consider, but then what number would I start from ? 99 billions ? or do i start with 0 and give it -ve numbers – Mohamed Heiba May 31 '14 at 21:52
  • actually that won't matter since entity framework will always query the database from top to bottom i think, i.e. it will always iterate through the rows from start till end, correct me if I'm wrong – Mohamed Heiba May 31 '14 at 21:53
  • @Andrey I'm developing against a local DB in debug environment. I will check execution plan but I was assuming queries will always work in a particular way – Mohamed Heiba May 31 '14 at 21:58
  • [Have you built a Geography Spatial Index](http://technet.microsoft.com/en-us/library/bb964714(v=sql.105).aspx)? – Erik Philips May 31 '14 at 22:16
  • @ErikPhilips not yet, but could you please explain how would that help my query run faster ? – Mohamed Heiba May 31 '14 at 22:19
  • 1
    [Long Explaination](http://www.mssqltips.com/sqlservertip/1976/sql-server-2008-spatial-index-performance/). The short explanation is that without a spatial index, the server MUST perform a distance calculation on EVERY single row. With an INDEX, the server can calculate which cells are outside the distance and exclude ALL rows in that cell. [Details on now Spatial Indexes actually work](http://sqlbits.com/Sessions/Event5/creating_high_performance_spatial_databases). – Erik Philips May 31 '14 at 22:37
  • 1
    FYI, when using entity framework if you do `postOrderedQueryable.ToString()` the string it returns is the SQL query it is going to generate on the server. – Scott Chamberlain May 31 '14 at 23:24

3 Answers3

2

According to http://msdn.microsoft.com/en-us/library/bb882641.aspx Take() function has deferred streaming execution as well as select statement. This means that it should be equivalent to TOP 20 in SQL and SQL will get only 20 rows from the database.

This link: http://msdn.microsoft.com/en-us/library/bb399342(v=vs.110).aspx shows that Take has a direct translation in Linq-to-SQL.

So the only performance you can make is in database. Like @usr suggested you can use indexes to increase performance. Also storing the table in sorted order helps a lot (which is likely your case as you sort by id).

keiv.fly
  • 3,343
  • 4
  • 26
  • 45
1

Why not try it? :) You can inspect the sql and see what it generates, and then look at the execution plan for that sql and see if it scans the entire table

Check out this question for more details How do I view the SQL generated by the Entity Framework?

Community
  • 1
  • 1
aL3891
  • 6,205
  • 3
  • 33
  • 37
  • I can't try it because I'm not gonna simply generate a billion rows just to test it. – Mohamed Heiba May 31 '14 at 21:44
  • Also I'm asking a theoretical question about how .Take works with an SQL query. I am already ordering by id descendinly and then taking the first 20 rows that match a criteria, what value is you answer adding – Mohamed Heiba May 31 '14 at 21:46
  • but you can see the sql and then look at the query plan, by doing that you can see if it scans the entire table or not – aL3891 May 31 '14 at 21:47
  • well I was hoping someone would tell me without me having to find the answer myself. how do i look at the query plan ? – Mohamed Heiba May 31 '14 at 21:49
  • no one at SO will be able to tell you how your program works. i updated the awnser with more info on viewing the sql – aL3891 May 31 '14 at 21:51
1

This will be hard to get really fast. You want an index to give you the sort order on Id but you want a different (spatial) index to provide you with efficient filtering. It is not possible to create an index that fulfills both goals efficiently.

Assume both indexes exist:

If the filter is very selective expect SQL Server to "select" all rows where this filter is true, then sorting them, then giving you the top 20. Imagine there are only 21 rows that pass the filter - then this strategy is clearly very efficient.

If the filter is not at all selective SQL Server will rather traverse the table ordered by Id, test each row it comes by and outputs the first 20. Imagine that the filter applies to all rows - then SQL Server can just output the first 20 rows it sees. Very fast.

So for 100% or 0% selectivity the query will be fast. In between there are nasty mixtures. If you have that this question requires further thought. You probably need more than a clever indexing strategy. You need app changes.

Btw, we don't need an index on DatePosted. The sorting by DatePosted is only done after limiting the set to 20 rows. We don't need an index to sort 20 rows.

usr
  • 168,620
  • 35
  • 240
  • 369
  • good point, @usr. I guess you are right, index on DatePosted just for DatePosted isn't much of help. It could be helpful when combined with some other field in search criteria. – Kenan Zahirovic May 31 '14 at 22:05
  • I agree: if we have a table with billions of records, then we need to make a serious change in application. Also, I would consider partitioned tables/index, but it depends on DBMS. – Kenan Zahirovic May 31 '14 at 22:10
  • @KenanZahirovic partitioning is not a performance feature for queries. It usually hurts there. It is a management feature, for exmaple used to quickly load and delete bulk data. – usr May 31 '14 at 22:12
  • @usr ok so basically you're suggesting to make a spatial index column, am I right ? but then how would I use that to make the query faster ? will sort by spatial column or sth ? – Mohamed Heiba May 31 '14 at 22:13
  • @usr how does sql know whether a filter is selective or not very selective ? in my query I have 2 where conditions, does that mean it is selective ? – Mohamed Heiba May 31 '14 at 22:14
  • @Dv_MH I'm not an expert on SQL Server spatial. I believe it can be used for distance filtering, though. I can only recommend reading the docs and searching for "SQL Server filter by distance" to see how to speed this up. – usr May 31 '14 at 22:14
  • @Dv_MH SQL Server keeps statistics about the actual values in your columns. It tries to estimate/guess at compile time how many rows will be seen at runtime. – usr May 31 '14 at 22:15
  • @usr ok thanks for all the info so far btw. one last thing, Kenan suggested that I make another index column which is descending, would that work ? or does sql always starts quering from the beginning rows till end, what I would love is for sql linq query to query from the end rows first – Mohamed Heiba May 31 '14 at 22:18
  • 1
    SQL Server can scan an index both directions. There are a few limitations of the query processor here as this is not a well-tested feature. You better make a test. If the test comes out ok, an asc index is fine.; If you want to go beyond solving this single problem you should look into query tuning in general. It is not *that* hard to get a grasp on. A valuable skill. – usr May 31 '14 at 22:20
  • The simplest test is to write SQL query, something like: "select top 20 * from table order by DatePosted desc". First try without index (desc order) and then try with index. – Kenan Zahirovic May 31 '14 at 22:29