3

I need to write a query pulling distinct values from columns defined by a user for any given data set. There could be millions of rows so the statements must be as efficient as possible. Below is the code I have.

What is the order of this LINQ query? Is there a more efficient way of doing this?

var MyValues = from r in MyDataTable.AsEnumerable()
               orderby r.Field<double>(_varName)
               select r.Field<double>(_varName); 

IEnumerable result= MyValues.Distinct();
gunr2171
  • 16,104
  • 25
  • 61
  • 88
sammarcow
  • 2,736
  • 2
  • 28
  • 56
  • 1
    The best way to determine this is to run the LINQ query while SQL profiler is running. You can see the actual SQL queries, and generate an execution plan in SSMS based on that. – Codeman Oct 01 '12 at 20:56
  • What do you mean by "order of this LINQ query?" Are you wondering in which order the statements execute? – Cᴏʀʏ Oct 01 '12 at 21:00
  • 1
    @Pheonixblade9: Actually, by calling `AsEnumerable()` the SQL script is very basic. Both the `orderby` and `select` are ran locally. – Guvante Oct 01 '12 at 21:00
  • @L.B: I'm guessing his query to populate `MyDataTable` is doing a `GROUP BY` or `DISTINCT`; I'm guessing the above LINQ is a LINQ-to-Objects query. – Cᴏʀʏ Oct 01 '12 at 21:00
  • @Pheonixblade9: Linq-To-DataSet is a subset of Linq-To-Objects. – Tim Schmelter Oct 01 '12 at 21:03
  • @Cory I assume that by "order" he means order of magnitude, i.e. "big-oh notation" – Servy Oct 01 '12 at 21:06
  • @L.B Sorry I actually use MyDistinct.Distinct() and am calling that.. can you provide a way to do it in a single line? – sammarcow Oct 01 '12 at 21:06

4 Answers4

6

I can't speak much to the AsEnumerable() call or the field conversions, but for the LINQ side of things, the orderby is a stable quick sort and should be O(n log n). If I had to guess, everything but the orderby should be O(n), so overall you're still just O(n log n).

Update: the LINQ Distinct() call should also be O(n).

So altogether, the Big-Oh for this thing is still O(Kn log n), where K is some constant.

Cᴏʀʏ
  • 105,112
  • 20
  • 162
  • 194
1

Is there a more efficent way of doing this?

You could get better efficiency if you do the sort as part of the query that initializes MyDataTable, instead of sorting in memory afterwards.

Reed Copsey
  • 554,122
  • 78
  • 1,158
  • 1,373
1

from comments

I actually use MyDistinct.Distinct()

If you want distinct _varName values and you cannot do this all in the select query in dbms(what would be the most efficient way), you should use Distinct before OrderBy. The order matters here.

You would need to order all million of rows before you start to filter out the duplicates. If you use distinct first, you need to order only the rest.

var values = from r in MyDataTable.AsEnumerable()
             select r.Field<double>(_varName);
IEnumerable<double> orderedDistinctValues = values.Distinct()
                                                  .OrderBy(d => d);

I have asked a related question recently which E.Lippert answered with a good explanation when order matters and when not:

Order of LINQ extension methods does not affect performance?

Here's a little demo where you can see that the order matters, but you can also see that it does not really matter since comparing doubles is trivial for a cpu:

Time for first orderby then distinct: 00:00:00.0045379
Time for first distinct then orderby: 00:00:00.0013316
Community
  • 1
  • 1
Tim Schmelter
  • 450,073
  • 74
  • 686
  • 939
0

your above query (linq) is good if you want all the million records and you have enough memory on a 64bit memory addressing OS.

the order of the query is, if you see the underlying command, would be transalated to

Select <_varname> from MyDataTable order by <_varname>

and this is as good as it is when run on the database IDE or commandline.

to give you a short answer regarding performance

  1. put in a where clause if you can (with columns that are indexed)
  2. ensure that the user can choose colums (_varname) that are indexed. Imagine the DB trying to sort million records on an unindexed column, which is evidently slow, but endangers linq to receive the badpress
  3. Ensure that (if possible) initilisation of the MyDataTable is done correctly with the records that are of value (again based on a where clause)
  4. profile your underlying query,
  5. if possible, create storedprocs (debatable). you can create an entity model which includes storedprocs aswell

it may be faster today, but with the tablespace growing, and if your data is not ordered (indexed) thats where things get slowerr (even if you had a good linq expression)

Hope this helps

that said, if your db is not properly indexed, meaning

Krishna
  • 2,451
  • 1
  • 26
  • 31