0

How costly would SELECT One, Two, Three be compared to SELECT One, Two, Three, ..... N-Column

If you have a SQL query that has two or three tables joined together and is retrieving 100 rows of data, does performance have anything to say about whether I should be selecting only the columns I need? Or should I write a query that just yanks all the columns?

If possible, could you help me understand what aspects of a query would be relatively costly compared to one another? Is it the joins? Is it the large number of records pulled? Is it the number of columns in the select statement?

Would 1 record vs 10 records vs 100 records matter?

John Hartsock
djmc

10 Answers

2

As an extremely generalized version of ranking those factors you mention in terms of performance penalty and occurrence in the queries you write, I would say:

  1. Joins - Especially when joining on tables with no indexes for the fields you're joining on and/or with tables that have a very large amount of data.
  2. # of Rows / Amount of Data - Again, indexes mitigate this quite a bit, just make sure you have the right ones.
  3. # of Fields - I would say the # of fields in the SELECT clause impacts performance the least in most situations.

I would say any performance-driving property is always coupled with how much data you have - sure a join might be fast when your tables have 100 rows each, but when millions of rows are in the tables, you have to start thinking about more efficient design.
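
As a minimal sketch of the first point (Customers and Orders here are hypothetical tables; a foreign key column gets no index by default):

    -- Index the foreign key the join runs on; without it, SQL Server may
    -- have to scan Orders for every matching customer.
    CREATE NONCLUSTERED INDEX IX_Orders_CustomerID ON Orders (CustomerID);

    -- The join can now seek on the index instead of scanning Orders.
    SELECT c.CustomerID, o.OrderDate
    FROM Customers c
    JOIN Orders o ON o.CustomerID = c.CustomerID;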

chucknelson
  • ok great, this is exactly the type of answer I was looking for. Yes you're generalizing, but it's definitely a good starting point for me to go on. thank you very much – djmc Nov 17 '10 at 21:26
2

Several things impact the cost of a query.

First, are there appropriate indexes for it to use? Fields that are used in a join should almost always be indexed, and foreign keys are not indexed by default; the designer of the database must create them. Fields used in the where clauses often need indexes as well.

Next, is the where clause sargable; in other words, can it use the indexes even if you have the correct ones? A bad where clause can hurt a query far more than joins or extra columns. You can't get anything but a table scan if you use syntax that prevents the use of an index, such as:

LIKE '%test'
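
For contrast, a quick sketch (Products is a hypothetical table): if the requirement actually allows matching from the start of the string, the predicate becomes sargable.

    -- Non-sargable: the leading % defeats any index on Name, forcing a scan.
    SELECT ProductID FROM Products WHERE Name LIKE '%test';

    -- Sargable: a trailing wildcard lets the optimizer seek an index on Name.
    SELECT ProductID FROM Products WHERE Name LIKE 'test%';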

Next, are you returning more data than you need? You should never return more columns than you need, and you should not be using select * in production code: it does additional work to look up the columns, and it is fragile and prone to bad bugs as the structure changes over time.

Are you joining to tables you don't need to be joining to? If a table returns no columns in the select, is not used in the where, and doesn't filter out any records when the join is removed, then you have an unnecessary join and it can be eliminated. Unnecessary joins are particularly prevalent when you use a lot of views, especially if you make the mistake of calling views from other views (which is a big performance killer for many reasons). Sometimes, if you trace through these views that call other views, you will see the same table joined to multiple times when it would not have been necessary if the query had been written from scratch instead of using a view.

Not only does returning more data than you need cause SQL Server to work harder, it causes the query to use up more of the network resources and more of the memory of the web server if you are holding the results in memory. It is an all-around poor choice.

Finally, are you using known poorly performing techniques when a better one is available? These include the use of cursors when a set-based alternative is better, the use of correlated subqueries when a join would be better, the use of scalar user-defined functions, and the use of views that call other views (especially if you nest more than one level). Most of these poor techniques involve processing row-by-agonizing-row, which is generally the worst choice in a database. To properly query databases you need to think in terms of data sets, not one row at a time.
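
As one sketch of the correlated-subquery point (Customers and Orders are hypothetical tables; the optimizer can sometimes rewrite these itself, but the set-based form is the safer habit):

    -- Correlated subquery: logically evaluated once per customer row.
    SELECT c.CustomerID,
           (SELECT MAX(o.OrderDate)
            FROM Orders o
            WHERE o.CustomerID = c.CustomerID) AS LastOrder
    FROM Customers c;

    -- Set-based equivalent: one join and one grouping over the whole set.
    SELECT c.CustomerID, MAX(o.OrderDate) AS LastOrder
    FROM Customers c
    LEFT JOIN Orders o ON o.CustomerID = c.CustomerID
    GROUP BY c.CustomerID;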

There are plenty more things that affect the performance of queries and the database; to truly get a grip on this subject you need to read some books on it. It is too complex to fully discuss on a message board.

HLGEM
  • Yeah, it's definitely much too complex to come up to speed on just from a message board, but I totally appreciate your comprehensive answer. It points out a lot of things I will keep in mind or research further. Thanks a lot – djmc Nov 17 '10 at 21:44
1

Or should I write a query that just yanks all the columns?

No. Just today there was another question about that.

If possible, could you help me understand what aspects of a query would be relatively costly compared to one another? Is it the joins? Is it the large number of records pulled? Is it the number of columns in the select statement?

Any useless join or data retrieval costs you time and should be avoided. Retrieving rows from a datastore is costly. Joins can be more or less costly depending on the context, the indexes defined, and so on. You can examine the query plan of each query to see the estimated cost of each step.
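
In SQL Server, for example, you can ask for the estimated plan as text instead of running the query (the table names here are hypothetical):

    -- Return the estimated plan, with per-operator cost estimates,
    -- instead of executing the statement.
    SET SHOWPLAN_ALL ON;
    GO
    SELECT o.OrderID, c.Name
    FROM Orders o
    JOIN Customers c ON c.CustomerID = o.CustomerID;
    GO
    SET SHOWPLAN_ALL OFF;
    GO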

littlegreen
  • I understand that SELECT * would be a perf hit compared to pulling specific fields, but it is almost a worthless answer if I find out there are other aspects of the SQL query that have a much larger perf hit. I definitely appreciate the advice on retrieving rows and making sure I index my joins. I'm weighing the performance gains of selecting specific fields against the productivity I might gain by selecting it all and filtering/ignoring the data in my C# code. I will update my original question in this regard. Feel free to weigh in if I'm making a bad choice. – djmc Nov 17 '10 at 21:34
  • What I don't understand is what you would gain by doing `SELECT *`. There isn't really an argument to support it. I don't think the performance hit will be that great, but it really depends on the particular situation. The only reasonable advice I can give you is to test it and to examine the query plans. Good luck – littlegreen Nov 17 '10 at 23:39
  • Well... picture what NHibernate or some ORM mappers do. You don't get a list of partial objects; it's always an instance of the object with all its fields defined, even if you are only pulling an instance from the database to query the name and nothing else. I want to write less code by replacing all these individual queries I have that pull records from the same table but differ only in what fields they're retrieving. Less T-SQL for me to write with 1 single master SELECT *. – djmc Nov 18 '10 at 01:41
1

Selecting more columns/rows will have some performance impacts, but honestly why would you want to select more data than you are going to use anyway?

If possible, could you help me understand what aspects of a query would be relatively costly compared to one another?

Build the query you need, THEN worry about optimizing it if the performance doesn't meet your expectations. You are putting the cart before the horse.

JohnFx
  • Actually, I was hoping the number of columns selected would have an insignificant performance impact... that way I could just select all the data. Then I don't have to worry about whether or not I'm using only a subset of those columns. In this regard, when you say "why would you want to select more data than you are going to use anyway", isn't that the same as putting the cart before the horse? – djmc Nov 17 '10 at 21:25
  • If you don't care about performance then go ahead, but you seem to be concerned about it. Not sure why you think selecting more columns would be free in terms of resources though. 1) SQL has to pull them down the wire 2) You are very likely to thwart SQL's ability to use covering indexes to speed up the data pulls. Of course the main point is, if you make it do more work, of course it will be slower. – JohnFx Nov 17 '10 at 21:36
  • No you're right. I guess I wasn't thinking clearly about the actual data being sent through the pipe.. and I was focused too much on the aspect of merely looking up the data. Thank you very much for the answer. – djmc Nov 17 '10 at 22:01
0

To answer the following:

How costly would SELECT One, Two, Three be compared to SELECT One, Two, Three, ..... N-Column

This is not a matter of select performance but of the amount of time it takes to fetch the data. Select * from Table and Select ID from Table perform the same lookup, but fetching the data will take longer for the wider select. This goes hand in hand with the number of rows returned by a query.
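
A rough way to see this for yourself in SQL Server (MyTable is a placeholder):

    -- Report logical reads and elapsed time for each statement.
    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;

    SELECT ID FROM MyTable;  -- narrow fetch
    SELECT *  FROM MyTable;  -- same rows, more bytes to read and send

    SET STATISTICS IO OFF;
    SET STATISTICS TIME OFF;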

As for understanding performance, here is a good link:

http://www.dotnetheaven.com/UploadFile/skrishnasamy/SQLPerformanceTunning03112005044423AM/SQLPerformanceTunning.aspx

Or Google "tsql performance".

John Hartsock
  • The `SELECT *` syntax has the additional disadvantage of breaking after an `ALTER TABLE ADD COLUMN`. – dan04 Nov 17 '10 at 17:17
  • I'll definitely dig into your linked article and read it. But I guess now that I consider the question, it does seem a bit obvious that a larger number of columns would also affect the query badly just due to the nature of I/O performance spitting out more data.. – djmc Nov 17 '10 at 21:56
  • "Alter table add" will usually work... But try to remove one ;) – Heiko Hatzfeld Nov 17 '10 at 22:16
0

Joins have the potential to be expensive. In the worst case scenario, when no indexes can be used, they require O(M*N) time, where M and N are the number of records in the tables. To speed things up, you can CREATE INDEX on columns that are part of the join condition.

The number of columns has little effect on the time required to find rows, but slows things down by requiring more data to be sent.

dan04
  • ok great, I think I have a bit better perspective now that I'm thinking about how it's obviously sending more data down the pipe... I guess I'll have to consider that increase in data relative to how much data would be sent if I selected only specific columns. – djmc Nov 17 '10 at 21:58
0

What others are saying is all true.

But typically, if you are working with tables that already have good indexes, what matters most for performance is what goes into the WHERE clause. There you have to worry about using a field that has no index or writing a predicate that can't be optimized.

Mark SQLDev
0

The difference between SELECT One, Two, Three FROM ... and SELECT One,...,N FROM ... could be like the difference between day and night. To understand the problem, you need to understand the concept of a covering index:

A covering index is a special case where the index itself contains the required data field(s) and can return the data.

As you add more unnecessary columns to the projection list, you are forcing the query optimizer to look up the newly added columns in the 'table' (really in the clustered index or in the heap). This can change an execution plan from an efficient narrow index range scan or seek into a bloated clustered index scan, which can result in run times ranging from sub-second to hours, depending on your data. So projecting unnecessary columns is often the most impactful factor in a query.

The number of records pulled is a more subtle issue. With a large number, a query can hit the index tipping point and choose, again, a clustered index scan over a narrower index range scan and lookup. Now the fact that lookups into the clustered index are necessary to begin with means the narrow index is not covering, which may itself be caused by projecting unnecessary columns.

And finally, joins. The question here is: joins, as opposed to what? If a join is required, there is no alternative, and that's all there is to say about this.

Ultimately, query performance is driven by one factor alone: the amount of IO. And the amount of IO is driven ultimately by the access paths available to satisfy the query; in other words, by the indexing of your data. It is impossible to write efficient queries on bad indexes. It is possible to write bad queries on good indexes, but more often than not the optimizer can compensate and come up with a good plan. You should spend all your effort in better understanding index design.
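
A sketch of the covering-index idea (Orders is a hypothetical table; INCLUDE is SQL Server syntax):

    -- A narrow index that covers the first query below: the key column and
    -- the projected columns are all stored in the index itself.
    CREATE NONCLUSTERED INDEX IX_Orders_CustomerID_Covering
        ON Orders (CustomerID)
        INCLUDE (OrderDate, Total);

    -- Covered: answered by an index seek, no clustered-index lookups.
    SELECT OrderDate, Total
    FROM Orders
    WHERE CustomerID = 42;

    -- Not covered: the extra ShipAddress column forces a lookup per row,
    -- or tips the optimizer into a clustered index scan.
    SELECT OrderDate, Total, ShipAddress
    FROM Orders
    WHERE CustomerID = 42;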

Remus Rusanu
  • wow, your answer definitely stands out from other people's because it is much more critical about selecting too many columns. I will definitely take your advice to heart and do some more research here. I assume your advice is given even for when I am selecting all the columns in just 1 table (as opposed to multiple tables joined). – djmc Nov 17 '10 at 21:53
  • The advice is given for one table or for many tables. Another angle to consider is that even if you do *not* have a covering index in mind, by adding unnecessary columns you are also blocking *later* add of a column (eg. a DBA that is asked to optimize the DB your app runs on cannot do anything useful because of your extra fields). There is also a client coding side effect: when considering this problem, LINQ queries should project new anonymous types with only the fields needed http://www.sqlskills.com/BLOGS/BOBB/post/MHO-LINQ-to-SQL-and-Entity-Framework-Panacea-or-evil-incarnate-Part-2.aspx – Remus Rusanu Nov 17 '10 at 22:24
  • Thanks. Definitely a lot of good stuff to research here.. I'll hold off before making some architecture decisions that I havent fully researched. – djmc Nov 18 '10 at 01:44
0

Short answer: Don't select more fields than you need. Search for "*" in both your source code and your stored procedures ;)

You always have to consider which parts of the query will cause which costs.

If you have a good DB design, joining a few tables is usually not expensive. (Make sure you have the correct indexes.)

The main issue with "select *" is that it will cause unpredictable behavior in your results. If you write a query like that and access the fields by column index, you will be locked into the DB schema forever.

Another thing to consider is the amount of data you move. You might think it's trivial, but then version 2.0 of your application suddenly adds a ProfilePicture to the User table, and now the query that selects 100 users will suddenly use up several megabytes of bandwidth.

The second thing you should consider is the number of rows you return. SQL is very powerful at sorting and grouping, so let SQL do its job and don't move it to the client. Limit the number of records you return. In most applications it makes no sense to return more than 100 rows to a user at once. You might let the user choose to load more, but make it a choice they have to make.
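
A sketch of capping the result set on the server (Users is a hypothetical table):

    -- Let SQL Server sort and limit; return only the first 100 users.
    SELECT TOP (100) UserID, UserName
    FROM Users
    ORDER BY UserName;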

Finally, monitor your SQL Server. Run a profiler against it and try to find your worst queries. A SQL query should not take longer than half a second; if it does, something is most likely messed up. (Yes, there are operations that can take much longer, but those should have a reason.)

Edit: Once you have found the slow query, look at the execution plan. You will see which parts of the query are expensive and which parts work well. The optimizer is also a tool that can be used.

Heiko Hatzfeld
  • I definitely agree that SELECT * would lead to some unpredictable results, and the scenario with the ProfilePicture definitely highlights one of those. However, I'm weighing my time as a developer... and I'm wondering if it's worth creating several individual queries compared to one general SELECT * type query. With a SELECT * query, I can decide to use it or ignore it in C# code. Whereas with individual SQL queries, I have to maintain both the C# and the SQL query... It's kind of a "developer productivity" versus "query performance" type of question. – djmc Nov 17 '10 at 21:39
  • It's not that hard to select the fields you want. If you have 2 similar queries, it might be OK to add one or 2 extra fields to cover some more bases, but you need to be aware that this will lead to maintenance issues later... Remember that only 20% of the time is spent developing an app, and 80% maintaining it ;) – Heiko Hatzfeld Nov 17 '10 at 22:11
  • hehe, good point... I guess I feel like a jerk trying to save myself some time and end up creating more work for others down the road... I guess I can just push for more time right now from management – djmc Nov 17 '10 at 22:17
0

I suggest you consider your queries in terms of I/O first. Disk I/O on my SATA II system is 6 Gb/sec (about 0.75 GB/sec); my DDR3 memory bandwidth is 12 GB/sec. So I can move items in memory roughly 16 times faster than I can retrieve them from disk. (Ref: Wikipedia and Tom's Hardware)

The difference between getting a few columns and getting all the columns for your 100 rows could be the difference between reading a single 8K page from disk and reading two or more pages from disk. Once the pages are finally in memory, moving two columns or all columns to a hash table is faster than any measuring tool I have can detect.

I value the advice of the others on this topic related to database design. Designing narrow indexes, using included columns to make covering indexes, avoiding table or index scans in favor of seeks with an appropriate WHERE clause, keeping primary keys narrow, etc., is the difference between having a DBA title and being a DBA.

RC_Cleland
  • I'm definitely not a dba so I respect your guys' advice. Definitely glad I asked here as I don't think I had things straight in my mind as to where the performance bottlenecks were in a sql query. I was much more focused on lookups than I/O. Thanks – djmc Nov 18 '10 at 17:37