19

I've searched around Stack Overflow, but everybody is asking how to optimize queries they've already written.

I want to know the basics: what to do and what to avoid when writing a query.

For example, it's a known fact that writing SELECT * FROM is something to avoid, given that the SQL engine has to make an "invisible" query to find out which columns should be shown.

I also know that Id BETWEEN @min_number AND @max_number works better than Id >= @min_number AND Id <= @max_number, but I don't recall why. It may be because BETWEEN is a construct the engine handles at a lower level, producing the iterations over the records in a somewhat "pre-handled" way, but I just don't know for sure.

Could someone validate those claims and make a list of the most common things to do and to avoid?

apacay
  • 1,702
  • 5
  • 19
  • 41
  • I'd argue with the 'select * from' case. Depending on how the particular SQL compiler (each database engine compiles the SQL it receives to some kind of byte-code) works it might or might not require extra work. And you didn't specify any particular database. – Jan Hudec May 02 '11 at 13:38
  • 1
    Whether `between` works better than `<` and `>` again depends on the particular SQL optimizer. With `between` it always sees it's a range, though that may or may not be of use, while with `<` and `>` it may miss that it's a range. – Jan Hudec May 02 '11 at 13:40
  • BETWEEN is not the same as `>` `<` for one... it includes equality – gbn May 02 '11 at 14:25
  • @Jan if you've got some best practice bound to a particular SQL compiler, please write it up. With `between` versus `<` and `>`, I'm considering the case where what you need is solvable by `between`. – apacay May 02 '11 at 14:35
  • @gbn Could you detail a bit more what you mean? – apacay May 02 '11 at 15:32
  • BETWEEN is `>=` ... `<=` – gbn May 02 '11 at 15:34
  • @gbn ty, perhaps my question sounded a bit foolish but english is not my native lang so... _-lify_ is a suffix I had not in my dictionary.. now I do. _Edit_: Corrected – apacay May 02 '11 at 17:56

9 Answers

16

My list is SQL Server specific (I'm sure there are lots more):

Use sargable WHERE clauses - that means, among other things, no functions (especially scalar UDFs) applied to columns in WHERE clauses.

WHERE NOT EXISTS tends to be a faster choice than a LEFT JOIN with a WHERE id IS NULL structure when you are looking for the rows that don't match a second table.
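
As a rough sketch of the two forms, using made-up table and column names:

-- LEFT JOIN / IS NULL version: rows in table1 with no match in table2
SELECT t.tid
FROM table1 t
LEFT JOIN table2 t2 ON t2.tid = t.tid
WHERE t2.tid IS NULL

-- NOT EXISTS version: usually at least as fast, and clearer about intent
SELECT t.tid
FROM table1 t
WHERE NOT EXISTS (SELECT 1 FROM table2 t2 WHERE t2.tid = t.tid)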

Correlated subqueries tend to run row by row and are horribly slow.
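
For instance, a sketch (again with made-up names) of a correlated subquery versus a set-based rewrite that aggregates once and then joins:

-- Correlated subquery: the inner query conceptually runs once per outer row
SELECT t.tid,
       (SELECT COUNT(*) FROM table2 t2 WHERE t2.tid = t.tid) AS MatchCount
FROM table1 t

-- Set-based rewrite: aggregate once, then join to the result
SELECT t.tid, COALESCE(agg.MatchCount, 0) AS MatchCount
FROM table1 t
LEFT JOIN (SELECT tid, COUNT(*) AS MatchCount
           FROM table2
           GROUP BY tid) agg ON agg.tid = t.tid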

Views that call other views can't be indexed and become very slow, especially if you get several levels deep on large tables.

SELECT * is to be avoided, especially when you have a join, as at least one column is sent twice, which is wasteful of server, database and network resources.

Cursors can usually be replaced with much faster set-based logic. When you store data in the correct way, you can avoid a lot of on-the-fly transformations.
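
A sketch of the cursor point, with the same made-up names: a row-by-row loop versus the single set-based statement that replaces it:

-- Cursor / row-by-row version
DECLARE @tid INT;
DECLARE cur CURSOR FOR SELECT tid FROM table1;
OPEN cur;
FETCH NEXT FROM cur INTO @tid;
WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE table1 SET field1 = field1 + 1 WHERE tid = @tid;
    FETCH NEXT FROM cur INTO @tid;
END;
CLOSE cur;
DEALLOCATE cur;

-- Set-based version: one statement over the whole set
UPDATE table1 SET field1 = field1 + 1;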

When updating, make sure you add a WHERE clause so that you don't update rows where the new value and the old value are the same. This could be the difference between updating 10,000,000 rows and updating 15. Sample (T-SQL UPDATE structure; if you use another db, you may have to look up the correct syntax, but it should give you the idea):

UPDATE t
SET field1 = t2.field2
FROM table1 t
JOIN table2 t2 ON t.tid = t2.tid
WHERE t.field1 <> t2.field2

Or

UPDATE t
SET field1 = @variable
FROM table1 t
WHERE t.field1 <> @variable

Check your indexing. SQL Server does not automatically index foreign keys. If they are used in a join, they generally need to be indexed.
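
A minimal sketch, assuming a hypothetical foreign key column tid on table2 that is used in joins:

-- SQL Server creates an index for the primary key, but not for the foreign key
CREATE INDEX IX_table2_tid ON table2 (tid);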

If you are constantly using functions on a field, you are probably not storing it correctly (or you should have a persisted computed column and do the transformation only once, not every time you select the column).
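
For example, a sketch of a persisted computed column, with hypothetical table and column names:

-- Store the transformed value once instead of calling the function in every query
ALTER TABLE table1 ADD field1_year AS YEAR(field1_date) PERSISTED;

-- The persisted column can then be indexed and filtered directly
CREATE INDEX IX_table1_field1_year ON table1 (field1_year);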

Your best bet is to get a good performance tuning book for your database of choice (what works best is very database-specific) and read the chapters concerning writing queries.

HLGEM
  • 94,695
  • 15
  • 113
  • 186
  • Thanks @HLGEM. Is this SQL Server specific because it doesn't happen on other engines, or because you work on SQL Server, specialize in it, and only speak from your knowledge of that engine? – apacay May 02 '11 at 14:39
  • Performance tuning is database specific. I suspect some of these will also be the same on other database engines, but I don't know, as I only work in depth with SQL Server. That is why it is important to read about performance tuning for the specific dbs you work with. And it is also why so many COTS programs which are not tied to one DB are so horribly bad at performance. – HLGEM May 02 '11 at 14:50
  • I'm in the same boat as you. I work almost exclusively with MS SQL Server. Could a community wiki be written here? I would compile everything you people have been saying. – apacay May 03 '11 at 18:43
7
  • Views are macros, not magic
  • EXISTS and NOT EXISTS usually work best
  • Functions on columns (see Joel C's answer)
  • Beware implicit conversion (eg a smallint column compared to an int parameter); see the sketch after this list
  • Understand covering indexes
  • Denormalise after you see issues
  • Understand aggregates: stop thinking of loops
  • ...
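
A small sketch of the implicit-conversion bullet above (the table, column and parameter names are made up; assume OrderStatus is declared as smallint):

-- int has higher type precedence than smallint, so the column side
-- gets implicitly converted, which can get in the way of an index seek
DECLARE @status int = 3;
SELECT OrderID FROM dbo.Orders WHERE OrderStatus = @status;

-- Declaring the parameter with the column's own type avoids the conversion
DECLARE @status_small smallint = 3;
SELECT OrderID FROM dbo.Orders WHERE OrderStatus = @status_small;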

Edit, Feb 2012:

Avoid these "Ten Common SQL Programming Mistakes"

gbn
  • 422,506
  • 82
  • 585
  • 676
  • 2
    +1 to "Stop Thinking of Loops". Set-based thinking is the hardest thing for most people new to databases. – N West May 02 '11 at 15:15
5

In your WHERE clause, avoid using a column as an input to a function, as this can cause a full table scan instead of being able to use an index. The query optimizer on some platforms does a better job than others, but it's generally better to be safe. For instance, if you're looking for records from the past 30 days, do the data manipulation against the date you're comparing against, not against your column:

BAD

WHERE DATEADD(DAY, 30, [RecordDate]) > GETDATE()

This may cause a full table scan (depending on the query optimizer for your platform), even if [RecordDate] is indexed, because DATEADD(DAY, 30, [RecordDate]) has to be evaluated for every row before it can be compared against GETDATE(). If you change it to:

BETTER

WHERE [RecordDate] > DATEADD(DAY, -30, GETDATE())

This will now always be able to use an index on [RecordDate] regardless of how good the query plan optimizer is on your platform, because DATEADD(DAY, -30, GETDATE()) gets evaluated once and can then be used as a lookup in the index. The same principle applies to using CASE expressions, UDFs, etc.

Joel C
  • 5,547
  • 1
  • 21
  • 31
4

A few general points about optimizing queries:

  • Know your data. Know your data. Know your data. I would venture to guess that half of all database performance problems stem from an incomplete understanding of the data and the requirements of the query. Know if your query will be usually returning 50 rows or 5 million rows. Know if you need to get back 3 columns or 50 columns. Know what columns are key columns on the tables, and filter on these.

  • Understand your database structure. If you're working with a database in third-normal form, recognize that this structure typically works best on queries for lots of small, transactional statements operating on individual rows. If you are working in a star or snowflake design, recognize that it's optimized for large queries and aggregations.

N West
  • 6,768
  • 25
  • 40
  • 1
    @N This is most useful! Thanks for your contribution. However, the question points to syntactic optimization. I have no doubt this is what someone should know semantically, but it isn't the answer I was looking for. – apacay May 03 '11 at 18:38
2

Here is a good link about best practices and performance on SQL Server. http://www.sql-server-performance.com/articles/dev/sql_best_practices_p1.aspx

Magnus
  • 45,362
  • 8
  • 80
  • 118
1

I can't actually validate your claim, but I can say that not using * sounds quite logical. What I can do is add a point or two. If you can, add a WHERE clause along with your SELECT columnname FROM tablename; it helps a lot, since you cut down on a lot of unnecessary rows of data that may be pulled back. Also, avoiding cross joins and preferring inner joins, outer joins or full joins has been the way to go in my personal experience :)
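
As a quick sketch of that last point, with hypothetical tables: an unconstrained cross join versus an explicit inner join:

-- Cross join: every person paired with every department
SELECT pe.Name AS PersonName, de.Name AS DepartmentName
FROM dbo.Persons pe
CROSS JOIN dbo.Departments de

-- Inner join: only the rows that actually match
SELECT pe.Name AS PersonName, de.Name AS DepartmentName
FROM dbo.Persons pe
INNER JOIN dbo.Departments de ON de.id_Person = pe.ID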

Mohammad
  • 9
  • 1
0

My simple rules for writing a query:

  1. Write the FROM clause starting with the smallest table. This helps find data more efficiently, since the search works through a smaller amount of data.

  2. Write INNER JOINs first, then LEFT OUTER JOINs. This helps decrease the number of rows the SQL engine has to search through.

    For example:

    SELECT 
        pe.Name,
        de.Name,
        bu.Name
    FROM dbo.Persons pe
    INNER JOIN dbo.Departments de ON pe.ID = de.id_Person -- at first INNER JOIN
    LEFT JOIN dbo.Bureau bu ON bu.ID = de.id_Bureau -- then LEFT OUTER JOIN
    
  3. Use aliases and the schema name to avoid schema scanning by SQL Server. Using the schema name also helps cache the query plan for ad-hoc queries, so it can be reused by other users, not only for your own queries.

  4. Avoid using SELECT * ...

StepUp
  • 36,391
  • 15
  • 88
  • 148
0

From what I've read, using BETWEEN instead of two checks on an indexed column joined with AND can improve performance because your database may not fully utilize the benefits of the index when it sees the column used on both sides of an AND or OR.

The query optimizer may not be able to intuit that this is a range check and that the index sorting can come in handy. Instead it may do a scan on each condition and then combine the results. On the other hand, this is very clear with a BETWEEN clause that compares the index column to two values.
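
As a sketch using the question's own variables (and a hypothetical table and Id column): the two predicates describe the same inclusive range, so any difference comes down to how the optimizer recognizes it:

DECLARE @min_number INT = 100, @max_number INT = 200;

-- Range written as two comparisons
SELECT Id FROM table1 WHERE Id >= @min_number AND Id <= @max_number;

-- Same inclusive range written with BETWEEN
SELECT Id FROM table1 WHERE Id BETWEEN @min_number AND @max_number;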

0

Adding some tips to the list:

Use EXISTS/NOT EXISTS in place of IN/NOT IN for indexed columns

 --instead of 
 SELECT * FROM table1
  WHERE id1 NOT IN (SELECT id2 FROM table2)

 --you better write
 SELECT * FROM table1 WHERE NOT EXISTS (SELECT 1 FROM table2 WHERE id1=id2)  

Avoid using UNION when it's possible to use UNION ALL,
i.e. when you don't need to exclude duplicate rows or you are sure the query won't return duplicates
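
A quick sketch of this one, reusing the hypothetical table1/table2 columns from above:

 --instead of
 SELECT id1 FROM table1
  UNION
 SELECT id2 FROM table2

 --when duplicates don't matter, you better write
 SELECT id1 FROM table1
  UNION ALL
 SELECT id2 FROM table2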

Avoid using HAVING when it's possible to use WHERE

 --instead of 
 SELECT col1, sum(col2) 
   FROM table1
  GROUP BY col1
 HAVING col1 > 0

 --you better write :
 SELECT col1, sum(col2)
   FROM table1
  WHERE col1 > 0
 GROUP BY col1

Use EXISTS instead of DISTINCT when you have one-to-many table joins

--instead of
SELECT distinct a.col1, a.col2
  FROM table1 a, table2 b
 WHERE a.id = b.id

--you better write
SELECT a.col1, a.col2
  FROM table1 a
 WHERE EXISTS (SELECT 1 FROM table2 b where a.id = b.id)  

I hope these few tips help; looking forward to more tips ;)

mcha
  • 2,938
  • 4
  • 25
  • 34
  • I think most databases nowadays create the same query plan for both IN and EXISTS – Magnus May 02 '11 at 14:57
  • 1
    @Magnus: correct, but NOT IN and NOT EXISTS are very different. @mcha: you used an ANSI-89 implicit join in the last example... should be ANSI-92 EXPLICIT join ;-) – gbn May 02 '11 at 15:21
  • @gbn Did some query tests on `NOT EXISTS` and `NOT IN` on SQL server 2008, they all created the same plan. – Magnus May 02 '11 at 15:32
  • @mcha I've never actually had the need to use HAVING; is there a case where `HAVING` cannot be replaced by `WHERE`? – apacay May 02 '11 at 15:34
  • @Magnus: ..and if you have NULL in the "NOT IN" list. Do you get the same results? Whatever the plan says, they are *semantically different constructs* and can give *different* results. Which is why you used NOT EXISTS for safety... – gbn May 02 '11 at 15:39
  • 2
    @apacay You use HAVING to use an aggregate in a comparison, such as `HAVING SUM(OrderDetail.Total) > 1500` – Joel C May 02 '11 at 16:05
  • @gbn Ok, I'll test that. (Myself, I always use EXISTS) – Magnus May 02 '11 at 16:16