
I'm using SQL Server 2008. I have a table with over 3 million records, which is related to another table with a million records.

I have spent a few days experimenting with different ways of querying these tables. I have it down to two radically different queries, both of which take 6s to execute on my laptop.

The first query uses a brute-force method: it evaluates all candidate matches and removes the incorrect ones via an aggregated SUM of failed conditions.

The second gathers all candidate matches, then removes the incorrect ones via an EXCEPT query that uses two dedicated indexes to find the low and high mismatches.

Logically, one would expect the brute-force query to be slow and the index-based one to be fast. Not so. I have experimented heavily with indexes until I got the best speed out of each.

Further, the brute-force query doesn't require as many indexes, which means that technically it would yield better overall system performance (fewer indexes to maintain on writes).

Below are the two execution plans. If you can't see them, please let me know and I'll re-post them in landscape orientation or email them to you.

Brute-force query:

SELECT      ProductID, [Rank]
FROM        (
            SELECT      p.ProductID, ptr.[Rank], SUM(CASE
                            WHEN p.ParamLo < si.LowMin OR
                            p.ParamHi > si.HiMax THEN 1
                            ELSE 0
                            END) AS Fail
            FROM        dbo.SearchItemsGet(@SearchID, NULL) AS si
                        JOIN dbo.ProductDefs AS pd
            ON          pd.ParamTypeID = si.ParamTypeID
                        JOIN dbo.Params AS p
            ON          p.ProductDefID = pd.ProductDefID
                        JOIN dbo.ProductTypesResultsGet(@SearchID) AS ptr
            ON          ptr.ProductTypeID = pd.ProductTypeID
            WHERE       si.Mode IN (1, 2)
            GROUP BY    p.ProductID, ptr.[Rank]
            ) AS t
WHERE       t.Fail = 0

[Execution plan for the brute-force query]

Index-based exception query:

WITH si AS (
    SELECT      DISTINCT pd.ProductDefID, si.LowMin, si.HiMax
    FROM        dbo.SearchItemsGet(@SearchID, NULL) AS si
                JOIN dbo.ProductDefs AS pd
    ON          pd.ParamTypeID = si.ParamTypeID
                JOIN dbo.ProductTypesResultsGet(@SearchID) AS ptr
    ON          ptr.ProductTypeID = pd.ProductTypeID
    WHERE       si.Mode IN (1, 2)
)
SELECT      p.ProductID
FROM        dbo.Params AS p
            JOIN si
ON          si.ProductDefID = p.ProductDefID
EXCEPT
SELECT      p.ProductID
FROM        dbo.Params AS p
            JOIN si
ON          si.ProductDefID = p.ProductDefID    
WHERE       p.ParamLo < si.LowMin OR p.ParamHi > si.HiMax

[Execution plan for the index-based EXCEPT query]
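
The two dedicated indexes are essentially one keyed on the low bound and one on the high bound of dbo.Params. Roughly this shape (simplified and illustrative; the real names and column lists differ):

CREATE NONCLUSTERED INDEX IX_Params_Lo ON dbo.Params (ProductDefID, ParamLo) INCLUDE (ProductID);  -- illustrative
CREATE NONCLUSTERED INDEX IX_Params_Hi ON dbo.Params (ProductDefID, ParamHi) INCLUDE (ProductID);  -- illustrative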

My question is: based on the execution plans, which one looks more efficient? I realize that things may change as my data grows.

EDIT:

I have updated the indexes, and now have the following execution plan for the second query:

[Updated execution plan for the second query]

IamIC
  • When you say "uses brute force" - are you sure you know what is happening? The whole point of SQL is to give it a query and let the optimizer work out the best way to process it. Just because a naive approach (based on, say, the order in which conditions are expressed) looks expensive, it doesn't mean the optimizer can't do a decent job. – Damien_The_Unbeliever Jan 01 '11 at 18:59
  • @Damien_The_Unbeliever yes, I know what is happening. Indexes are being used to get probable data, and this is then evaluated manually row-by-row against a reference input (which has been expanded into these rows). When I say manually, I mean the data is grouped and processed using a CASE statement item-by-item. The expected way of using "normal SQL" ran at > 50 seconds. I know the last statement is vague. The point I'm making is that slapping a simple query together and expecting the optimizer to figure it out didn't work well *in this case*. – IamIC Jan 01 '11 at 19:06
  • @IanC - okay, agreed, but in this case "brute force" is a little difficult to interpret. What is more brute force than giving the actual query to the optimizer and letting it figure out the answer? – Damien_The_Unbeliever Jan 01 '11 at 19:13
  • @Damien_The_Unbeliever I'm using what I understand the phrase "brute force" to mean, which is, (thanks Google) "In computer science, brute-force search or exhaustive search, also known as generate and test, is a trivial but very general problem-solving technique that consists of systematically enumerating all possible candidates for the solution and checking whether each candidate satisfies the problem..." – IamIC Jan 01 '11 at 19:16
  • @IanC - as opposed to where you let SQL (potentially) generate all possible rows, and then filter them via `WHERE` or `HAVING` clauses? That's the bit I don't understand - in SQL you frequently generate large *potential* result sets, but then filter them down to manageable levels. – Damien_The_Unbeliever Jan 01 '11 at 19:20
  • By the way can you post the actual query? – Martin Smith Jan 01 '11 at 19:22
  • @Damien_The_Unbeliever now that I've posted the query, I trust you can see what I mean. – IamIC Jan 01 '11 at 20:14

4 Answers


Trust the optimizer.

Write the query that most simply expresses what you're trying to achieve. If you're having performance problems with that query, then you should look at whether there are any missing indexes. But you still shouldn't have to explicitly work with these indexes.
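
If you want to see which indexes the engine itself thinks are missing, the missing-index DMVs are a quick first look. A sketch (treat the output as suggestions to evaluate, not as ready-made index designs):

    SELECT      d.[statement] AS table_name,
                d.equality_columns, d.inequality_columns, d.included_columns,
                s.user_seeks, s.avg_user_impact
    FROM        sys.dm_db_missing_index_details AS d
                JOIN sys.dm_db_missing_index_groups AS g
    ON          g.index_handle = d.index_handle
                JOIN sys.dm_db_missing_index_group_stats AS s
    ON          s.group_handle = g.index_group_handle
    ORDER BY    s.user_seeks * s.avg_user_impact DESC;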

Don't concern yourself with considerations of how you might implement such a search.

In very rare circumstances, you may need to further force the query to use particular indexes (via hints), but this is probably < 0.1% of queries.
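
For completeness, a table hint looks like the following. This is purely illustrative - IX_Params_1 is just an index name visible in your plan, and @LowMin/@HiMax are stand-in variables - and forcing an index is almost never the right first move:

    -- Illustrative only: forcing a specific index with a table hint
    SELECT      p.ProductID
    FROM        dbo.Params AS p WITH (INDEX (IX_Params_1))
    WHERE       p.ParamLo < @LowMin OR p.ParamHi > @HiMax;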


In your posted plans, your "optimized" version is causing scans against 2 indexes of your (I presume) Params table (PK_Params_1, IX_Params_1). Without seeing the queries, it's difficult to know why this is happening, but when you compare a single scan of the table ("brute force") against two, it's easy to see why the second isn't more efficient.


I think I'd try:

    SELECT      p.ProductID, ptr.[Rank]
    FROM        dbo.SearchItemsGet(@SearchID, NULL) AS si
                JOIN dbo.ProductDefs AS pd
    ON          pd.ParamTypeID = si.ParamTypeID
                JOIN dbo.Params AS p
    ON          p.ProductDefID = pd.ProductDefID
                JOIN dbo.ProductTypesResultsGet(@SearchID) AS ptr
    ON          ptr.ProductTypeID = pd.ProductTypeID
                LEFT JOIN dbo.Params AS p_anti
    ON          p_anti.ProductDefID = pd.ProductDefID
                AND (p_anti.ParamLo < si.LowMin OR p_anti.ParamHi > si.HiMax)
    WHERE       si.Mode IN (1, 2)
                AND p_anti.ProductID IS NULL
    GROUP BY    p.ProductID, ptr.[Rank]

I.e. introduce an anti-join that eliminates the results you don't want.
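
The same anti-join can also be written with NOT EXISTS, which typically produces an identical plan and which some find easier to read. A sketch, untested against your schema:

    SELECT      p.ProductID, ptr.[Rank]
    FROM        dbo.SearchItemsGet(@SearchID, NULL) AS si
                JOIN dbo.ProductDefs AS pd
    ON          pd.ParamTypeID = si.ParamTypeID
                JOIN dbo.Params AS p
    ON          p.ProductDefID = pd.ProductDefID
                JOIN dbo.ProductTypesResultsGet(@SearchID) AS ptr
    ON          ptr.ProductTypeID = pd.ProductTypeID
    WHERE       si.Mode IN (1, 2)
                AND NOT EXISTS (SELECT 1
                                FROM   dbo.Params AS p_anti
                                WHERE  p_anti.ProductDefID = pd.ProductDefID
                                AND    (p_anti.ParamLo < si.LowMin OR p_anti.ParamHi > si.HiMax))
    GROUP BY    p.ProductID, ptr.[Rank]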

Damien_The_Unbeliever
  • I agree. There are two indexes because of the nature of what is being searched. Simply put, it's finding matches between a low and a high, hence the two. With the brute force, I ignore this and simply evaluate each item. – IamIC Jan 01 '11 at 19:19
  • @IanC - However you've written this optimized query - it's having to use these two indexes independently. And it's having to scan them (examine every row) rather than seek. Without seeing the actual query, I can't add much else. – Damien_The_Unbeliever Jan 01 '11 at 19:24
  • I'm struggling to see how the two posted queries are related, given that `dbo.ProductTypesResultsGet(@SearchID)` isn't used in the second one at all – Damien_The_Unbeliever Jan 01 '11 at 19:51
  • What is really confusing me is I added a covering index: ProductDefID (ProductID, ParamLo, ParamHi) at the optimizer's suggestion. I then removed any other indexes against ParamLo & ParamHi, and this is running at the fastest yet: 4 sec. I don't get this. ParamLo & ParamHi are being specifically queried against. – IamIC Jan 01 '11 at 20:09
  • Thanks. However, on my current indexes, this isn't working. I aborted at 6 minutes. I'm now getting 4s with what I posted. – IamIC Jan 01 '11 at 20:27

In SQL Server Management Studio, put both queries in the same query window and get the query plan for both at once. It should determine the query plans for both and give you a 'percent of total batch' for each one. The query with the lower percent of the total batch will be the better performing one.
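
For example, the comparison batch might look like this (a sketch only: enable "Include Actual Execution Plan" in SSMS first, paste in the real queries, and use a real @SearchID). The STATISTICS output gives you reads and CPU per statement as a cross-check on the plan percentages:

    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;

    DECLARE @SearchID int = 1;   -- placeholder value

    -- Query 1: the brute-force version goes here
    -- Query 2: the EXCEPT-based version goes here

    SET STATISTICS IO OFF;
    SET STATISTICS TIME OFF;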

goric
  • That can be an extremely unreliable measure as it relies on cardinality **estimates** even in the **actual** plan. `SET STATISTICS IO ON` can be unreliable as it doesn't include the impact of scalar UDFs. As far as I've found, the CPU, reads, and duration reported by Profiler are the most reliable indicators. – Martin Smith Jan 01 '11 at 19:09
  • It reported the brute-force one as 64%, and the index-based one as 36%. That makes sense, except for the question: Why do they execute in the same time? – IamIC Jan 01 '11 at 19:13
  • @IanC - How accurate are the cardinality estimates? If you mouse over the arrows are there any big discrepancies between actual and estimated rows? – Martin Smith Jan 01 '11 at 19:14
  • The analyzer suggested I add a covering index, which I did (I need to investigate why it wanted this as it requested them on already indexed columns). This has shifted the results and made the query-based one faster. The estimates look reasonable - perhaps a bit conservative at the end - from what I can see. – IamIC Jan 01 '11 at 19:27

Does 6 seconds on a laptop = 0.006 seconds on production hardware? The part of your queries that worries me is the clustered index scans shown in the query plans. In my experience, any time a query plan includes a CI scan it means the query will only get slower as data is added to the table.

What do the two functions yield, as it appears they are the cause of the table scans? Is it possible to persist that data in the database and update the LowMin and HiMax values as rows are added?
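
For example (a sketch only; the table name, column types, and refresh mechanism are all made up, since the function definitions aren't posted), persisting the expanded search criteria could look something like this, refreshed whenever a search's criteria change:

    -- Hypothetical cache of what SearchItemsGet() expands to, so the search
    -- query reads a small indexed table instead of re-running the function
    CREATE TABLE dbo.SearchItemsCache
    (
        SearchID     int NOT NULL,
        ProductDefID int NOT NULL,
        LowMin       int NOT NULL,   -- type assumed; match whatever LowMin/ParamLo really are
        HiMax        int NOT NULL    -- type assumed; match whatever HiMax/ParamHi really are
    );

    CREATE CLUSTERED INDEX IX_SearchItemsCache ON dbo.SearchItemsCache (SearchID, ProductDefID);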

Looking at the two execution plans, neither is very good. Look how far to the left the wide lines are; wide lines mean many rows. We need to reduce the number of rows earlier in the process so that we are not working with such large hash tables, sorts, and nested loops.

BTW how many rows does your source have and how many rows are included in the result set?

RC_Cleland
  • @RC_Cleland I don't believe 6 sec (sometimes 4, depending on the direction of the wind) means much better on production hardware. Not unless SQL Server can really make use of multiple CPU cores for such a query. Re. the CI scans, some of them are a little confusing, i.e., I'm not sure why a seek isn't being done. I can do away with the large scan. Ironically, it makes the query a little slower, but I prefer the design as it will scale better (as you mention). – IamIC Jan 02 '11 at 06:25
  • @RC_Cleland Re. your 3rd para, this is spot-on, and I conclude that the real solution lies in adding another layer of hierarchy into the DB design. This will reduce the amount of data to be processed by 3 or more orders of magnitude. I don't see another solution. I had tried to avoid this, but processing > 3 million rows, which returns about 400k rows, is going to be slow. I suspect that 4 - 6s for such a size is probably good. – IamIC Jan 02 '11 at 06:26
  • I also realized that in my test environment, some of the tables are lightly populated, which will automatically result in SQL Server choosing an index scan vs. a seek. Re. the CI scan, to clarify, that I can handle, although it slows the query down. – IamIC Jan 02 '11 at 06:41

Thank you all for your input and help.

From reading what you wrote, experimenting, and digging into the execution plan, I discovered the answer is the tipping point.

There were too many records being returned to warrant use of the index.

See Kimberly Tripp's article on the tipping point.
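
As a rough rule of thumb from that article, a nonclustered index stops being worthwhile once the number of rows touched exceeds roughly a quarter to a third of the table's page count. Assuming Params runs to tens of thousands of pages (an assumed figure), that puts the tipping point in the low tens of thousands of rows; a search returning ~400k rows is far past it, so the scans are the cheaper option.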

IamIC