
I've got a table with hundreds of thousands of rows. Each row describes an item listed for sale, along with when the item was placed on the market and when it was sold (if it has been sold yet).

I'm looking to calculate the number of items that were active in a given month with other given criteria (the other criteria have enough permutations that pre-calculating these numbers would not be ideal). I'm trying to retrieve 120 months' worth of data and average it on a webpage, so ideally the entire query would be as fast as possible. I've got a query that works... but it takes 5 seconds to run for only 12 months of data, and I need it to be at least 10x faster than that so I can display a 10-year average in a decent time.

Here is an example of what I'm using now:

SELECT Avg(inv)
FROM   (
    SELECT
           CASE
                  WHEN Count(*) =0 THEN NULL
                  ELSE Count(*)
           END AS inv
    FROM   listings
    WHERE  originalentrytimestamp <= @DateOfMonthEnd
    AND    @DateOfMonthEnd < @currentMonthStart
    AND    (
                  offmarketdate > @DateOfMonthEnd
           OR     offmarketdate IS NULL)
    AND    city='New York'
    AND    listprice >= 0
    AND    listprice <= 999999999
    AND    category1='RESI'
    AND    category2 IN('D','A','S','R','O')
    UNION ALL
    SELECT
           CASE
                  WHEN Count(*) =0 THEN NULL
                  ELSE Count(*)
           END AS inv
    FROM   listings
    WHERE  originalentrytimestamp <= Dateadd(month, 1,@DateOfMonthEnd)
    AND    Dateadd(month, 1,@DateOfMonthEnd) < @currentMonthStart
    AND    (
                  offmarketdate > Dateadd(month, 1,@DateOfMonthEnd)
           OR     offmarketdate IS NULL)
    AND    city='New York'
    AND    listprice >= 0
    AND    listprice <= 999999999
    AND    category1='RESI'
    AND    category2 IN('D','A','S','R','O')
    UNION ALL
    SELECT
           CASE
                  WHEN Count(*) =0 THEN NULL
                  ELSE Count(*)
           END AS inv
    FROM   listings
    WHERE  originalentrytimestamp <= Dateadd(month, 2,@DateOfMonthEnd)
    AND    Dateadd(month, 2,@DateOfMonthEnd) < @currentMonthStart
    AND    (
                  offmarketdate > Dateadd(month, 2,@DateOfMonthEnd)
           OR     offmarketdate IS NULL)
    AND    city='New York'
    AND    listprice >= 0
    AND    listprice <= 999999999
    AND    category1='RESI'
    AND    category2 IN('D','A','S','R','O')
     ) AS monthly

I've only included 3 months in this code for brevity, but this quickly gets unwieldy if I want to run 120 months. It's also slow, but I'm not sure how to re-word it to use a window function or GROUP BY clause, as a single listing might be counted in both January and February if it took a while to sell.

The CASE expression is there so the code can also calculate year-to-date averages, where the range might include a future month with no data, and that month should be excluded from the average (it wouldn't be necessary if the entire thing could be re-coded in a format that accepts a start and end month rather than hardcoding 120 queries).
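The 120 hand-written UNION ALL branches can be collapsed into a single query by generating the month-end dates as rows and joining the listings against them. The sketch below assumes the column and parameter names from the question; `@monthsBack` and the `months`/`perMonth` aliases are hypothetical names introduced for illustration. `NULLIF(COUNT(...), 0)` reproduces the CASE logic so empty months are excluded from the average:

```sql
DECLARE @monthsBack int = 120;

-- Generate one row per month-end offset from @DateOfMonthEnd.
WITH months AS (
    SELECT TOP (@monthsBack)
           DATEADD(month,
                   ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1,
                   @DateOfMonthEnd) AS monthEnd
    FROM   sys.all_objects   -- any table with at least @monthsBack rows
)
SELECT AVG(inv)
FROM   (
    SELECT NULLIF(COUNT(l.originalentrytimestamp), 0) AS inv
    FROM   months m
    LEFT JOIN listings l
           ON  l.originalentrytimestamp <= m.monthEnd
           AND (l.offmarketdate > m.monthEnd OR l.offmarketdate IS NULL)
           AND l.city = 'New York'
           AND l.listprice BETWEEN 0 AND 999999999
           AND l.category1 = 'RESI'
           AND l.category2 IN ('D','A','S','R','O')
    WHERE  m.monthEnd < @currentMonthStart
    GROUP BY m.monthEnd
) AS perMonth;
```

The LEFT JOIN keeps a row for every month even when no listings match, which is what makes the NULLIF trick work; a listing that stays on the market across several month-ends is simply joined to each of those month rows, so the double-counting the question worries about happens naturally.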

edit from comments discussion:

This table is a copy of the live data made daily and not written to mid-day, any indexes can be added or removed freely to speed up the query, (but the example where clauses for city, list price, etc are only examples, we have the indexes we have so far because we might search on different clauses there.)

  • Why don't you use `COUNT` over `PARTITION` by date in yyyy-MM format, in some way like `COUNT(*) OVER (PARTITION BY FORMAT(originalentrytimestamp, 'yyyy-MM'))`? – Alexey Jul 04 '22 at 17:58
  • @Alexey that sounds like what I want to do, but I can't get the syntax right. Can you post an example as an answer? When I tried to do that, I ended up with 4038 rows for the first month with a variety of values for count, (none of them matching the 4038 value I'd expect for the first month) – Brian Heward Jul 04 '22 at 18:09
  • What DBMS do you use? – Alexey Jul 04 '22 at 18:18
  • I'm running on MSSQL – Brian Heward Jul 04 '22 at 18:18
  • No - don't use FORMAT as suggested. That function is an expensive operation for the database engine. Use a calendar table - simple searching will find many discussions about how to do that. In addition, do you understand the significant difference between UNION and UNION ALL? Here you want the latter for a correct calculation. – SMor Jul 04 '22 at 18:38
  • the rows being union-ed are the monthly counts, I do want to keep duplicate counts as the average should be skewed more towards that number if 2 months happen to have the same inventory counts. So you're right, I should fix that. – Brian Heward Jul 04 '22 at 18:40
  • Fyi case is an expression not a statement. – Dale K Jul 04 '22 at 19:02
  • For performance related questions we need to see the execution plan, because SQL is a declarative language where you are telling the SQL engine *what* results you want, not *how* to obtain them. So we need to see what decisions it made that caused a slow query – Dale K Jul 04 '22 at 19:03
  • @DaleK how do I post that? I can see that the operation that takes the most time is a clustered index seek that costs 7% and was run 13 times – Brian Heward Jul 04 '22 at 19:16
  • Please share the query plan via https://pastetheplan.com. Please also show us your table *and index* definitions. – Charlieface Jul 04 '22 at 19:33
  • For sheer performance from this exact query, I would consider a filtered index `(originalentrytimestamp, offmarketdate) INCLUDE (city, listprice, category1, category2) WHERE (city = 'New York' AND listprice >= 0 AND listprice <= 999999999 AND category1 = 'RESI' AND category2 IN('D','A','S','R','O'))` or possibly an indexed view. But it depends what inserts, updates etc you are running on this table as to whether that index would slow them down. Given the right index, there appears little to be had in terms of efficiency from the query, except for the `OR`. – Charlieface Jul 04 '22 at 19:42
  • execution plan: https://www.brentozar.com/pastetheplan/?id=rJtyNali5 table structure and indexes: https://pastebin.com/UxRTSDAx This table is a copy of the live data made daily and not written to mid-day, any indexes can be added or removed freely to speed up the query, (but the example where clauses for city, list price, etc are only examples, we have the indexes we have so far because we might search on different clauses there.) – Brian Heward Jul 04 '22 at 19:58
  • Please do [edit] the execution plan into your question - comments get removed eventually. – Dale K Jul 05 '22 at 00:20
  • This kind of problem may benefit from using a non-clustered columnstore index. It can help you do report-style queries more quickly than in row mode over b-trees. Please also consider either removing the parameters or using option(recompile) as well, at least for testing, to see what query performance you can achieve – Conor Cunningham MSFT Jul 05 '22 at 12:38
  • @Charlieface plan and table definition edited into question. – Brian Heward Jul 05 '22 at 16:01
  • RE: "This table is a copy of the live data .....". It may help adding persisted computed column(s) that would store year and month data either as separate columns or as a single integer column (yyyymm e.g. 202207). You should be able to then write `PARTITION BY` or `GROUP BY` statements easily. – Alex Jul 06 '22 at 02:10
  • If you use SSMS -- Microsoft's SQL Server Management Studio -- this tip is for you: Right-click in a query window, then select Show Actual Execution Plan, then run the query. The execution plan display sometimes recommends a new index to create. – O. Jones Jul 06 '22 at 11:05
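Charlieface's filtered-index suggestion from the comments can be written out roughly as below. The index name is hypothetical, and all predicates in a filtered index's WHERE clause must be constants (so the parameterized query may need OPTION (RECOMPILE), as Conor Cunningham's comment suggests, for the optimizer to match it):

```sql
-- Sketch of the filtered index suggested in the comments; whether it
-- pays off depends on how selective these fixed predicates are and on
-- the write load when the nightly copy is refreshed.
CREATE NONCLUSTERED INDEX IX_listings_active_resi_ny
ON dbo.listings (originalentrytimestamp, offmarketdate)
WHERE city = 'New York'
  AND listprice >= 0 AND listprice <= 999999999
  AND category1 = 'RESI'
  AND category2 IN ('D','A','S','R','O');
```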

2 Answers


Try a covering index on (city, category1, category2, originalentrytimestamp) including the offmarketdate and listprice columns. DDL like this might create the index you need.

CREATE NONCLUSTERED INDEX [CityCategoriesTimestamp] ON [dbo].[listings]
(
    [city] ASC,
    [category1] ASC,
    [category2] ASC,
    [originalentrytimestamp] ASC
)
INCLUDE (
    [offmarketdate],
    [listprice]
);

This works because the index can be scanned sequentially from the first eligible row to the last. That's faster than scanning the whole table.

And, consider reworking your query to use EOMONTH(). Get rid of the big UNION ALL cascade, instead doing GROUP BY(originalentrytimestamp). I'd suggest a rewrite but I'm not sure I understand your business logic well enough to do that.
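EOMONTH takes an optional month offset, so the month-end dates themselves don't need a chain of DATEADD calls. A minimal sketch, assuming `@start` is the first month of the range:

```sql
-- Generate month-end dates with EOMONTH's offset argument instead of
-- repeated DATEADD(month, n, ...) expressions; extend the VALUES list
-- (or use a numbers table) for 120 offsets.
DECLARE @start date = '2012-07-01';

SELECT EOMONTH(@start, n) AS monthEnd
FROM   (VALUES (0), (1), (2)) AS v(n);
```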

O. Jones
  • A single row in the table can fit in more than one of the select statements if it took more than a month to sell. How would a group by originalentrytimestamp work with that? Also originalentrytimestamp is almost a unique key, (some things might happen to have been listed at the exact same time, but that's unlikely) I'm not sure how using that as a group by would help without a full example. (I would love to remove the cascading union alls, just not sure how to maintain the business logic while doing it.) – Brian Heward Jul 06 '22 at 17:03

I ended up getting around this by doing a pre-calculation of these inventory levels for a bunch of possible criteria. It's not ideal as it limits what other criteria I can apply to this search and the pre-calc table is taking a significant amount of space, but it will work for the specific business case I'm working with.
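The answer doesn't show the pre-calc table, but a plausible shape for it, given the criteria in the question, would be one row per month and criteria combination, refreshed along with the nightly copy (table and column names here are hypothetical):

```sql
CREATE TABLE dbo.monthly_inventory (
    monthEnd   date        NOT NULL,  -- the @DateOfMonthEnd being summarized
    city       varchar(50) NOT NULL,
    category1  varchar(10) NOT NULL,
    category2  char(1)     NOT NULL,
    inv        int         NOT NULL,  -- count of listings active at monthEnd
    CONSTRAINT PK_monthly_inventory
        PRIMARY KEY (monthEnd, city, category1, category2)
);
```

Averaging over a date range then becomes a single `SELECT AVG(inv)` with equality predicates, at the cost the answer notes: criteria not baked into the key (such as the list-price range) can no longer be applied.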

Brian Heward