
I have to query a table with a few million rows and I want to do it in the most optimized way.

Let's suppose that we want to control access to a movie theater with multiple screening rooms and save it like this:

AccessRecord
  (TicketId,
   TicketCreationTimestamp,
   TheaterId,
   ShowId,
   MovieId,
   SeatId,
   CheckInTimestamp)

To simplify, the 'Id' columns are of data type 'bigint' and the 'Timestamp' columns are 'datetime'. Tickets are sold at any time and people enter the theater in random order. The primary key (and therefore unique) is TicketId.
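For reference, a minimal table definition matching this description might look like the following (whether CheckInTimestamp is nullable is my assumption; the rest follows the column list above):

CREATE TABLE AccessRecord
(   TicketId                bigint   NOT NULL PRIMARY KEY,
    TicketCreationTimestamp datetime NOT NULL,
    TheaterId               bigint   NOT NULL,
    ShowId                  bigint   NOT NULL,
    MovieId                 bigint   NOT NULL,
    SeatId                  bigint   NOT NULL,
    CheckInTimestamp        datetime NULL  -- assumed NULL until the ticket holder checks in
);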

For each Movie, Theater, and Show (time) I want to get the AccessRecord info of the first and the last person who entered the theater to see a movie. If two check-ins happen at the same time, I just need one of them, either one.

My solution would be to concatenate the check-in timestamp and the PK in a subquery to identify each row:

select
  AccessRecord.*
from
  AccessRecord
  inner join(
    select
      MAX(CONVERT(nvarchar(25),CheckInTimestamp, 121) + CONVERT(varchar(25), TicketId)) as MaxKey,
      MIN(CONVERT(nvarchar(25),CheckInTimestamp, 121) + CONVERT(varchar(25), TicketId)) as MinKey
    from
      AccessRecord
    group by
      MovieId,
      TheaterId,
      ShowId
  ) as MaxAccess
    on CONVERT(nvarchar(25),CheckInTimestamp, 121) + CONVERT(varchar(25), TicketId) = MaxKey
    or CONVERT(nvarchar(25),CheckInTimestamp, 121) + CONVERT(varchar(25), TicketId) = MinKey

Style 121 is the canonical (ODBC) expression of the datetime, which looks like this: yyyy-mm-dd hh:mi:ss.mmm (24h), so sorting it as a string gives the same result as sorting it as a datetime.

As you can see, this join is not very optimized. Any ideas?


Update with how I tested the different solutions:

I've tested all your answers on a real SQL Server 2008 R2 database with a table of over 3M rows to choose the right one.

If I get only the first or the last person who entered:

  • Joe Taras's solution takes 10 seconds.
  • GarethD's solution takes 21 seconds.

If I do the same but with the result ordered by the grouping columns:

  • Joe Taras's solution takes 10 seconds.
  • GarethD's solution takes 46 seconds.

If I get both the first and the last person who entered, with an ordered result:

  • Joe Taras's solution (doing a union) takes 19 seconds.
  • GarethD's solution takes 49 seconds.

The rest of the solutions (even mine) took more than 60 seconds in the first test, so I cancelled them.

Alex

5 Answers


Using analytic functions, more specifically ROW_NUMBER, may speed up the query; it should reduce the number of reads:

WITH CTE AS
(   SELECT  TicketId,
            TicketCreationTimestamp,
            TheaterId,
            ShowId,
            MovieId,
            SeatId,
            CheckInTimestamp,
            RowNumber = ROW_NUMBER() OVER(PARTITION BY MovieId, TheaterId, ShowId ORDER BY CheckInTimestamp, TicketId),
            RowNumber2 = ROW_NUMBER() OVER(PARTITION BY MovieId, TheaterId, ShowId ORDER BY CheckInTimestamp DESC, TicketId)
    FROM    AccessRecord
)
SELECT  TicketId,
        TicketCreationTimestamp,
        TheaterId,
        ShowId,
        MovieId,
        SeatId,
        CheckInTimestamp
FROM    CTE
WHERE   RowNumber = 1
OR      RowNumber2 = 1;

However, as always with optimisation, you are best placed to tune your own queries: you have the data to test with and all the execution plans. Try the query with different indexes; if you display the actual execution plan, SSMS will even suggest indexes to help your query. I would expect an index on (MovieId, TheaterId, ShowId) that includes CheckInTimestamp as a non-key column to help.
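For example, such an index could be created like this (the index name is illustrative):

CREATE NONCLUSTERED INDEX IX_AccessRecord_Movie_Theater_Show
ON AccessRecord (MovieId, TheaterId, ShowId)
INCLUDE (CheckInTimestamp);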

GarethD

Try this:

select a.*
from AccessRecord a
where not exists(
    select 'next'
    from AccessRecord a2
    where a2.movieid = a.movieid
    and a2.theaterid = a.theaterid
    and a2.showid = a.showid
    and a2.checkintimestamp > a.checkintimestamp
)

In this way you pick the row with the latest timestamp for the same movie, theater, and show.

TicketId (I suppose) is different for each row.
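The question's update mentions getting both the first and the last check-in by "doing a union"; the sketch below is my assumption of how the two NOT EXISTS queries might be combined:

-- last check-in per movie/theater/show
select a.*
from AccessRecord a
where not exists(
    select 'next'
    from AccessRecord a2
    where a2.movieid = a.movieid
    and a2.theaterid = a.theaterid
    and a2.showid = a.showid
    and a2.checkintimestamp > a.checkintimestamp
)
union all
-- first check-in per movie/theater/show
select a.*
from AccessRecord a
where not exists(
    select 'prev'
    from AccessRecord a2
    where a2.movieid = a.movieid
    and a2.theaterid = a.theaterid
    and a2.showid = a.showid
    and a2.checkintimestamp < a.checkintimestamp
)

Note that a group with a single check-in appears twice with UNION ALL (plain UNION would deduplicate it), and that ties on CheckInTimestamp return more than one row per group.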

Joe Taras
  • I don't think a subquery for every record is going to be very fast. – Wietze314 Sep 09 '13 at 18:56
  • @Wietze314 Actually, depending on the version of SQL Server, [NOT EXISTS will perform better than LEFT JOIN/ISNULL](http://stackoverflow.com/a/2246793/1048425); however, I think from at least 2008, possibly even 2005, both use an anti semi join, so both perform the same. – GarethD Sep 09 '13 at 18:58
  • You've used a left join. Try it first ;) – Joe Taras Sep 09 '13 at 18:59
  • Ok, I might have learned something new today then. I did see a lot of improvement in the past using the JOIN compared to NOT EXISTS. – Wietze314 Sep 09 '13 at 19:04
  • @Wietze314: Hi, yes, in the past you were right. Now they give the same result ;) Have a nice evening ;) – Joe Taras Sep 09 '13 at 19:15
  • @JoeTaras I'm completely amazed by the results. I didn't expect a `NOT EXISTS` to be a good solution. – Alex Sep 10 '13 at 12:18
This selects the last entry based on CheckInTimestamp; if two rows tie on the timestamp, it takes the one with the highest TicketId:

SELECT  R1.*
FROM    AccessRecord R1
        LEFT JOIN AccessRecord R2
          ON  R1.MovieId = R2.MovieId
          AND R1.TheaterId = R2.TheaterId
          AND R1.ShowId = R2.ShowId
          AND (   R1.CheckInTimestamp < R2.CheckInTimestamp
               OR (    R1.CheckInTimestamp = R2.CheckInTimestamp
                   AND R1.TicketId < R2.TicketId))
WHERE   R2.TicketId IS NULL

Of course an index on MovieId, TheaterId, and ShowId will help.

This is where I learned the trick

Wietze314

Either add new columns to the table and pre-convert the dates, or join the PK in the access table to a new table which already has the converted values sitting in it. Looking the conversion up in a table instead of doing it in the join will speed up your queries immensely. If you can give the access record an integer FK that points to the lookup (pre-converted values) table, then you avoid using dates at all and things will be phenomenally faster.

If you normalize the dataset and break it out into a star pattern, things will get even faster.
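A rough sketch of the lookup-table idea (the table and column names are illustrative, and it assumes the converted key is computed once when the row is inserted):

-- Hypothetical lookup table: the string key is computed once at insert time,
-- so queries join on this table instead of converting dates on the fly.
CREATE TABLE AccessRecordKey
(   TicketId   bigint       NOT NULL PRIMARY KEY
        REFERENCES AccessRecord (TicketId),
    CheckInKey nvarchar(50) NOT NULL   -- CONVERT(nvarchar(25), CheckInTimestamp, 121) + CONVERT(varchar(25), TicketId)
);

CREATE INDEX IX_AccessRecordKey_CheckInKey ON AccessRecordKey (CheckInKey);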

Dennis

You could also consider a UNION ALL query instead of that nasty OR. ORs are usually slower than UNION ALL queries.
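For instance, the final OR filter in the ROW_NUMBER answer above could be rewritten as a UNION ALL over the same CTE (a sketch; note that a row which is both the first and the last check-in of its group would then appear twice):

WITH CTE AS
(   SELECT  TicketId, TicketCreationTimestamp, TheaterId, ShowId, MovieId, SeatId, CheckInTimestamp,
            RowNumber  = ROW_NUMBER() OVER(PARTITION BY MovieId, TheaterId, ShowId ORDER BY CheckInTimestamp, TicketId),
            RowNumber2 = ROW_NUMBER() OVER(PARTITION BY MovieId, TheaterId, ShowId ORDER BY CheckInTimestamp DESC, TicketId)
    FROM    AccessRecord
)
SELECT  TicketId, TicketCreationTimestamp, TheaterId, ShowId, MovieId, SeatId, CheckInTimestamp
FROM    CTE
WHERE   RowNumber = 1
UNION ALL
SELECT  TicketId, TicketCreationTimestamp, TheaterId, ShowId, MovieId, SeatId, CheckInTimestamp
FROM    CTE
WHERE   RowNumber2 = 1;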

HLGEM