2

I have the following query that I've ran:

SELECT TOP 100 certs.CertId, COUNT(cluster.BGTJobId) C
FROM [CentralDB_US_33].[dbo].[JobSkillClusterIndex] cluster 
INNER JOIN [Eagle].[raw].[certs] certs 
  ON certs.BGTJobId = cluster.BGTJobId
GROUP BY cluster.skillClusterId, certs.CertId

Ultimately, I want to get the full results and not just the top 100, but for previewing purposes, is this the fastest way to go?

Ecanales
  • 81
  • 1
  • 8
  • 2
    The `SELECT TOP` clause is used to specify the number of records to return... – tomloprod Nov 07 '16 at 16:51
  • 4
    I don't know what a preview is. – Strawberry Nov 07 '16 at 16:51
  • 3
    This depends on what you mean by preview-- what do you mean? – Hogan Nov 07 '16 at 16:52
  • 4
    I guarantee that `TOP 1` will be faster. – Radu Gheorghiu Nov 07 '16 at 16:53
  • @Strawberry I just mean a smaller amount of rows, to get an idea of whether my database and query make sense – Ecanales Nov 07 '16 at 16:59
  • 4
    No it is not guaranteed to be faster - [This question](http://stackoverflow.com/q/6286837/1048425) is case in point - despite the original query only returning 35 rows, adding top 30 actually increases the execution time by a factor of 10-100. Adding `TOP` will *usually* make the query faster, but it can completely alter the execution plan, the introduction of a `TOP` resulting in a switch from a merge join to a nested loop join, and a massive increase in execution time. – GarethD Nov 07 '16 at 17:01
  • @GarethD The question you pointed to has a `ORDER BY`, whereas the query in this scenario doesn't. I'd still be willing to bet that in this scenario `TOP 1` will be faster. – Radu Gheorghiu Nov 07 '16 at 17:06
  • @RaduGheorghiu just ran TOP 1 and it took 14 minutes, just like the TOP 100. Time to pay up – Ecanales Nov 07 '16 at 17:28
  • @Ecanales Have you used `SET STATISTICS TIME ON` to test the difference? Anyway, since you mentioned the execution time of the query then your problem is elsewhere. The `GROUP BY` that you're doing is making an implicit sort of the data, which in your case (I'm assuming you have a lot of rows). – Radu Gheorghiu Nov 07 '16 at 18:07
  • 2
    @Ecanales You could add more information and show the execution plan of your query by pasting the XML of the plan [**here**](https://www.brentozar.com/pastetheplan/) and sharing it with the rest of us. Still, `TOP 1` is faster, regardless, because it is retrieving fewer rows and displaying fewer rows. Even if it's marginally faster than `TOP 100`, it still is. – Radu Gheorghiu Nov 07 '16 at 18:21
  • Based solely upon the fact that top 100 and top 1 both take 14 minutes, I would guess the optimiser is doing a hash aggregate with one or more clustered index scans on big tables involved, it doesn't matter what limit you apply, the full scan still has to take place to get an accurate count because something that eventually ends up on the top x, might be both the first and last record scanned. – GarethD Nov 07 '16 at 19:23
  • You either need to review your indexes, or if it is only for preview purposes you might consider using [`TABLESAMPLE`](https://technet.microsoft.com/en-us/library/ms189108(v=sql.105).aspx) - rather than selecting the top x rows of the result, it will just return a sample number of rows from each table, thus reducing the amount of work in the aggregate. – GarethD Nov 07 '16 at 19:26
  • I think both @GarethD and RaduGheorghiu are partially right. Gareth is right when he says that it has to run the whole process regardless of the TOP 100. And Radu is right when he says that the long execution time is due in part to table design. I've found little difference between executing a TOP vs normal query in terms of execution time. – Ecanales Nov 07 '16 at 19:34
  • You say this is for preview purposes, but bear in mind that `TOP` without `ORDER BY` is "whichever rows are most convenient to obtain". This could mean that the sample of rows you retrieve are highly unrepresentative of the whole population of rows. – Damien_The_Unbeliever Nov 08 '16 at 07:25

2 Answers2

1

Since you've mentioned this is for preview purposes, so I'm assuming you just want data out of the query and you want it to run FAST regardless of the data it returns, and seeing that you mentioned that the query takes 14 minutes to execute, a quick 'hack-fix' would be to use something like below:

SELECT
    certs.CertId
    , COUNT(cluster.BGTJobId)
FROM
    (SELECT TOP 100 
        certs.CertId
    FROM [Eagle].[raw].[certs] certs) certs
INNER JOIN [CentralDB_US_33].[dbo].[JobSkillClusterIndex] cluster 
    ON certs.BGTJobId = cluster.BGTJobId
GROUP BY cluster.skillClusterId, certs.CertId

Aggregating data (in your case COUNT) is a very expensive operation and should be done only at the last part of the query on as little data as possible. That is why, for "preview" purposes I have selected onyl the first 100 certificates and made the COUNT on that data.

However, because you mentioned that the query takes 14 minutes to run, the problems are elsewhere and usually this is due to design (query design, index design or even table design).

  • You should ask yourself if you really want to go over all of the data in the tables and get all of the matching rows from both tables, and aren't you possibly missing a WHERE clause?

  • If you do decide that there is a WHERE clause needed, are there any indexes to help filtering the data based on the conditions of your WHERE clause (and even the join columns - certs.BGTJobId and cluster.BGTJobId?

Radu Gheorghiu
  • 20,049
  • 16
  • 72
  • 107
0

Yes top select query is fastest for preview purposes that why its also shown in management studio GUI right click. But if you are running custom query just check that where clause / grouping etc are done is part of the clustered index.

Singh_A22
  • 16
  • 3
  • Ignore the names of the tables, they are subjectively defined taxonomies of the database I'm working with. – Ecanales Nov 07 '16 at 16:57