
I have about 24 million rows of data which I need to insert into a table with a standardised format.

My insert query is as follows:

SELECT ...
INTO TABLE2
FROM TABLE1

This is how I usually perform this task, and there have been no issues with runtime until now: the query can run for about an hour and still not complete. Previously, such a query would take only minutes (fewer than 10), though that is likely also because I have never had to deal with more than 5 million rows.

Perhaps it is also worth noting that I perform many forms of manipulation on the data within the query (due to the standardisation requirement mentioned above). For example, the first few lines look as follows:

SELECT 'XXX' AS [ID],
        REPLACE([Column 1], 'ABC', 'DEF') AS [Column 1],
        CASE WHEN [Column 2] = '1' THEN 'X'
             WHEN [Column 2] = '2' THEN 'Y'
             ELSE NULL
        END AS [Column 2]
...

In other words, I use all sorts of functions in my SELECT INTO query, such as REPLACE, CASE WHEN, IIF and LEFT. I also use computational functions, such as FLOOR, CONVERT and division between two columns. There are a total of 73 columns in my SELECT INTO query, some of which are fixed values or NULL.
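
To give a flavour of the other kinds of expressions involved (the column names below are made up for illustration, not taken from my actual query), some of the later columns look something like this:

SELECT IIF(LEFT([Column 3], 3) = 'ABC', 'A', 'B') AS [Column 3],  -- IIF and LEFT
       FLOOR([Column 4] / [Column 5]) AS [Column 4],              -- division between two columns, then FLOOR
       CONVERT(DECIMAL(18, 2), [Column 6]) AS [Column 6],         -- CONVERT
       NULL AS [Column 7]                                         -- fixed NULL column
FROM TABLE1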

Thus, my question is whether there is any way to speed up my query. I use SQL a lot for my work but never formally learned it, so while I know that execution plans exist, I am not sure how they work; I suspect they have something to do with the slow runtime here. In particular, would it make more sense (in terms of total runtime) to execute my SELECT INTO query without all the computations/standardisations first, and then apply those in a second query?
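
For concreteness, this is roughly what I mean by doing it in two steps, as a sketch only, reusing the made-up column names from above:

SELECT [Column 1],
       [Column 2]
INTO TABLE2
FROM TABLE1;        -- step 1: plain copy, no transformations

UPDATE TABLE2
SET [Column 1] = REPLACE([Column 1], 'ABC', 'DEF'),
    [Column 2] = CASE WHEN [Column 2] = '1' THEN 'X'
                      WHEN [Column 2] = '2' THEN 'Y'
                      ELSE NULL
                 END;   -- step 2: apply the standardisations afterwards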

Finally, perhaps a few more details worth noting:

  1. This table does not exist prior to my SELECT INTO query, so I am creating it with the query.
  2. A lot of data cleaning and processing steps are performed before the SELECT INTO query, and if any of those steps need to be changed, I would have to run it again, which, in my opinion, is very inefficient for this particular case; hence my question about whether there are ways to speed this up.
  3. I am on SQL Server 2016.

A side question

Still on the topic of query runtime, but slightly digressing: I also notice that runtimes for the same query can vary a lot. This is obviously not significant when runtimes are generally small, but for this query, for example, it took 43 minutes to run today, whereas it did not even complete after more than an hour yesterday. What is the reason for such variance?


I also ask this from a broader perspective, as I am genuinely curious whether this is, conceptually, not an ideal way of inserting data into a table, so that I can improve my own processes in the future.

I would greatly appreciate any intuitive suggestions :)

Ethan Mark
  • In the first instance it would make sense to learn about execution plans so that you can understand what is happening performance wise. Ultimately that is the only way to know what is performing slowly because... – Dale K Aug 25 '23 at 02:42
  • SQL is a declarative language, meaning you describe to the database engine the results you want. But you aren't telling it how to return those results. It works that out based on a whole bunch of information which is displayed in the execution plan. Please provide your execution plan using ["Paste The Plan"](https://www.brentozar.com/pastetheplan/). – Dale K Aug 25 '23 at 02:43
  • But you can break it down, remove the insert into and see if a straight select is any faster. Probably not much if you're creating a new table. But then try commenting out columns, especially those with more calculations/function calls, and see if removing any of them speeds it up. – Dale K Aug 25 '23 at 02:45
  • Your side question is very general - probably too general. Server load will affect query time. But more importantly, query statistics - which I advise you to read up on. In summary, in order for SQL Server to be able to estimate what might be a good way to obtain data, it stores statistics about data it has obtained before. This obviously changes over time, and SQL Server will therefore adjust its query plans over time. Also, I assume you are providing different data each run? – Dale K Aug 25 '23 at 02:50
  • Just to check, are you allowed to share that query with the public? Based on that, you're limited by the insert; you can see most of the time is spent inserting data into the table. However that is only an estimate, and may not match a real run. Inserting into a new table with no constraints is mostly resource limited (CPU/RAM). However I would check a real execution plan. And I would carry out the analysis I suggested in my 3rd comment from the top (although nothing in your query stands out as being slow). – Dale K Aug 25 '23 at 03:13
  • Perhaps you can split your work into batches, each handling fewer rows (a rough sketch appears after these comments). Also, do you really need to repeatedly process 24 million rows? It sounds a bit strange :) – siggemannen Aug 25 '23 at 08:23
  • Consider if it's possible to run it as an `Insert` statement with minimal logging [INSERT (Transact-SQL) Best Practices](https://learn.microsoft.com/en-us/sql/t-sql/statements/insert-transact-sql?view=sql-server-ver16#best-practices) – Paolo de Sinno Aug 25 '23 at 10:36
  • @PaolodeSinno But why is `INSERT INTO ... SELECT ... FROM ...` considered "better" than `SELECT ... INTO ... FROM ...`? – Ethan Mark Aug 25 '23 at 14:39
  • @siggemannen I thought of that, but I wanted one single query just to make things look "prettier". Also, it's not really recurring, but one of my concerns is that I might (just might) have to run it again (maybe just once or twice more) or not at all, and in any case a query that takes this long didn't sound right to me either! – Ethan Mark Aug 25 '23 at 14:42
  • @EthanMark can I encourage you to research your questions before asking? You will learn a lot more. For example you asked about insert into, well a quick search returns https://stackoverflow.com/questions/6947983/insert-into-vs-select-into which should answer your question – Dale K Aug 26 '23 at 00:20
  • You can use the [Live Query Statistics](https://www.mssqltips.com/sqlservertip/3685/live-query-statistics-in-sql-server-2016/) feature to look at the execution plan in progress and see where the time is being spent – Martin Smith Aug 26 '23 at 22:40
    RE: "it took 43 minutes to run ... after more than an hour yesterday." - the additional time could be (for example) as (a) had different and more efficient execution plan in the fast case so less work to do. (b) was running at higher degree of parallelism in the fast case so did same amount of work but was able to do it in less elapsed time as hogging more CPU. (c) the slower execution was encountering waits that did not occur in the faster case (would need to look at the waits for the slower execution to give specific explanation - example less IO waits if data already in cache) – Martin Smith Aug 26 '23 at 22:59

0 Answers