I have about 24 million rows of data which I need to insert into a table with a standardised format.
My insert query is as follows:
```sql
SELECT ...
INTO TABLE2
FROM TABLE1
```
This is usually how I perform this task, and there have been no issues with runtime until now: the query can run for about an hour and still not complete. Previously, such a query would take only minutes (less than 10), but that is likely also because I have never had to deal with more than 5 million rows.
It is perhaps worth noting that I also perform many kinds of manipulation on the data in the query (due to the standardisation requirement mentioned above). For example, the first few lines look like this:
```sql
SELECT 'XXX' AS [ID],
       REPLACE([Column 1], 'ABC', 'DEF') AS [Column 1],
       CASE WHEN [Column 2] = '1' THEN 'X'
            WHEN [Column 2] = '2' THEN 'Y'
            ELSE NULL
       END AS [Column 2]
...
```
In other words, I use all sorts of functions in my `SELECT INTO` query, such as `REPLACE`, `CASE WHEN`, `IIF` and `LEFT`. I also use computational functions, such as `FLOOR`, `CONVERT` and division between two columns. There are 73 columns in total in my `SELECT INTO` query, some of which are fixed values or `NULL`.
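For instance, a computed column in my query looks something like the following (the column names here are placeholders, not my actual columns):

```sql
FLOOR([Column A] / [Column B]) AS [Column C],
CONVERT(VARCHAR(10), [Column D]) AS [Column D]
```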
Thus, my question is whether there is any way to speed up my query. I use SQL a lot for my work but did not formally learn it in college, so while I know that execution plans exist, I am not too sure how they work; I suspect they have something to do with the slow runtime here. In particular, would it make more sense (in terms of total runtime) to execute my `SELECT INTO` query without all the computations/standardisations first, and then do those in a second query?
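To be concrete, the two-step variant I have in mind would look roughly like this (table names are placeholders for illustration):

```sql
-- Step 1: raw copy, no transformations
SELECT [Column 1], [Column 2]  -- ... all 73 columns
INTO TABLE2_RAW
FROM TABLE1;

-- Step 2: apply the standardisations in a second pass
SELECT 'XXX' AS [ID],
       REPLACE([Column 1], 'ABC', 'DEF') AS [Column 1],
       CASE WHEN [Column 2] = '1' THEN 'X'
            WHEN [Column 2] = '2' THEN 'Y'
            ELSE NULL
       END AS [Column 2]
INTO TABLE2
FROM TABLE2_RAW;
```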
Finally, perhaps a few more details worth noting:
- This table does not exist prior to my `SELECT INTO` query, so I am creating it with the query.
- A lot of data cleaning and processing steps are done prior to the `SELECT INTO` query, and if any of those steps need to be changed, I would have to run this again, which, in my opinion, is very inefficient in this particular case; hence my question about ways to speed it up.
- I am on SQL Server 2016.
A side question
Still on the topic of query runtime but slightly digressing, I also notice that runtimes for the same query can have a lot of variance. This is obviously not significant when the runtime is generally small, but for example, in this query, it took 43 minutes to run today when it did not even complete after more than an hour yesterday. What's the reason for such variances?
I also ask this from a broader perspective, as I am genuinely curious whether this is, conceptually, not an ideal way of inserting data into a table, so that I can improve my own processes in the future.
I would greatly appreciate any intuitive suggestions :)