0

I am new to using SQL and I have a large dataframe consisting of 40,000,000 rows and 30 columns. What code can I use to subset my data into tables which contain 5,000,000 rows each.

I have tried using the code:

SELECT TOP 50000000 [X],
           ,[Y]
           ,[Z]
INTO [NEW_TABLE]
FROM [OLD_TABLE]

but I do not know how to alter my code to allow me to produce a table with the other rows.

Beum
  • 61
  • 5
  • you will have to drop [new_ table] and recreate it again. – Rima Apr 09 '19 at 14:27
  • Why would you want to? I suppose you could `partition the table` based on some factors in the data. But 4,000,000 is nothing for a RDBMS Also; Top w/o an order by isn't wise as order could vary based on execution and thus the same record could occur in multiple sets and some excluded. – xQbert Apr 09 '19 at 14:27
  • 2
    you could use ntile(4)over(partition by x, y, z order by x, y, z) this will give you a ranking of each quartile (for your purposes this should split the result set in to 4 chunks of 1 million rows each). – Dwight Reynoldson Apr 09 '19 at 14:28
  • @DwightReynoldson Sorry I made a mistake when writing my post and have edited changes. How can I split my data if i didnt want to split it into quartiles? – Beum Apr 09 '19 at 14:41
  • with ntile you can specify however many groups you like just change the number 4 for whatever number of groups you require. – Dwight Reynoldson Apr 09 '19 at 15:06

1 Answers1

0

If this is a one-off exercise, try running 8 statements changing the offset and fetch values. offset skips the provided number of rows, and fetch returns the number of rows indicated after the skip:

SELECT [X],
           ,[Y]
           ,[Z]
INTO [NEW_TABLE1]
FROM [OLD_TABLE]
ORDER BY [X] -- Or whatever your primary key column is
         OFFSET 0 ROWS
         FETCH NEXT 5000000 ROWS ONLY

then

SELECT [X],
           ,[Y]
           ,[Z]
INTO [NEW_TABLE2]
FROM [OLD_TABLE]
ORDER BY [X] -- Or whatever your primary key column is
         OFFSET 5000000 ROWS
         FETCH NEXT 5000000 ROWS ONLY

then

SELECT [X],
           ,[Y]
           ,[Z]
INTO [NEW_TABLE3]
FROM [OLD_TABLE]
ORDER BY [X] -- Or whatever your primary key column is
         OFFSET 10000000 ROWS
         FETCH NEXT 5000000 ROWS ONLY

and so on...

If this needs to be a regular task, shout and I will show you how to loop it.

TJB
  • 787
  • 1
  • 8
  • 29
  • This worked thank you! My SQL-server disconnects if i try to do more than 1,000,000 rows, do you know of any way to overcome this? – Beum Apr 09 '19 at 15:21
  • create it as a stored procedure and try run it through SQL agent, it might be your client timing out. – TJB Apr 09 '19 at 15:31
  • I'm so sorry if this question is very basic but I'm new to SQL. How do i create a stored procedure and run it through SQL agent? – Beum Apr 09 '19 at 15:34
  • I'll provide you with reading on this. Stored procedures: https://learn.microsoft.com/en-us/sql/relational-databases/stored-procedures/create-a-stored-procedure?view=sql-server-2017 ; Run stored procedure: https://stackoverflow.com/questions/1216272/how-do-i-execute-a-stored-procedure-in-a-sql-agent-job and https://learn.microsoft.com/en-us/sql/ssms/agent/create-a-transact-sql-job-step?view=sql-server-2017 – TJB Apr 09 '19 at 15:38