
I am trying to retrieve around 200 billion rows from a remote SQL Server. To optimize this, I have limited my query to filter on a single indexed column and to select only a subset of columns, so the query looks like this:

    SELECT ColA, ColB, ColC FROM <Database> WHERE RecordDate BETWEEN '' AND ''

But unless I limit my query to a time window of a few hours, it fails every time with the following error:

    OLE DB provider "SQLNCLI10" for linked server "<>" returned message "Query timeout expired".
    Msg 7399, Level 16, State 1, Server M<, Line 1
    The OLE DB provider "SQLNCLI10" for linked server "<>" reported an error. Execution terminated by the provider because a resource limit was reached.
    Msg 7421, Level 16, State 2, Server <>, Line 1
    Cannot fetch the rowset from OLE DB provider "SQLNCLI10" for linked server "<>".

The timeout is probably an issue because of the time it takes to execute the query. As I do not have control over the server, I was wondering if there is a good way of retrieving this data beyond the simple SELECT I am using. Are there any SQL Server specific tricks I can use? Perhaps telling the remote server to paginate the data instead of my issuing multiple queries, or something else? Any suggestions on how I could improve this?

– Legend

7 Answers


This is more the kind of job SSIS is suited for. Even a simple flow like OLE DB Source -> OLE DB Destination would handle this, creating the necessary batching for you.

– Remus Rusanu
  • +1, I believe that this is a more suited answer, since trying to do a `ROW_NUMBER` over 200 billion rows is going to be a really difficult thing to accomplish – Lamak Jul 28 '11 at 23:04

Why read 200 billion rows all at once?

You should page them, reading, say, a few thousand rows at a time.

Even if you genuinely need to read all 200 billion rows, you should still consider using paging to break the read up into shorter queries - that way, if a failure happens, you just continue reading where you left off.

See "Efficient way to implement paging" for at least one method of implementing paging using ROW_NUMBER; a minimal sketch follows.
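
For illustration, a minimal ROW_NUMBER paging sketch; the linked server, database, table, and Id key column names below are placeholders, not from the question:

    -- Hypothetical names: LinkedServer, RemoteDb.dbo.Records, Id (a unique key).
    DECLARE @PageSize  bigint = 100000;
    DECLARE @PageNum   bigint = 1;       -- increment after each successful batch
    DECLARE @StartDate datetime = '20110101';
    DECLARE @EndDate   datetime = '20110201';

    SELECT ColA, ColB, ColC
    FROM (
        SELECT ColA, ColB, ColC,
               ROW_NUMBER() OVER (ORDER BY RecordDate, Id) AS rn
        FROM LinkedServer.RemoteDb.dbo.Records
        WHERE RecordDate BETWEEN @StartDate AND @EndDate
    ) AS numbered
    WHERE rn BETWEEN (@PageNum - 1) * @PageSize + 1
                 AND  @PageNum * @PageSize;

As Lamak's comment on the SSIS answer points out, numbering billions of rows gets expensive, so keep each date window narrow enough that the ROW_NUMBER scan stays manageable.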

If you are doing data analysis then I suspect you are either using the wrong storage (SQL Server isn't really designed for processing of large data sets), or you need to alter your queries so that the analysis is done on the Server using SQL.

Update: I think the last paragraph was somewhat misinterpreted.

Storage in SQL Server is primarily designed for online transaction processing (OLTP) - efficient querying of massive datasets in massively concurrent environments (for example reading / updating a single customer record in a database of billions, at the same time that thousands of other users are doing the same for other records). Typically the goal is to minimise the amount of data read, reducing the amount of IO needed and also reducing contention.

The analysis you are talking about is almost the exact opposite of this - a single client actively trying to read pretty much all records in order to perform some statistical analysis.

Yes SQL Server will manage this, but you have to bear in mind that it is optimised for a completely different scenario. For example data is read from disk a page (8 KB) at a time, despite the fact that your statistical processing is probably only based on 2 or 3 columns. Depending on row density and column width you may only be using a tiny fraction of the data stored on an 8 KB page - most of the data that SQL Server had to read and allocate memory for wasn't even used. (Remember that SQL Server also had to lock that page to prevent other users from messing with the data while it was being read).
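
To put illustrative (made-up) numbers on that: if each row is about 200 bytes and the two or three columns you need total about 24 bytes, an 8 KB page holds roughly 40 rows, and only around 12% of every page read and locked is data you actually use.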

If you are serious about processing / analysis of massive datasets then there are storage formats that are optimised for exactly this sort of thing - SQL Server also has an add-on service called Microsoft Analysis Services that adds online analytical processing (OLAP) and data mining capabilities, using storage modes more suited to this sort of processing.

– Justin
  • Since when is SQL Server not designed for large data sets? I've managed billions of rows in SQL Server and didn't face any challenges I wouldn't have faced in other RDBMS systems. – Aaron Bertrand Jul 29 '11 at 03:15
  • @Aaron -- SQL Server, especially with the add-ons that come with the Enterprise edition, is a superb platform for this sort of analysis. – James Anderson Jul 29 '11 at 03:28
  • @James I think we are in agreement. Kragen suggested that SQL Server isn't designed for this much data. Must be thinking of ancient versions. – Aaron Bertrand Jul 29 '11 at 03:33
  • @Kragen, SQL Server can handle billions of rows just fine and it has been able to do so for a long time now. Back in SQL 2000 (it was 2003 at the time) I was managing billions of rows in a single table. SQL Server 6.5 might have had issues with this much data load, but that was a VERY long time ago. – mrdenny Jul 29 '11 at 03:55
  • Actually the more I think about this, the more I feel that the statement that I am using the wrong storage is not very accurate. I have handled 160 billion with MySQL without problems and close to 180 billion in SQL Server 2005 before. It was just that before, I had the database on disk but never had to transfer it remotely so I never really had experience issuing such large queries over the network. – Legend Jul 29 '11 at 04:47
  • @James / everyone - SQL Server is designed for querying of large data sets, not processing large data sets - perhaps I should have emphasised this! :-) I've updated my answer to clarify what I meant. – Justin Jul 29 '11 at 09:42
  • @Kragen -- I think we agree really -- it's the Analysis Services add-ons that really make it rock! – James Anderson Jul 29 '11 at 09:46

Personally, I would use a data extraction tool such as BCP to get the data into a local file before trying to manipulate it, if I had to pull that much data at once.

http://msdn.microsoft.com/en-us/library/ms162802.aspx
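
As a rough sketch (the server, database, table, and date values below are placeholders, not from the answer), exporting one date window per file might look like:

    rem Native-format export of one date window; -T uses Windows authentication.
    bcp "SELECT ColA, ColB, ColC FROM RemoteDb.dbo.Records WHERE RecordDate BETWEEN '20110101' AND '20110102'" queryout batch_20110101.dat -S RemoteServer -T -n

Exporting in date-window batches also means a failure only costs you one file, which you can re-run on its own.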

– E.J. Brennan

This isn't a SQL Server specific answer, but even when the RDBMS supports server-side cursors, it's considered poor form to use them. A cursor means you are consuming resources on the server while it is still waiting for you to request more data.

Instead, you should reformulate your query usage so that the server can transmit the entire result set as soon as it can, and then completely forget about you and your query to make way for the next one. When the result set is too large to process all in one go, keep track of the last row returned by the current batch so that you can fetch another batch starting at that position, as in the sketch below.
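
A minimal sketch of that idea in T-SQL, using keyset ("seek") paging; the linked server, table, and Id column names are placeholders, and it assumes an index covering (RecordDate, Id):

    -- Fetch the next batch after the last (RecordDate, Id) pair seen; names are hypothetical.
    DECLARE @LastDate datetime = '19000101';
    DECLARE @LastId   bigint   = 0;

    SELECT TOP (100000) Id, RecordDate, ColA, ColB, ColC
    FROM LinkedServer.RemoteDb.dbo.Records
    WHERE RecordDate > @LastDate
       OR (RecordDate = @LastDate AND Id > @LastId)
    ORDER BY RecordDate, Id;
    -- Save the last RecordDate/Id returned, then re-run with those values.

Unlike ROW_NUMBER paging, each batch seeks straight to the last key seen (given a suitable index), so the cost per batch stays roughly constant.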

– SingleNegationElimination

Odds are the remote server has the "Remote Query Timeout" set. How long does it take for the query to fail?

– mrdenny
  • From SSMS, it is about 10 minutes. Using sqlcmd it is about 3 hours. – Legend Jul 29 '11 at 04:48
  • The 10 minute timeout really makes me think it's the remote timeout setting, as 10 minutes is the default. This setting would need to be changed on the remote side if that's the problem. Not sure why sqlcmd would run for 3 hours and then die - it might be doing something else before running: plan generation, waiting on something else, etc. – mrdenny Jul 30 '11 at 06:16

I just ran into the same problem; I also got the message at 10:01 after running the query.

Check the server's Connections properties: there's a remote query timeout setting that is set to 600 seconds by default, and you need to change it to zero (unlimited) or whatever other value you think is right.

– DukeNT

Try changing the remote server's query timeout property.

To do that, open SSMS, connect to the server, right-click the server's name in Object Explorer, then select Properties -> Connections and change the value in the "Remote query timeout (in seconds, 0 = no timeout)" box.
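
The same setting can also be changed with T-SQL on the remote server (a sketch; it requires permission to alter server settings):

    -- 'remote query timeout' is in seconds; 600 is the default, 0 means no timeout.
    EXEC sp_configure 'remote query timeout', 0;
    RECONFIGURE;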


– Andrey Morozov