
I'm a little compute-constrained, and I don't want to kill the database or the bandwidth by pulling data for hours.

My current query is something like this:

SELECT TOP (100000000) mycolumns
FROM mydb
WHERE lower_control_limit > value1
    AND upper_control_limit < value2;

I am not getting a decent random sample this way (TOP without an ORDER BY just returns whichever rows the engine happens to reach first), but I am afraid of issuing commands that have to traverse billions of rows before returning values. I need a uniform random sample over the domain (rows).

What is the "best practice" for pulling a few hundred million rows from a store that is many billions of rows in size?

I've seen the random-ordering approach, but I'm worried about the cost of that query given my limited computing and bandwidth resources.
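For reference, one pattern I've come across is TABLESAMPLE (a sketch only; I have not tried it on this table). It samples whole data pages rather than individual rows, so it is cheap but only approximately uniform:

-- Sketch: TABLESAMPLE picks random pages, not rows, so the sample is
-- only approximately uniform; REPEATABLE (42) makes it reproducible.
SELECT mycolumns
FROM mydb TABLESAMPLE SYSTEM (1 PERCENT) REPEATABLE (42)
WHERE lower_control_limit > value1
    AND upper_control_limit < value2;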

EngrStudent
  • If you remove the filter that limits your sample, does it not run for minutes before coming back with results? I guess it is not really a "sample" at that point. – sean Oct 08 '15 at 18:41
  • Nope. It drowns the entire RAM of my workstation and it freezes. It takes about a half-hour to pull, and I end up having to kill the process. – EngrStudent Oct 08 '15 at 18:45
  • Are you pulling this to SSMS? Are you running the query from a different machine? – sean Oct 08 '15 at 18:50
  • I'm running the database connection from JMP through an SQL connector to a remote database. I'm using the advanced tab to execute a manually typed query. – EngrStudent Oct 08 '15 at 18:51
  • Ok, it will take time to "download" the results to your computer and for JMP to process them. The size of the results is limited by how much memory your computer has. I think everything you're doing seems OK as far as bandwidth is concerned. You could have a DBA write the query results to a CSV file, zip it, then email it. – sean Oct 08 '15 at 19:01

1 Answer


Found this: Select n random rows from SQL Server table

 SELECT TOP (100000000) mycolumns
 FROM mydb
 ORDER BY NEWID()
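Note that ORDER BY NEWID() generates a GUID for every row and then sorts the whole table, which can be very expensive at billions of rows. A lighter sketch of the same idea filters on a per-row random value instead of sorting (the 1-in-100 threshold is an assumption; tune it for the sample size you need):

 -- Keep each row with roughly 1% probability instead of sorting everything.
 SELECT mycolumns
 FROM mydb
 WHERE lower_control_limit > value1
     AND upper_control_limit < value2
     AND ABS(CHECKSUM(NEWID())) % 100 = 0;

This still scans the table once, but it avoids the full sort and never materializes more rows than it keeps.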
sean
  • I'm going to try it, but if it runs for an hour or more then I'm going to have to kill the process and try something different. This has to run faster than that. – EngrStudent Oct 08 '15 at 18:56
  • If it does, you need to identify the bottleneck: is it the download, or is it JMP processing the results? To find out, you could use SSMS to export the query to a CSV and then load that in JMP. Just right-click the DB and choose Export, then Flat File... If the export is quick, then your bottleneck is JMP. You might ask for help about dealing with large sets in JMP. Or find another tool, or buy a bigger computer... – sean Oct 08 '15 at 19:07
  • Also, make sure you're not using WiFi; use Ethernet. – sean Oct 08 '15 at 19:08
  • I have another query just counting the rows... it too is taking some time. It is only going to return a few cells, but it has taken upwards of an hour. It is the SQL store, not the wire, not my workstation. I do like the idea of a better workstation. (muahaha http://www.eurocom.com/ec/configure(1,234,0)ec ) – EngrStudent Oct 08 '15 at 19:25
  • It took a good while to start counting the rows (~40 minutes) but now it is moving. The data transfer rate is consistent with the previous run. – EngrStudent Oct 08 '15 at 19:33
  • select count(1) from table1 where ... will show you whether the WHERE clause is what takes the time (see the metadata sketch after this thread for a way to count without scanning). – sean Oct 08 '15 at 20:09
  • It is not really fast. I do not know if the table has a primary key, but if you are asking whether there are uint64 row IDs in a B-tree that allow O(n*log(n)) worst-case operations, and faster (i.e., pretty fast) operations most of the time, I would like to think so, but I don't know. – EngrStudent Oct 08 '15 at 20:12
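
On the slow count: a metadata-based count avoids scanning the table entirely. A sketch, assuming SQL Server's system views (note it counts all rows; it cannot apply the control-limit filter):

 -- Approximate row count from partition metadata instead of a full scan.
 -- The figure can lag slightly behind in-flight transactions.
 SELECT SUM(p.row_count) AS approx_rows
 FROM sys.dm_db_partition_stats AS p
 WHERE p.object_id = OBJECT_ID('mydb')
     AND p.index_id IN (0, 1);  -- 0 = heap, 1 = clustered index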