0

I have a very large table in a MySQL database with a column named exa_id and more than 10,000,000 rows. I want to randomly and efficiently select just 1000 rows through a pandas.read_sql call in Python. How can I write the code?

The SQL select exa_id from table_name order by rand() limit 1000 performs really badly, so I'd like to find another way.

One more note: the contents of column exa_id are strings, like 'uudjsx-2220983-df', 'ujxnas-9800xdsd-d2', ..., not an auto-incrementing sequence.
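
For reference, this is roughly what I'm running now through pandas (a minimal sketch; the connection string is a placeholder), and it is far too slow:

import pandas as pd
from sqlalchemy import create_engine

# Placeholder credentials: replace with your own MySQL connection details.
engine = create_engine("mysql+pymysql://user:password@localhost/mydb")

# ORDER BY RAND() sorts all 10,000,000+ rows just to keep 1000 of them.
query = "select exa_id from table_name order by rand() limit 1000"
df = pd.read_sql(query, engine)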

CoffeeSun

2 Answers

1

This works under most circumstances:

select exa_id
from table_name t
where rand() < 2000 / 10000000
order by rand()
limit 1000;

The where clause keeps approximately 2000 rows; there is some statistical variability. The order by rand() then shuffles those rows and the limit returns 1000 of them.

If you don't know the number of rows, you can do:

select t.exa_id
from table_name t cross join
     (select count(*) as cnt from table_name) tt
where rand() < 2000 / tt.cnt
order by rand()
limit 1000;
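
To get this into pandas, you can pass either query straight to pandas.read_sql. A minimal sketch, assuming a SQLAlchemy engine (the connection string below is a placeholder):

import pandas as pd
from sqlalchemy import create_engine

# Placeholder credentials: adjust for your own server and schema.
engine = create_engine("mysql+pymysql://user:password@localhost/mydb")

sample_sql = """
    select t.exa_id
    from table_name t cross join
         (select count(*) as cnt from table_name) tt
    where rand() < 2000 / tt.cnt
    order by rand()
    limit 1000
"""

# Only ~2000 rows survive the where filter, so the final sort is cheap.
sample_df = pd.read_sql(sample_sql, engine)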
Gordon Linoff
0

This query will help you. It picks a random id between 1 and the current maximum and returns the rows starting from that point.

SELECT name  FROM random AS r1
JOIN (SELECT CEIL(RAND() * (SELECT MAX(id) FROM random)) AS id) AS r2 
WHERE r1.id >= r2.id
ORDER BY r1.id ASC
LIMIT 100
JohnR