
I can't find a good answer to my problem.

I have a MySQL query with an inner join, an order by rand(), and a limit X. When I remove the order by rand(), the query is 10 times faster. Is there a more efficient way to get a random subset of 500 rows? Here's a sample query:

Select * from table1 
inner join table2 on table1.in = table2.in
where table1.T = A
order by rand()
limit 500;
– Conrad Lewis

2 Answers


This should help:

Select *
from table1 inner join
     table2
     on table1.in = table2.in
where table1.T = A and rand() < 1000.0/20000.0
order by rand()
limit 500

This will limit the result set to about 1,000 random rows before extracting a random sample of 500 (here 20000 is the estimated size of the full result set, so each row is kept with probability about 5%). The point of pulling roughly twice as many rows as needed is to make sure the sample is large enough.
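If the size of the set is dynamic (a point raised in the comments below), the hard-coded 20000 can be replaced with a count computed on the fly, which is the variant Conrad reports working. A minimal sketch, using the question's placeholder table and column names:

Select *
from table1 inner join
     table2
     on table1.in = table2.in
where table1.T = A and
      -- keep each row with probability ~1000/N, where N is the actual row count
      rand() < 1000.0 / (select count(*)
                         from table1 inner join
                              table2
                              on table1.in = table2.in
                         where table1.T = A)
order by rand()
limit 500;

The count subquery is uncorrelated, so MySQL evaluates it once; the join is computed twice, but the sort now only sees roughly 1,000 rows instead of the whole result set.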

Here is an alternative strategy, building off the "create your own indexes" approach.

Create a temporary table using the following query:

create temporary table results as
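-- note: if table1 and table2 share a column name (both have one called "in"
-- here), select * will fail with a duplicate-column error when the temporary
-- table is created; list the columns explicitly in that case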
(Select *, @rn := @rn + 1 as rn
from table1 inner join
     table2
     on table1.in = table2.in cross join
     (select @rn := 0) const
where table1.T = A
);

You now have a row number column, and you can get the total number of rows with:

select @rn;

Then you can generate random row numbers in your application and fetch just those rows.
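A minimal sketch of that fetch, where the id list is a hypothetical placeholder for the 500 distinct random numbers between 1 and @rn that the application would generate:

select *
from results
where rn in (17, 342, 981);  -- hypothetical ids; substitute the 500 app-generated row numbers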

I would be inclined to keep the processing in the database, using these two queries:

create temporary table results as
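-- as noted above, replace select * with an explicit column list if
-- table1 and table2 share column names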
(Select *, @rn := @rn + 1 as rn, rand() as therand
from table1 inner join
     table2
     on table1.in = table2.in cross join
     (select @rn := 0) const
where table1.T = A
);

select *
from results
where therand < 1000/@rn
order by therand
limit 500;
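This works because, once the create statement finishes, @rn holds the total number of rows in results, so therand < 1000/@rn keeps each row with probability about 1000/@rn and the final sort only touches roughly 1,000 rows. When you are done with the sample you can drop the table explicitly (it also disappears when the session ends):

drop temporary table results;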
– Gordon Linoff
  • 1000.0/20000.0 where 20000 is the size of the superset? The size of the set is dynamic. Will count(*) work there? – Conrad Lewis May 16 '13 at 21:24
  • You estimated the overall size to be 20,000 so this should pull about 1,000 rows. – Gordon Linoff May 16 '13 at 21:25
  • so the 20,002nd row will never be selected? – Conrad Lewis May 16 '13 at 21:27
  • @ConradLewis . . . Not at all. It calculates a random number for each row and picks a row with a probability of about 5%. This should be about 1,000 rows. It then chooses 500 randomly from them. The purpose is to reduce the size of the sort by a factor of 20. – Gordon Linoff May 16 '13 at 21:29
  • This has a chance of returning fewer than 500 rows. – Bill Karwin May 16 '13 at 21:34
  • @BillKarwin . . . If you are familiar with statistics, you would realize that the chance of returning fewer than 500 rows when randomly selecting 1000 out of 20000 is so small as to be negligible, assuming the estimate of the number of rows is true. – Gordon Linoff May 16 '13 at 21:35
  • Yes, I admit it's a small chance. :-) – Bill Karwin May 16 '13 at 21:36
  • The problem here is that the superset is completely dynamic, depending on the query parameters. The max superset size is 20,000, but the set size could be 30, and this method would return no results. – Conrad Lewis May 16 '13 at 21:37
  • @ConradLewis . . . Can you modify your question to explain the relationship between the tables? – Gordon Linoff May 16 '13 at 21:42
  • The info in table2 belongs to a row in table1; it's a one-to-many relationship. – Conrad Lewis May 16 '13 at 21:47
  • I ran your first suggestion with 20,000 replaced by "select count(*) ...." and it worked nicely. – Conrad Lewis May 16 '13 at 22:13

A good way is to do it at the application level, in 2 steps:

  1. Get the row count of your dataset and "extract" 2 random numbers between 0 and that count.
  2. Use the numbers from (1) as the offset for your LIMIT (see the sketch below).
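A minimal sketch of this idea, using the question's placeholder tables and simulating the application's random pick with user variables; LIMIT cannot take user variables directly, so the final query goes through a prepared statement:

-- step 1: get the row count
select count(*) into @cnt
from table1 inner join
     table2
     on table1.in = table2.in
where table1.T = A;

-- step 2: pick a random starting offset, leaving room for 500 rows
set @off = cast(floor(rand() * greatest(@cnt - 500, 1)) as unsigned);

-- step 3: fetch a contiguous block of up to 500 rows at that offset
prepare stmt from
  'select *
   from table1 inner join
        table2
        on table1.in = table2.in
   where table1.T = A
   limit ?, 500';
execute stmt using @off;
deallocate prepare stmt;

As the comments note, a contiguous block starting at a random offset is cheap, but it is not a uniformly random sample.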

Try it and measure if performance is acceptable for you.

– Cratylus
  • This only returns one random row. The OP wants up to 500 random rows. – Bill Karwin May 16 '13 at 22:02
  • Well, you can generate 2 random numbers or 500 random numbers, that's no problem. As long as you can generate 500 *distinct* random numbers without choosing the same number more than once. But then using the offset technique you would have to run 500 separate queries (in addition to the first query to get the count). – Bill Karwin May 17 '13 at 17:48
  • @BillKarwin: I get what you are saying. My suggestion was to generate 2 random numbers and use them as a range. For a small dataset it would be nonsense, for the reason you correctly state. For a fairly large dataset we can get random ranges of numbers that could repeat. That *might* be acceptable. – Cratylus May 17 '13 at 18:23
  • Yes, all the solutions for selecting random rows from a database have to make some compromise in how truly random the result is. – Bill Karwin May 17 '13 at 18:36