MySQL: Fetching Random Rows
Scenario
Sometimes you need to fetch a specified number of random rows from the database, and this turns out to be surprisingly troublesome.
Suppose there is a table like this:
```sql
CREATE TABLE topic (
    id      INT PRIMARY KEY NOT NULL COMMENT 'number',
    content VARCHAR(20)     NOT NULL COMMENT 'content'
) COMMENT 'topic table';
```
The `topic` table here has two key features:

- the primary key is comparable (`int`)
- the primary keys as a whole are monotonic (auto-increment/decrement)
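The timings below were measured on roughly 100,000 rows. A minimal sketch for generating comparable test data might look like the following (this is my own addition, assuming MySQL 8.0+ for recursive CTEs; the row count and `content` values are just placeholders):

```sql
-- Sketch only: fill topic with ~100,000 sequential test rows.
-- Assumes MySQL 8.0+ (recursive CTEs); older versions can use a stored-procedure loop.
SET SESSION cte_max_recursion_depth = 100000;

INSERT INTO topic (id, content)
WITH RECURSIVE seq (n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM seq WHERE n < 100000
)
SELECT n, CONCAT('topic-', n) FROM seq;
```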
Solution 1: Use `order by rand()` directly
You can get random data directly with `order by rand()`, and every row has a chance of being returned (the order is still random).

- Sort by the result of `rand()`

  > This step is equivalent to appending a column of `rand()` values to every row and then sorting on that column.

- Limit the number of rows returned
```sql
SELECT *
FROM topic
ORDER BY RAND()
LIMIT 50000;
```
But the disadvantage is obvious: speed. The `rand()` values are not indexed, so the sort is very slow. Randomly fetching 50,000 rows out of 100,000 typically takes around 6 s 378 ms, which is far too long.
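You can see where the time goes with `EXPLAIN`. The exact output depends on the MySQL version and data, so treat the comment below as illustrative rather than a captured plan:

```sql
-- Illustrative: the Extra column typically reports "Using temporary; Using filesort",
-- i.e. every row gets a rand() value and the whole set is sorted without an index.
EXPLAIN SELECT * FROM topic ORDER BY RAND() LIMIT 50000;
```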
In fact, `order by rand()` looks odd, but it is actually equivalent to:
```sql
SELECT *
FROM (
    SELECT
        topic.*,
        RAND() AS order_column
    FROM topic
) AS temp
ORDER BY order_column
LIMIT 50000;
```
Solution 2: Use `where` with a random threshold
Since sorting on an unindexed `rand()` column is so time-consuming, we can try to sidestep the sort entirely. The idea is:

- take a random value between the minimum and maximum `id`
- keep rows whose `id` is greater than (or less than) this random value
- limit the number of rows returned
```sql
SELECT *
FROM topic
WHERE id >= ((SELECT MAX(id) FROM topic)
           - (SELECT MIN(id) FROM topic))
          * RAND()
          + (SELECT MIN(id) FROM topic)
LIMIT 50000;
```
This approach is extremely fast (150 ms), but it is affected by the density of the data: if the ids are not evenly distributed, the number of rows you can get back is limited.

So here are the defects of this method.
**The amount of data obtained is affected by the distribution density**

For example, suppose the ids are distributed like this:

1, 100002, 100003, 100004 ... 199999, 200000

Then the code above will only return a small amount of data (roughly 25,000 rows). However, if you flip the comparison operator from `>=` to `<=`, the average number of rows you can get increases greatly (roughly 75,000):
```sql
-- Note: only the comparison operator differs from the query above (>= changed to <=)
SELECT *
FROM topic
WHERE id <= ((SELECT MAX(id) FROM topic)
           - (SELECT MIN(id) FROM topic))
          * RAND()
          + (SELECT MIN(id) FROM topic)
LIMIT 50000;
```
**The probability of each row being selected is not the same**

Although the rows obtained are random, each row does not appear with equal probability. For example, with `<=` the very first row shows up almost every time, because its probability of matching is far too high and the table is scanned starting from the first row. Even if the operator is changed to `>=`, the first row returned generally still has a small id. With `>=`:

- the earlier a row sits in the table, the lower its probability of being selected
- but even a very low probability still gives the rows at the top a chance, so the first row returned usually has a small id
- when the id distribution is very uneven, the number of rows obtained can be very small

The closer the density is to uniform, the closer the average number of rows that can be obtained gets to 1/2 of the table; otherwise it deviates further (it may be either too many or too few).
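Before relying on this method, it may help to check how dense the ids actually are. A rough diagnostic (my own addition, not part of the measurements above) is to compare the row count with the id span:

```sql
-- If row_count is close to id_span the ids are dense and the where-based method
-- behaves well; a large gap between them means skewed sample sizes.
SELECT COUNT(*)                           AS row_count,
       MAX(id) - MIN(id) + 1              AS id_span,
       COUNT(*) / (MAX(id) - MIN(id) + 1) AS density
FROM topic;
```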
Solution 3: Use a temporary table

Solution 2 focuses on avoiding an unindexed sort on `rand()`. Here we take the opposite approach and make the added `rand()` column sortable through an index: create a temporary table containing only the primary key `id` and an indexed column `randomId` to sort on, then join back to the original table to fetch the shuffled rows once the sort is done.
```sql
DROP TEMPORARY TABLE IF EXISTS temp_topic;

CREATE TEMPORARY TABLE temp_topic (
    id       BIGINT PRIMARY KEY NOT NULL,
    randomId DOUBLE NOT NULL,
    INDEX (randomId)
)
AS
SELECT
    id,
    RAND() AS randomId
FROM topic;

SELECT t.*
FROM topic t
JOIN (
    SELECT id
    FROM (
        SELECT id
        FROM temp_topic
        ORDER BY randomId
    ) AS sorted
    LIMIT 50000
) AS temp ON t.id = temp.id;
```
This method is not especially fast (878 ms, compared with Solution 2), and the cost still grows with the amount of data (because the ids have to be copied). But, like Solution 1, it gives a truly random sample.
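If the session keeps running afterwards, the temporary table can be dropped explicitly once the sample has been fetched; re-running the `CREATE ... AS SELECT` step produces a fresh shuffle. This cleanup step is my own addition:

```sql
-- Optional cleanup; temporary tables are session-scoped and disappear
-- automatically when the connection closes.
DROP TEMPORARY TABLE IF EXISTS temp_topic;
```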
Summary
Here is a good English article that analyzes fetching random rows: http://jan.kneschke.de/projects/mysql/order-by-rand/ (some of its techniques did not work here, for reasons unknown).
| Differences | `order by rand()` | `where` | `temporary` |
| --- | --- | --- | --- |
| Can return every row at random | Yes | Almost impossible | Yes |
| Speed | Slow | Very fast | Fairly fast |
| Requires a comparable primary key type | No | Yes | No |
| Affected by data distribution density | No | Yes | No |
| Speed affected by table size | Very much | Very little | Slightly |