0

MySQL Get random bar data

Scenes

There is a need to randomly fetch a specified amount of data from the database, but this problem is surprisingly troublesome.

Suppose there is a data table

sql Create table topic (   Id int primary key not null   Comment 'number',   Content varchar(20) not null   Comment 'content' )   Comment 'topic table';

The topic table here has two key features - Primary key can be compared (int) - There is a trend in the overall primary key (self-increase/decrease)

Solution 1: Directly use order by rand()

You can get random data directly by using order by rand(), and you can get all the data (the order is still random).

  1. According to the result of rand()     > This step is equivalent to adding a column of data generated by the rand() function to each data, and then sorting the column
  2. Limit the number of queries

sql     Select *     From topic     Order by rand()     Limit 50000;

But the disadvantage is obvious, speed is a problem, because the data of rand() is not indexed, so it will cause the sorting speed to be very slow.

Randomly fetching 5w data in 10w data, which often takes 6 s 378 ms, this time is really too long.

In fact, order by rand() looks strange, actually equivalent to:

sql     Select *     From           Select             Topic.*,             Rand() as order_column           From topic         ) as temp     Order by order_column     Limit 50000;

Solution 2: Use where to take the middle random value

Since the ordering caused by order by rand() without indexing is too time consuming, we can try to get around this problem.

The following solution is like this

  1. Take a random value between the minimum and maximum values
  2. Determine if the id is greater than (or less than) this random value
  3. Limit the number of queries

sql Select * From topic Where id >= ((select max(id)               From topic)             - (select min(id)                 From topic))             * rand()             + (select min(id)               From topic) Limit 50000;

This method is extremely fast (150 ms), but it is affected by the density of the data. If the data is not average, the total number of data queries will be limited.

So, here's the defect of the method

  • The acquired data is affected by the distribution density

      For example, the data distribution is as follows

      1,100002,100003,100004...199999,200000

      Then using the above code will only get a small amount of data (about 2.5w or so). However, if you change the symbol slightly, change >= to <=, then the average number that can be obtained will be greatly increased (about 7.5w).

          

    The code formatting here has been in error and I can't solve it. . .

    Select *       From topic       # Note: The symbols here have been modified.       Where id >= ((select max(id)                     From topic)                   - (select min(id)                       From topic))                   * rand()                   + (select min(id)                     From topic)       Limit 50000;

  • The probability of each piece of data is not exactly the same   Although all the data obtained is random, the probability of each is not the same. For example, when <=, there will always be a phenomenon of the first one. The reason is because the probability of the first ** is too big, because the data retrieval rule when querying the data table is from the first One is the beginning! Even if it is modified to >=, the first piece of data obtained is generally too small.   Use the result of >=   - The more data is in front, the lower the probability of getting it   - But even with very low probability, there is always a chance at the top, so the first one is generally small   - When the data density is too large, the number obtained will be very small

The density tends to average, and the average of the maximum number of random data obtained is closer to 1/2, otherwise it will deviate more (not necessarily too large or too small).

Solution 3: Using the temporary table temporary

Solution 2 Focus on avoiding sorting with rand() without indexing, but think about another solution here, sorting with the added rand() after indexing. Create a temporary table containing only the primary key id and the index column randomId that needs to be sorted, and then get the out-of-order data after the sorting is completed.

sql     Drop temporary table if exists temp_topic;     Create temporary table temp_topic (       Id bigint primary key not null,       randomId double not null,       Index (randomId)     )       As         Select           Id,           Rand() as randomId         From topic;     Select t.*     From topic t       Join (             Select id             From                     Select id                     From temp_topic                     Order by randomId                   ) as temp             Limit 50000           ) as temp         On t.id = temp.id;

The query speed of this method is not very fast (878 ms, compared to the second), and it is still positively related to the amount of data (because the data is to be copied). But with the first one, it is also true random acquisition.

to sum up

Here is a good English article that analyzes random access data: http://jan.kneschke.de/projects/mysql/order-by-rand/, some of which are not valid here, why unknown. . .

| Differences | order by rand() | where | temporary | | -------------------------------------------- | ----------------- | ----------------- | ----------- | | Can get all at random | Yes | Almost impossible | Can | | Speed ​​ | Slow | Very fast | Faster | | Need a comparable primary key type | No | Yes | No | | Affected by data distribution density | No | Yes | No | | Speed ​​is affected by table data complexity | Very large | Very small | Small |

rxliuli
  • 165
  • 1
  • 11
  • Your result will only be as you are expecting if your ids are gapless and sequential (or equidistant) (if your ids are e.g. 1, 100000..100998, you will get count = 999 for >99,9% of your runs). Check for that (e.g. if "min + 1000 = max - 1"). Also, a warning: your two approaches are completely different. The 2nd query ("before") will give you a random subset of your users, while the 1st code will just specify where you start the list, e.g. it will in 100% of cases include the max-id user and basically never the min-id user. I am not sure if that new behaviour is what you intended. – Solarflare Nov 10 '18 at 09:48
  • @Solarflare Yes, I already know the reason for this problem is the uneven distribution of data density. In addition, both methods are for random access to data. But both methods are flawed, is there a better way? – rxliuli Nov 10 '18 at 14:49
  • If you already know the reason for the uneven distribution, I am not entirely sure what your question is about, it sounded like you wanted to know why you don't get values < 500. If not, you need to clarify that. And again: both codes do completely different things. Asking for a better way (to do what!?) is like asking "Oranges and apples are flawed. Is there a better fruit?" The answer might be different if you want to make an apple pie, orange juice or banana bread. So you would need to describe what you want to do (exactly) in order for us to suggest something (different). – Solarflare Nov 10 '18 at 16:14
  • Does it work any faster if you do `select * from topic where id in (select id from topic order by rand() limit 1000)`? BTW you're selecting 50% of rows... is that correct? – Salman A Nov 12 '18 at 14:01
  • First of all, sql `select * from topic where id in (select id from topic order by rand() limit 1000)` can't run, and the speed of using `order by rand()` will be very slow, slow to doubt life! Then, I don't want to have only 50% of the data, but I still can't find a particularly good solution. . . – rxliuli Nov 13 '18 at 19:45
  • I just added a third solution, but I still can't meet the ideal requirements. . .┐( ̄ー ̄)┌ – rxliuli Nov 13 '18 at 19:46
  • Do you need to get 50,000 random rows all at once? Can you paginate? Do they have to be unique? – Schwern Nov 13 '18 at 19:52
  • Possible duplicate of [MySQL: Alternatives to ORDER BY RAND()](https://stackoverflow.com/questions/1823306/mysql-alternatives-to-order-by-rand) – Schwern Nov 13 '18 at 19:53
  • [This answer] (https://stackoverflow.com/a/36013954/8409380) It looks great, but after the `where rand()` is getting faster, I don't know what's going on inside. . . Is there any information to view? – rxliuli Nov 13 '18 at 20:16

0 Answers0