SQL Random rows in a big table (with where clause)

Question

I have a website where people can vote on cars. 4 cars are showed to the user and he/she can vote on the car they like most.

The table cars has the important columns:

car_id   int(10) (not auto_increment, so has gaps)
views    int(7)
points   int(7)
car_type int(1) (value = 1, 2 or 3)

At the moment I use a mapping table for all car_types, which has a PK which has no gaps. I select the max ID of the mapping table and create 4 random numbers (PHP), select those rows from mapping and get the corresponding car_id's. These numbers I use to select the cars from cars table.

The problem is that cars added later to the database have less chance to get on the same points as cars added earlier.

My question is how to show 4 cars which have the same amount of points(random) ordered by the least number of views (views asc). Also the important notes:

The select should only query cars with at least 1 point.
The database will contain more than 30M cars, it's not about cars but for the question I think it's easier :).
When 70% of the cars have 1 points, 20% have 2 points and 10% have 3 points, than the random points should select cars 70% of the time with 1 point 20% with 2 points and 10% with 3 points.
The query will be used to show 4 cars to the visitor, we all know users are impatient so the faster the query the better :)
I could (if needed) use a mapping table, which will have no gaps in the PK (as I have now).
Cars only within a certain car_type will be showed. Example, 4 randoms of car type 2 (which is family cars), since I don't want to show sport cars and family cars at same time.

If you know another solution to solve the problem above, I'm willing to accept all kind of solutions (PHP/SQL).

Bounty because it's a bigger question (/answer) than the average Stackoverflow question. Bounty will be rewarded to person which describes the solution or (preferred) the code of the solution. Anyways it's my way to thank the persons helping me and to ensure I appreciate your help greatly.

UPDATE:

Thanks for all the answers so far! You're all right in your answers. I did think a lot about it last few hours and I started to realize that databases were actually never build for things like this (showing random data), it was created to show precise and accurate data with fast access. That's why selects on PK's with 30M rows or more are still very fast. Thats why I'm thinking about doing all the random stuff in PHP. So I generate in PHP like 40 random numbers and select those 40 rows from the mapping table of the right car type. This select with IN is really fast (like 0.0006 seconds). After this select I got 40 car_ids which I also select with IN from the cars table. I loop the cars and put them in an array and do some custom sorts (based on points and views). After this I select a random number from all the points within the 40 cars and grab the cars from the array closest to this number of points and with the least views. This way PHP takes care of the randomness and the views part and the queries because you ask for precise data are very fast (0.0006 seconds each).

So what's the question? How to efficiently query a 30m row table in order to obtain 4 random records? — N.B., Jun 12 '13 at 12:56
@N.B. Yes and no :) I know it's a big question and bounty(100) will be set within 1.5 hours. — Kevin Vermaat, Jun 12 '13 at 12:58
Well, it's not really clear what you want. A method for optimal select, an advice about the design, a query that distributes the points, advice about efficient querying the table etc.. you didn't really define what is it that troubles you or if you actually have a problem or not. Try to clarify that before opening a bounty. — N.B., Jun 12 '13 at 13:01
@N.B. Thanks a lot for the feedback, yeah exactly! that's why I edit, but unfortunately English is not my native language, but I will rewrite it now :) — Kevin Vermaat, Jun 12 '13 at 13:03
Show us some sample data and expected results so I could write an unit test. — takeshin, Jun 12 '13 at 18:09
@takeshin Ideally I would end up with 4 cars in my PHP code to print to the user. Those rows will have same points (or almost same) and have less views than other rows with those points. — Kevin Vermaat, Jun 12 '13 at 18:13
I think the concept you're looking for is called 'decay' (in the context of popularity algorithms). Have you looked into that at all? — Strawberry, Jun 12 '13 at 18:34

score 4 · Accepted Answer · edited Jun 20 '20 at 09:12

I'd love to give a specific answer, but I'd need help to understand your thought process...

You start by writing:

I have a website where people can vote on (...) the car they like most.

The problem is that cars added later to the database have less chance to get on the same points as cars added earlier.

But you then go on on writing:

When 70% of the cars have 1 points, 20% have 2 points and 10% have 3 points, than the random points should select cars 70% of the time with 1 point 20% with 2 points and 10% with 3 points.

To me, the latter spec makes little sense in light of the first remark.

Imho, what you really want is for users to have the same amount of opportunities to vote on each car. Or more precisely, to vote for each car compared to each other car.

If you assume that the (car) variables are independent, then you need to keep a count of how many times a choice came up, rather than how many times it got voted for, and adjust your decision process accordingly. It's a math problem, it's not so ugly, and it can then be translated into SQL for better or worse -- I'd venture that it'll probably be worse.

If you assume, like I do, that they're not independent, you also need to account for correlations -- and store how many times they came up with each other too. Because, well, there's an infinitely slim chance that you won't prefer this Mercedes rather than that Tata, that Xinkai or that AvtoVAZ. But given a choice between the same Mercedes, a BMW, a Porsche and a Ferrari, the decision might not be so clearcut.

In other words, your spec doesn't answer the problem as you presented it at all.

I'm currently begging to agree with the answer posted two hours ago: pick them really random, and you'll be satisfied without extra code...

As a side note, if your ids really have no gaps, generate four ids in php or whatever and fetch them using an in() statement. You won't get more efficient than that.

Thanks a lot for the answer, I understand that my specs are a little strange and that the points you stated above are against the means of each other. I did find out that I was thinking too difficult and the solution could be much easier (as you say). — Kevin Vermaat, Jun 12 '13 at 22:38
I ended up using you IN as you said since asking the database for precise data (where id=.. or where id IN ...) is very fast. Check my updated question and please let me now what you think :) the solution is mostly based on your answer and some others. — Kevin Vermaat, Jun 12 '13 at 23:11
Ended up using the IN function and creating the random ids in PHP with mapping table (gapless). So I accept your answer since you say exactly that in the side note :) — Kevin Vermaat, Jun 15 '13 at 13:06

score 2 · Answer 2 · answered Jun 12 '13 at 20:46

2

I am not sure if you can do that, but what about if you forget the 'random' and create a different formula to simulate that? My suggestion is to create one column lastViewed of date type in the cars table, so during the update of the views column, it will also update the lastViewed with the current date

Then the select query can be done of this way:

select * from cars where points=?, car_type=? order by views desc, lastViewed limit 4

This sql will always return 'random' results for the visitor based in the low views and the latest date it was viewed. The cool part of this solution is that it will give priority for the one that haven't been viewed for a while. So when insert a new car, the default value for lastViewed can be a date like from year 1900.

answered Jun 12 '13 at 20:46

fmodos

4,472
1
16
17

I understand the idea and it's definitely worth trying out :) – Kevin Vermaat Jun 12 '13 at 22:36
After some testing I find out limit is a bit too slow :/ check my update though, maybe thats the solution? – Kevin Vermaat Jun 12 '13 at 22:57
@KevinVermaat you mean that setting limit in the sql cause it to run slower? the only issue that I seee about using random is that it is not precise and might have some case where certain cars will have less visiblitty than others. – fmodos Jun 12 '13 at 23:04
@fmodos Mmm.. you're probably right, probably its the order by that causes the query to run slow. :/ I have however indexes on views and lastviewed and points. – Kevin Vermaat Jun 12 '13 at 23:06
Updated one more time with a more detailed description of the solution I got at the moment (based your answer and all the others), what do you think? – Kevin Vermaat Jun 12 '13 at 23:09
@KevinVermaat I see that, I just think you could work on a more precise solution.... as you only give priority to the views column after the 50 random rows are selected, this is very low based in the amount of data this table have – fmodos Jun 12 '13 at 23:15
@KevinVermaat please do one test and run the query with the order by: `order by lastViewed limit 100` and let me know the time to process it, my idea is to reduce the scope of the query in the mysql and then you can do a sort of the views in the php code... it will be faster as it will have only 100 records or maybe less – fmodos Jun 12 '13 at 23:19

score 1 · Answer 3 · answered Jun 10 '13 at 15:17

Yo could write a store procedure that does the follows:

(Don't take the syntax as correct as is mostly pseudocode)

First select the points:

SELECT @varpoints = points FROM cars ORDER BY RAND() LIMIT 1

This way we can obtain a random value of the points column.

Store that value in a var and do something like this (pseudocode):

WHILE (SELECT COUNT(car_id) FROM cars WHERE points = @varpoints ORDER BY views ASC) > 4
{
     SET @varpoints = @varpoints - 1;
}

Now retrieve just the SQL with the needed results:

SELECT car_id FROM cars WHERE points = @varpoints ORDER BY views ASC

And that should do the work.

This will take a random points value and do a query of cars with that value. If it doesn't get at least 4, it will substract 1 and try again. Some sort of try catch would be nice if exists a chance of having less than 4 cars of each points.

I'm affraid it's not exactly what I want. I would prefer to take in account the distribution of points, besides the query you say is almost what i have now which is too slow unfortunately :/ — Kevin Vermaat, Jun 10 '13 at 16:07

score 1 · Answer 4 · answered Jun 12 '13 at 17:57

So, it looks that your main problem is the speed. In that case you can do some pre-processing, let's say - to have a table which to use like a queue, containing groups of 4 cars, ready to be shown. Of course, that will discriminate last moment viewed/voted cars, but you can refresh this queue on a regular intervals.

When 70% of the cars have 1 points, 20% have 2 points and 10% have 3 points, than the random points should select cars 70% of the time with 1 point 20% with 2 points and 10% with 3 points.

In case you pick them really random, that would be satisfied without extra code.

score 1 · Answer 5 · answered Jun 12 '13 at 21:40

Make sure you have combined index on car_type and points columns. First, fetch the total count of rows that you're interested in:

$condition = "car_type=? AND points > 0";
$q_count = "SELECT count(*) FROM cars WHERE {$condition}";
$r_count = mysql_query($q_count);
$car_count = mysql_result($r_count, 0, 0);

$car_count holds the max integer we can generate to retrieve a car. We're going to use this in the random number generator. Now SELECT all rows you're interested in:

$q_cars = "SELECT car_id FROM cars WHERE {$condition}";
$r_cars = mysql_query($q_cars);
$car_ids = array();
for($i = 0; $i < 4; ++$i)
{
    $random_row = rand(0, $car_count);
    $car_ids[] = mysql_result($r_cars, $random_row, 0);
}

I'm assuming normal distribution from the random number generator. Not checking for empty table or duplicate records, because in 30M records the probability of that is very low. You may want to consider adjustments to your table schema. Table of 30M records is slow to query, ORDER BY RAND() is really bad as well as using LIMIT and OFFSET.

After a lot of thinking and trying I have to say you're right. I updated my question, please let me know what you think about the solution I use now (which is mainly based on yours) — Kevin Vermaat, Jun 12 '13 at 23:10

score 0 · Answer 6 · edited May 23 '17 at 12:08

I think you can do this in vanilla sql referencing a post from http://www.kahunaburger.com/2008/10/13/selecting-random-weighted-records-from-mysql/ which I traced via this link: MySQL: Select Random Entry, but Weight Towards Certain Entries

The difficult bit with your question is the probability and that post posed the same problem. The solution is given in the comments as being ORDER BY -LOG(1.0 - RAND()) / weighting.

SELECT * FROM cars
WHERE points >= 1
    AND car_type = ROUND((RAND() * 2) + 1)
ORDER BY -LOG(1.0 - RAND()) / 70
LIMIT 4;

Hope this helps.

score 0 · Answer 7 · answered Jun 14 '13 at 06:21

Oracle uses Rownum to represent a fake ID that has no gaps.

Select rownum, c.* from Cars where points > 1

will give you a resultset that has an ordered ID.

With that given hint, you can simulate the same thing with the following mysqlCode referred here

SELECT @rownum:=@rownum+1 rownum, c.* 
  FROM (SELECT @rownum:=0) r, Cars c where points > 1;

Now having the expected result set similar to as in oracle you can select any 4 rows by using subselects. Assuming you know the rowcount of cars that have more than one point.

select * from 
 (the above select)
where rownum in (Random[0], Random[1], Random[2], Random[3])

or any way you want to construct the query.

This won't move the whole data in between mysql and php, but it will put some pressure on mysql.

SQL Random rows in a big table (with where clause)

7 Answers7

Linked