
I have two tables:

The first table, users:

+-------------------------+---------+------+-----+---------+-------+
| Field                   | Type    | Null | Key | Default | Extra |
+-------------------------+---------+------+-----+---------+-------+
| id                      | int(11) | NO   | PRI | NULL    |       |
| first_name              | text    | NO   |     | NULL    |       |
| age                     | int(11) | YES  |     | NULL    |       |
| settings                | text    | YES  |     | NULL    |       |
+-------------------------+---------+------+-----+---------+-------+

The second table, proposals:

+---------+---------+------+-----+---------+----------------+
| Field   | Type    | Null | Key | Default | Extra          |
+---------+---------+------+-----+---------+----------------+
| id      | int(11) | NO   | PRI | NULL    | auto_increment |
| from_id | int(11) | NO   |     | NULL    |                |
| to_id   | int(11) | NO   |     | NULL    |                |
| status  | int(11) | NO   |     | NULL    |                |
+---------+---------+------+-----+---------+----------------+

I need to get one random row from users whose id does not appear in proposals.to_id.

I'm doing it (without RAND()) with this SQL:

SELECT DISTINCT *
FROM profiles
WHERE
    profiles.first_name IS NOT NULL
AND
NOT EXISTS (
    SELECT *
    FROM proposal
    WHERE
        proposal.to_id = profiles.id
)
LIMIT 0 , 1

Performance is fine without it: 1 row in set (0.00 sec)

But performance is very bad when I add ORDER BY RAND() at the end: 1 row in set (1.78 sec)

There are big holes in users.id, so I can't use something like MAX(id).

I've tried setting a random limit, for example:

...
LIMIT 1234 , 1;
Empty set (2.71 sec)

But that takes a lot of time too :(

How can I get one random user whose users.id doesn't exist in proposals.to_id, with good performance?

I think I need to get all profiles with RAND() first and then filter them, but I don't know how to do it.

mixalbl4
  • `RAND()` is hard to optimize. But [_here_](http://mysql.rjweb.org/doc.php/random) are some techniques. Most _other_ techniques do involve a table scan, which is the main killer. – Rick James Apr 21 '17 at 17:01

4 Answers


I've found two solutions to the problem.

1) With a random id, from https://stackoverflow.com/a/4329447/2051938

SELECT *
FROM profiles AS r1
JOIN
    (SELECT CEIL(RAND() *
                     (SELECT MAX(id)
                        FROM profiles)) AS id)
        AS r2
WHERE
    r1.id >= r2.id
    AND
    r1.first_name IS NOT NULL
AND
NOT EXISTS (
    SELECT *
    FROM proposal
    WHERE
        proposal.to_id = r1.id
)
LIMIT 0 , 1

2) With ORDER BY RAND()

SELECT *
FROM
    (
        SELECT *
        FROM profiles
        WHERE
            profiles.first_name IS NOT NULL
        ORDER BY RAND()
    ) AS users
WHERE
    NOT EXISTS (
        SELECT *
        FROM proposal
        WHERE
            proposal.to_id = users.id
    )
LIMIT 0 , 1

The first solution is faster, but it has a problem with the holes in id, and a random id near the end of the range may find no match (the users can run out before a match is found).

The second solution is slower, but it has none of those flaws!
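
If the first query comes up empty because the random id landed near MAX(id) and every remaining user already has a proposal, one possible workaround (an untested sketch using the same tables as above) is to pick the random id once and wrap around with a UNION:

SET @rand_id = (SELECT CEIL(RAND() * MAX(id)) FROM profiles);

-- Try ids at or above the random value first; if that range is exhausted,
-- fall back to the ids below it, so a row is still returned.
(SELECT r1.*
 FROM profiles AS r1
 WHERE r1.id >= @rand_id
   AND r1.first_name IS NOT NULL
   AND NOT EXISTS (SELECT 1 FROM proposal WHERE proposal.to_id = r1.id)
 ORDER BY r1.id
 LIMIT 1)
UNION ALL
(SELECT r1.*
 FROM profiles AS r1
 WHERE r1.id < @rand_id
   AND r1.first_name IS NOT NULL
   AND NOT EXISTS (SELECT 1 FROM proposal WHERE proposal.to_id = r1.id)
 ORDER BY r1.id DESC
 LIMIT 1)
LIMIT 1;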

mixalbl4

Have you tried switching the NOT EXISTS to a LEFT JOIN?

SELECT DISTINCT *
FROM   profiles t1
LEFT JOIN
       proposal t2
ON     t1.id = t2.to_id
WHERE  t1.first_name IS NOT NULL AND
       t2.to_id IS NULL
ORDER BY RAND()
LIMIT 0 , 1

This returns all rows of profiles; rows that are not matched by any row in proposal get NULL values in the proposal columns, and that is what you filter on.

The result should be the same, but the performance may be better.
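
One thing worth checking first (an assumption on my part, since the question doesn't show which indexes exist, and idx_proposal_to_id is just an illustrative name): both this LEFT JOIN and the original NOT EXISTS need an index on the join column to avoid scanning proposal once per profile.

-- Only if proposal.to_id is not already indexed.
CREATE INDEX idx_proposal_to_id ON proposal (to_id);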

Stefano Zanini

Since ORDER BY RAND() assigns a random number to every row in the result, performance is directly proportional to the number of records.

If you want to select only one (random) record, you can apply LIMIT <random number from 0 to record count - 1>, 1.

e.g.:

SELECT u.id, count(u.id) as `count`
FROM users u
WHERE
    first_name IS NOT NULL
AND
NOT EXISTS (
    SELECT *
    FROM proposal
    WHERE
        proposal.to_id = u.id
)
LIMIT RAND(0, count-1) , 1

I haven't tried executing this query; MySQL will complain about using count in LIMIT (LIMIT only accepts literal values, or a bound parameter in a prepared statement), so you can calculate the count separately and substitute the value into this query.
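
For example, the "calculate the count separately" step can also be done entirely in SQL with a prepared statement, since LIMIT accepts a bound parameter there. This is an untested sketch (pick_random is just an illustrative statement name):

-- Count the qualifying users once, pick a random offset, and bind it.
SET @cnt = (SELECT COUNT(*)
            FROM users u
            WHERE u.first_name IS NOT NULL
              AND NOT EXISTS (SELECT 1 FROM proposal WHERE proposal.to_id = u.id));
SET @offset = CAST(FLOOR(RAND() * @cnt) AS UNSIGNED);

PREPARE pick_random FROM
  'SELECT u.*
   FROM users u
   WHERE u.first_name IS NOT NULL
     AND NOT EXISTS (SELECT 1 FROM proposal WHERE proposal.to_id = u.id)
   LIMIT ?, 1';
EXECUTE pick_random USING @offset;
DEALLOCATE PREPARE pick_random;

Note that a large offset still makes MySQL walk past that many rows, so this saves the cost of sorting, not the cost of scanning.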

Darshan Mehta

First, I don't think the SELECT DISTINCT is necessary, so try removing that:

SELECT p.*
FROM profiles p
WHERE p.first_name IS NOT NULL AND
      NOT EXISTS (SELECT 1
                  FROM proposal pr
                  WHERE pr.to_id = p.id
                 )
ORDER BY rand()
LIMIT 0 , 1;

That might help a bit. Then, a relatively easy way to reduce the time spent is to reduce the data volume. If you know you will always have thousands of rows that meet the conditions, then you can do:

SELECT p.*
FROM profiles p
WHERE p.first_name IS NOT NULL AND
      NOT EXISTS (SELECT 1
                  FROM proposal pr
                  WHERE pr.to_id = p.id
                 ) AND
      rand() < 0.01
ORDER BY rand()
LIMIT 0, 1;

The trick is to find a comparison value that still guarantees at least one row. That is tricky because the size of the filtered set depends on the other table. Here is one method that uses a subquery:

SELECT p.*
FROM (SELECT p.*, (@rn := @rn + 1) as rn
      FROM profiles p CROSS JOIN
           (SELECT @rn := 0) params
      WHERE p.first_name IS NOT NULL AND
            NOT EXISTS (SELECT 1
                        FROM proposal pr
                        WHERE pr.to_id = p.id
                       ) 
     ) p
WHERE rand() < 100 / @rn
ORDER BY rand()
LIMIT 0, 1;

This uses a variable to count the qualifying rows and then keeps roughly 100 of them at random for the final sort. With about 100 rows kept at random, the likelihood that at least one survives is extremely high.

The downside to this approach is that the subquery needs to be materialized, which adds to the cost of the query. It is, however, cheaper than a sort on the full data.
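
If running one extra statement is acceptable, the same idea can be split in two so that the sampling fraction is a known value before the main query runs. This is a sketch along the same lines, not something I've timed:

SET @rn = (SELECT COUNT(*)
           FROM profiles p
           WHERE p.first_name IS NOT NULL
             AND NOT EXISTS (SELECT 1 FROM proposal pr WHERE pr.to_id = p.id));

-- Keep roughly 100 random rows out of @rn, then sort only those.
SELECT p.*
FROM profiles p
WHERE p.first_name IS NOT NULL AND
      NOT EXISTS (SELECT 1
                  FROM proposal pr
                  WHERE pr.to_id = p.id
                 ) AND
      rand() < 100 / @rn
ORDER BY rand()
LIMIT 0, 1;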

Gordon Linoff
  • it takes `1 row in set (1.06 sec)` on a 1,000-user database. That's very slow, because I need to work with a 10,000,000-user DB :( – mixalbl4 Apr 20 '17 at 10:59