How to use SELECT DISTINCT with RANDOM() function in PostgreSQL?

Question

I am trying to run a SQL query to get four random items. As the table product_filter has more than one touple in product i have to use DISTINCT in SELECT, so i get this error:

for SELECT DISTINCT, ORDER BY expressions must appear in select list

But if i put RANDOM() in my SELECT it will avoid the DISTINCT result.

Someone know how to use DISTINCT with the RANDOM() function? Below is my problematic query.

SELECT DISTINCT
    p.id, 
    p.title
FROM
    product_filter pf
    JOIN product p ON pf.cod_product = p.cod
    JOIN filters f ON pf.cod_filter = f.cod
WHERE
    p.visible = TRUE
LIMIT 4
ORDER BY RANDOM();

Why do you have to join to `product_filter`? Doesn't seem like you use it anywhere in your query. — Eric Petroelje, Jul 09 '12 at 18:53
@EricPetroelje, I've reduced the query, it is more complex, but basically i am doing other joins using the table `product_filter`. I've made an edit in the question to ilustrate it better. — Marcio Mazzucato, Jul 09 '12 at 18:58

score 13 · Answer 1 · answered Jul 09 '12 at 19:39

13

You either do a subquery

SELECT * FROM (
    SELECT DISTINCT p.cod, p.title ... JOIN... WHERE
    ) ORDER BY RANDOM() LIMIT 4;

or you try GROUPing for those same fields:

SELECT p.cod, p.title, MIN(RANDOM()) AS o FROM ... JOIN ...
    WHERE ... GROUP BY p.cod, p.title ORDER BY o LIMIT 4;

Which of the two expressions will evaluate faster depends on table structure and indexing; with proper indexing on cod and title, the subquery version will run faster (cod and title will be taken from index cardinality information, and cod is the only key needed for the JOIN, so if you index by title, cod and visible (used in the WHERE), it is likely that the physical table will not even be accessed at all.

I am not so sure whether this would happen with the second expression too.

answered Jul 09 '12 at 19:39

LSerni

55,617
10
65
107

The solution using GROUP BY is simple, i like it. But i didn't understand why you used the `MIN()` function. – Marcio Mazzucato Jul 09 '12 at 19:43
Oh, that's because any non-GROUP BY columns must be aggregate functions. RANDOM() isn't, but MIN(RANDOM()) is, and of course the min of a random distribution is still random. – LSerni Jul 09 '12 at 20:10
Really like the GROUP BY solution. Easy to implement with ActiveRecord. – littleforest Oct 14 '14 at 21:29

Erwin Brandstetter · Accepted Answer · 2023-05-13T03:32:29.810

4

Simplify the query to avoid the problem a priori:

SELECT p.cod, p.title
FROM   product p
WHERE  p.visible
AND    EXISTS (
    SELECT 1
    FROM   product_filter pf
    JOIN   filters f ON f.cod = pf.cod_filter
    WHERE  pf.cod_product = p.cod
    )
ORDER  BY random()
LIMIT  4;

You have only columns from table product in the result, other tables are only checked for existence of a matching row. For a case like this an EXISTS expression is likely fastest and simplest. It does not multiply rows from the base table product, so you don't need to remove them again with DISTINCT.

LIMIT has to come last, after ORDER BY.

I simplified WHERE p.visible = 't' to just WHERE p.visible, same thing, only cheoper for a boolean column.

edited May 13 '23 at 03:32

answered Jul 09 '12 at 20:48

Erwin Brandstetter

605,456
145
1,078
1,228

Interesting solution! I am not familiar with `EXISTS`, but seems to be a optimized option. Please tell me just one thing, `SELECT 1` works like a boolean result? – Marcio Mazzucato Jul 09 '12 at 21:08
@MarcioSimao: `SELECT 1` is mostly irrelevant here. You can have `SELECT *` or even `SELECT NULL` in its place. [More in this related question](http://stackoverflow.com/q/7710153/939860). – Erwin Brandstetter Jul 09 '12 at 21:15

score 2 · Answer 3 · edited Mar 17 '19 at 06:04

2

Use a subquery. Don't forget the table alias, t. LIMIT comes after ORDER BY.

    SELECT *
    FROM (SELECT DISTINCT a, b, c
          FROM datatable WHERE a = 'hello'
         ) t
    ORDER BY random()
    LIMIT 10;

edited Mar 17 '19 at 06:04

MarredCheese

17,541
8
92
91

answered Mar 17 '19 at 04:39

embulldogs99

840
9
9

score 0 · Answer 4 · answered Jul 09 '12 at 18:56

I think you need a subquery:

select *
from (select DISTINCT p.cod, p.title
      from product_filter pf  join
           product p
           on pf.cod_product = p.cod
      where p.visible = 't'
     ) t
LIMIT 4
order by RANDOM()

Calculate the distinct values first, and then do the limit.

Do note, this does have performance implications, because this query does a distinct on everything before selecting what you want. Whether this matters depends on the size of your table and how you are using the query.

score 0 · Answer 5 · edited Jul 09 '12 at 19:28

0

SELECT DISTINCT U.* FROM
(

    SELECT p.cod, p.title FROM product__filter pf

      JOIN product p on pf.cod_product = p.cod
      JOIN filters f on pf.cod_filter = f.cod

    WHERE p.visible = 't' 

    ORDER BY RANDOM()

) AS U

LIMIT 4

This does the RANDOM first then the LIMIT afterwards.

edited Jul 09 '12 at 19:28

Marcio Mazzucato

8,841
9
64
79

answered Jul 09 '12 at 18:57

Holger Brandt

4,324
1
20
35

I would be interested in hearing from those more knowledgeable about Postgres. However, I don't think this will return random rows. I would guess that the "distinct" will sort the table and the limit will return the first four rows based on the sort sequence. – Gordon Linoff Jul 09 '12 at 19:15
@GordonLinoff, I agree. I think your solution keeps the random rows, while this solution can compromise the random rows because the `RANDOM()` function is called with repetead values and it depends of how DISTINCT will sort the rows. But anyway, many thanks for your answer @HolgerBrandt, it helps me a lot! – Marcio Mazzucato Jul 09 '12 at 19:35

How to use SELECT DISTINCT with RANDOM() function in PostgreSQL?

5 Answers5