
I already know that sorting by random() is the worst way of retrieving a random row. I've implemented the solution of adding a random_number column and using that column when retrieving a random row, then updating random_number on each retrieval. All this is used to support a service that returns a random proxy IP:

select proxy_ip from proxy where random_number > 0.63 limit 1

0.63 is just an example of a random number generated inside the application.

The thing is that when I use the "worst" solution:

select proxy_ip from proxy
order by random()
limit 1

It appears to run faster when the service is called. The table contains 9300 rows, so my question is: how many rows must a table contain before sorting by random() becomes the worst solution?

There is some overhead introduced in the application: it doesn't work directly with the db, instead it goes through a data layer that in turn runs the queries. That partly explains why the "better" solution runs slower (besides, it executes one more query against the db, not just one, because it needs to update random_number).
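Roughly, each request with the random_number approach costs two round-trips; the second statement below is just a sketch of what the data layer does (only the select appears in my code above):

-- first query: fetch a row using the number generated in the application (e.g. 0.63)
select id, proxy_ip from proxy where random_number > 0.63 limit 1;

-- second query: assign a fresh random_number to the row that was just served
update proxy set random_number = random() where id = $1;  -- $1 = id returned above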

Results of EXPLAIN ANALYZE:

SORT BY RANDOM()

Limit  (cost=837.03..837.03 rows=1 width=18) (actual time=34.954..34.956 rows=1 loops=1)
  ->  Sort  (cost=837.03..860.46 rows=9373 width=18) (actual time=34.950..34.950 rows=1 loops=1)
        Sort Key: (random())
        Sort Method: top-N heapsort  Memory: 25kB
        ->  Seq Scan on proxy  (cost=0.00..790.16 rows=9373 width=18) (actual time=0.013..17.951 rows=9363 loops=1)
Total runtime: 34.993 ms

Using random column:

Limit  (cost=0.00..0.23 rows=1 width=18) (actual time=0.038..0.045 rows=1 loops=1)
  ->  Seq Scan on proxy  (cost=0.00..790.16 rows=3481 width=18) (actual time=0.029..0.029 rows=1 loops=1)
        Filter: (random_number > 0.63::double precision)
Total runtime: 0.078 ms

The table has 1 index:

CREATE UNIQUE INDEX proxy_pkey ON proxy USING btree (id)
Sergio Ayestarán
  • The explain shows that using the "random" column has the best performance; that's why I think the data layer overhead is too much: SORT BY RANDOM() "Total runtime: 34.993 ms" vs. using the random column "Total runtime: 0.078 ms". 34 ms is acceptable right now, but as the table grows I believe it could turn into an unacceptable 500 ms, for instance. – Sergio Ayestarán Dec 12 '12 at 18:30
  • Done, mu is too short: I added the EXPLAIN ANALYZE results to the question, thank you! – Sergio Ayestarán Dec 12 '12 at 18:48
  • The SORT BY RANDOM() one seems to be taking quite a bit longer than the "random column" version (~35ms vs ~0.01ms), that's not surprising as sorting isn't free. – mu is too short Dec 12 '12 at 19:10

2 Answers


See this post: How to retrieve randomized data rows from a postgreSQL table?

It links to a really bright Postgres guy's (Depesz) site with truckloads of great information: Depesz, MY THOUGHTS ON GETTING RANDOM ROW.

With this information, try a few different approaches and see what works best.
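For example, one option you'll often see suggested (just a sketch to benchmark against the others, not necessarily from the linked post; the skipped rows still get scanned, so it isn't free either) is a random OFFSET:

-- jump to a random position in the table and take one row
SELECT proxy_ip
FROM proxy
LIMIT 1
OFFSET floor(random() * (SELECT count(*) FROM proxy))::int;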

Kuberchaun

Several thoughts...

  1. The answer to your question will be very hardware and implementation-specific. 9300 rows isn't very many on modern hardware. Chances are your entire table is stored in memory after the first read. So subsequent ORDER BY RANDOM() queries will be very quick.

  2. You're also hurting the performance of your random-number column by not indexing that column, which means that you're still having to essentially read the entire table to avoid... having to read the entire table.

    So add an index to your random_number column and see how that helps (a minimal CREATE INDEX sketch follows after this list).

  3. You may also be able to reduce the number of round-trips necessary by doing your update and select simultaneously with something like:

    UPDATE proxy
    SET random_number = RANDOM()
    FROM (
        SELECT id
        FROM proxy
        ORDER BY random_number
        LIMIT 1
    ) AS r
    WHERE proxy.id = r.id
    RETURNING proxy.*
    
  4. You're not truly randomizing your proxy servers this way. Suppose you had 5 servers, A-E, and they were initially assigned random_numbers of 1-5:

    A: 1
    B: 2
    C: 3
    D: 4
    E: 5
    

    On your first run, you'll select server A, with a random_number of 1. Then you assign it a new random number, 1-5. Let's say you get 3:

    B: 2
    C: 3
    A: 3
    D: 4
    E: 5
    

    On a second run, you get B, and assign it a new random number, say 4:

    C: 3
    A: 3
    D: 4
    B: 4
    E: 5
    

    Then you get C, and give it a new random number, 2:

    C: 2
    A: 3
    D: 4
    B: 4
    E: 5
    

    It should be easy to see how you'll be starving some of your servers... any server "unlucky" enough to show up at the end of the list will probably stay there forever.

  5. A much better, and actually random, approach would be to assign each server a static number within a specified range, then select that number randomly (or pseudo-randomly with a hash). This is better for performance, as you're not doing a ton of writes, and it's actually random!

    SELECT proxy_ip
    FROM proxy
    WHERE id=(SELECT (RANDOM()*9300)::INT)
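
Following up on point 2, a minimal sketch of the index (the index name here is just illustrative):

CREATE INDEX proxy_random_number_idx ON proxy (random_number);

With that in place, the random_number > x lookup can use an index scan instead of reading the whole table.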
    
Jonathan Hall
  • Yes Flimzy, you're totally right. The reason that led me to use sort by random() (even being the "worst") is what you explain in 4. I like your idea in 5; I'll do some tests to see what happens, but maybe 1 explains why the "worst" is working better. Thank you! – Sergio Ayestarán Dec 13 '12 at 15:04