I already know that sorting by random() is the worst way to retrieve a random row. I've implemented the solution of adding a random_number column and using that column when retrieving a random row; after each retrieval I update the random_number. All of this backs a service that returns a random proxy IP:
select proxy_ip from proxy where random_number > 0.63 limit 1
0.63 is just an example of a random number generated inside the application.
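The update that follows each retrieval looks roughly like this (a sketch; it assumes the row's id is fetched together with proxy_ip so the same row can be updated, and 0.27 stands in for the next application-generated value):
update proxy set random_number = 0.27 where id = <id of the returned row>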
The thing is that when I use the "worst" solution:
select proxy_ip from proxy
order by random()
limit 1
It appears to run faster when the service is called. The table contains 9300 rows, so my question is: how many rows must a table contain before sorting by random() becomes the worst solution?
There is a little overhead in the application: it doesn't talk to the db directly but goes through a data layer that runs the queries, which partly explains why the "better" solution runs slower (it also issues two queries against the db instead of one, because it has to update the random_number).
EXPLAIN ANALYZE results:
Sort by random():
Limit  (cost=837.03..837.03 rows=1 width=18) (actual time=34.954..34.956 rows=1 loops=1)
  ->  Sort  (cost=837.03..860.46 rows=9373 width=18) (actual time=34.950..34.950 rows=1 loops=1)
        Sort Key: (random())
        Sort Method: top-N heapsort  Memory: 25kB
        ->  Seq Scan on proxy  (cost=0.00..790.16 rows=9373 width=18) (actual time=0.013..17.951 rows=9363 loops=1)
Total runtime: 34.993 ms
Using random column:
Limit  (cost=0.00..0.23 rows=1 width=18) (actual time=0.038..0.045 rows=1 loops=1)
  ->  Seq Scan on proxy  (cost=0.00..790.16 rows=3481 width=18) (actual time=0.029..0.029 rows=1 loops=1)
        Filter: (random_number > 0.63::double precision)
Total runtime: 0.078 ms
The table has 1 index:
CREATE UNIQUE INDEX proxy_pkey ON proxy USING btree (id)
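For reference, the plans above were obtained by running EXPLAIN ANALYZE on the two queries, along these lines:
explain analyze select proxy_ip from proxy order by random() limit 1;
explain analyze select proxy_ip from proxy where random_number > 0.63 limit 1;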