In my postgres database, I have the following relationships (simplified for the sake of this question):

Objects (currently has about 250,000 records)
-------
n_id
n_store_object_id (references store.n_id, 1-to-1 relationship, some objects don't have store records)
n_media_id (references media.n_id, 1-to-1 relationship, some objects don't have media records)

Store (currently has about 100,000 records)
-----
n_id
t_name,
t_description,
n_status,
t_tag

Media
-----
n_id
t_media_path

So far, so good. When I need to query the data, I run this (note the limit 2 at the end, as part of the requirement):

select
    o.n_id,
    s.t_name,
    s.t_description,
    me.t_media_path
from
    objects o
    join store s on (o.n_store_object_id = s.n_id and s.n_status > 0 and s.t_tag is not null)
    join media me on o.n_media_id = me.n_id
limit
    2

This works fine and gives me two entries back, as expected. The execution time on this is about 20 ms - just fine.

Now I need to get 2 random entries every time the query runs. I thought I'd add order by random(), like so:

select
    o.n_id,
    s.t_name,
    s.t_description,
    me.t_media_path
from
    objects o
    join store s on (o.n_store_object_id = s.n_id and s.n_status > 0 and s.t_tag is not null)
    join media me on o.n_media_id = me.n_id
order by
    random()
limit
    2

While this gives the right results, the execution time is now about 2,500 ms (over 2 seconds). This is clearly not acceptable, as it's one of a number of queries to be run to get data for a page in a web app.

So, the question is: how can I get random entries, as above, but still keep the execution time within some reasonable amount of time (i.e. under 100 ms is acceptable for my purpose)?

skaffman
Aleks G

4 Answers


Of course it needs to sort the whole result set by the random criteria before it can return the first rows. Maybe you can work around that by using random() in an offset instead?
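For illustration, a rough sketch of that workaround, reusing the query from the question. The literal 123456 is a placeholder for the total row count of the joined set, which the application would have to supply (e.g. from a cached count), since counting on every request is itself slow:

```sql
-- Sketch only: jump to a random offset instead of sorting everything.
-- 123456 stands in for the (cached) number of rows the join produces;
-- subtracting 2 leaves room for the LIMIT at the end.
SELECT
    o.n_id,
    s.t_name,
    s.t_description,
    me.t_media_path
FROM objects o
JOIN store s ON (o.n_store_object_id = s.n_id AND s.n_status > 0 AND s.t_tag IS NOT NULL)
JOIN media me ON o.n_media_id = me.n_id
OFFSET floor(random() * (123456 - 2))::int
LIMIT 2;
```

Note that the two rows returned are adjacent in scan order, so they are correlated; running the query twice with LIMIT 1 and two different offsets avoids that.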

Michael Krelin - hacker
  • In order to use `offset` I would need to know the total number of records in the resultset - right? However just doing `select count(1)` from the group still takes over 1 second. – Aleks G Nov 28 '11 at 18:28
  • Yes, you may have it cached or something… Also, you can combine Alex's approach with this. Well, that's the best I could come up with ;) – Michael Krelin - hacker Nov 28 '11 at 18:30
  • No worries, I'm not complaining. Caching the count is not an easy option, as it's one of the transactional tables, so the data there changes often (usually inserts and updates, though; very few deletes). – Aleks G Nov 28 '11 at 18:39
  • I just assumed that since you need randomness, being a bit out of sync is not awfully bad (as long as you don't delete rows). – Michael Krelin - hacker Nov 28 '11 at 18:41
  • Ok, having gone through various iterations and testing of different approaches, I've got something that _kind of_ works, loosely based on Alex's answer. Thanks for your help – Aleks G Nov 28 '11 at 19:02

Here's some previous work done on the topic which may prove helpful:

http://blog.rhodiumtoad.org.uk/2009/03/08/selecting-random-rows-from-a-table/

Craig Ringer

I'm thinking you'll be better off selecting random objects first, then performing the join to those objects after they're selected. I.e., query once to select random objects, then query again to join just those objects that were selected.
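As a sketch, that two-step approach might look like the following (the oversampling factor of 20 in step 1 is a guess, chosen to cover objects that end up filtered out by the store conditions in step 2):

```sql
-- Step 1: cheaply grab a handful of random candidate objects.
-- Only the narrow objects table is sorted here, not the full join.
SELECT n_id
FROM objects
WHERE n_store_object_id IS NOT NULL
  AND n_media_id IS NOT NULL
ORDER BY random()
LIMIT 20;

-- Step 2: run the original join against just those candidates,
-- substituting the ids returned by step 1.
SELECT
    o.n_id,
    s.t_name,
    s.t_description,
    me.t_media_path
FROM objects o
JOIN store s ON (o.n_store_object_id = s.n_id AND s.n_status > 0 AND s.t_tag IS NOT NULL)
JOIN media me ON o.n_media_id = me.n_id
WHERE o.n_id IN (/* ids from step 1 */)
LIMIT 2;
```

Step 1 still sorts all 250,000 object rows by random(), so it only helps to the extent that sorting the bare ids is cheaper than sorting the joined result.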

Alex Howansky
  • Thanks for the suggestion. Just tried it - better, but still not good enough: about 600 ms. Thanks anyway, though. – Aleks G Nov 28 '11 at 18:24
  • Ok, having gone through several iterations, etc, I finally got something that _kind of_ works. As it's loosely based on your suggestion, I'll accept your answer. – Aleks G Nov 28 '11 at 19:02

It seems like your problem is this: You have a table with 250,000 rows and need two random rows. Thus, you have to generate 250,000 random numbers and then sort the rows by their numbers. Two seconds to do this seems pretty fast to me.

The only real way to speed up the selection is to avoid generating 250,000 random numbers altogether and instead look rows up through an index.

I think you'd have to change the table schema to optimize for this case. How about something like:

  1) Create a new column with a sequence starting at 1.
  2) Every row will then have a number.
  3) Create an index on: number % 1000
  4) Query for rows where number % 1000 equals a random number between 0 and 999 (this should hit the index and load a random portion of your database).
  5) You can then add RANDOM() to your ORDER BY clause; it will only sort that chunk of your database, roughly 1,000x faster.
  6) Then select the first two of those rows.
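The steps above could be sketched like this (all new names are hypothetical; note that random() is volatile in PostgreSQL and would be re-evaluated per row if used inline in the WHERE clause, so the bucket number should be picked once by the application):

```sql
-- Steps 1-2: give every row a sequence number (hypothetical names).
CREATE SEQUENCE objects_n_seq;
ALTER TABLE objects ADD COLUMN n_seq bigint DEFAULT nextval('objects_n_seq');

-- Step 3: expression index on the bucket number.
CREATE INDEX idx_objects_bucket ON objects ((n_seq % 1000));

-- Steps 4-6: hit one random bucket (~250 of 250,000 rows), then sort
-- just that bucket. 123 stands in for a bucket number in [0, 999]
-- generated by the application for each request.
SELECT
    o.n_id,
    s.t_name,
    s.t_description,
    me.t_media_path
FROM objects o
JOIN store s ON (o.n_store_object_id = s.n_id AND s.n_status > 0 AND s.t_tag IS NOT NULL)
JOIN media me ON o.n_media_id = me.n_id
WHERE o.n_seq % 1000 = 123
ORDER BY random()
LIMIT 2;
```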

If this still isn't random enough (since the two rows will always come from the same bucket), you could union two single-row queries, or put an OR clause in the query and generate two random keys.

Hopefully something along these lines could be very fast and decently random.

Mike Christensen