
I would like to retrieve a random set of X entries from my Postgres database using SQLAlchemy. My first approach was this:

from sqlalchemy import func
random_set_of_Xrows = models.Table.query.filter(something).order_by(func.random()).limit(X).all()

Since my table is quite big, this command takes about one second, and I was wondering how to optimise it. I guess order_by has to look at all rows, so I figured using offset instead might make it faster. However, I can't quite see how to avoid the row count entirely. Here is an approach using offset:

rowCount = db.session.query(func.count(models.Table.id)).filter(something).scalar()
random_set_of_Xrows = models.Table.query.filter(something).offset(func.floor(func.random() * rowCount)).limit(X).all()

which, however, is not faster, with most of the time spent getting rowCount. Any ideas on how to make this faster? Cheers, carl

EDIT: As suggested below, I added a column with a random value to the table and used it to extract the rows like this:

random_set_of_Xrows = models.Table.query.filter(something).order_by(models.Table.random_value).limit(X).all()

I ignored the offset part, since it doesn't matter to me if two calls give me the same results; I just need a random set of rows.

  • does that help? https://stackoverflow.com/questions/60805/getting-random-row-through-sqlalchemy#60815 – hiro protagonist Dec 25 '16 at 13:09
  • yep I have seen that, but it does not really address the problem of avoiding the row count... if that is possible at all... – carl Dec 25 '16 at 13:10
  • hi hiro... sorry, but how does this suggestion differ from my first approach? – carl Dec 25 '16 at 13:27
  • `how to avoid the row count entirely` - consider using an estimate instead of a precise row count: `SELECT reltuples::bigint AS estimate FROM pg_class WHERE relname='mytable';` See this answer for details (and the sketch after these comments): https://stackoverflow.com/questions/7943233/fast-way-to-discover-the-row-count-of-a-table-in-postgresql/7945274#7945274 – krokodilko Dec 25 '16 at 14:10
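
For reference, here is what krokodilko's suggestion might look like from SQLAlchemy. This is a minimal sketch assuming the question's `db.session` and a table named `mytable`; note that `reltuples` is the planner's estimate for the whole table and knows nothing about the `filter(something)` clause.

import random

from sqlalchemy import text

# Planner's row estimate for the table: near-instant, unlike COUNT(*),
# but approximate and unaware of any WHERE clause.
estimated_rows = db.session.execute(
    text("SELECT reltuples::bigint AS estimate FROM pg_class WHERE relname = :t"),
    {"t": "mytable"},
).scalar()

# Use the estimate to bound a random offset instead of a precise count.
random_offset = int(random.random() * estimated_rows)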

1 Answer


I've optimized this before by adding an indexed column r that gets a random value automatically when a row is inserted. Then, when you need a random set of rows, just SELECT * FROM table ORDER BY r LIMIT 10 OFFSET some_random_value. You can run a script that updates your schema to add this column and backfills it for your existing rows. You'll add a slight performance hit to writes with this approach, but if this is functionality you need persistently, it should be a fair trade-off.
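
For concreteness, here is a minimal sketch of this scheme in the question's Flask-SQLAlchemy style; the column name r, the backfill statement, and the placeholder row count are illustrative assumptions rather than part of the answer.

import random

from sqlalchemy import Column, Float, Integer

class Table(db.Model):  # assumes the question's Flask-SQLAlchemy setup
    __tablename__ = 'table'
    id = Column(Integer, primary_key=True)
    # Random value assigned once at insert time; because the column is
    # indexed, ORDER BY r can walk the index instead of sorting the table.
    r = Column(Float, index=True, nullable=False, default=random.random)

# One-off backfill for rows that existed before the column was added:
#   UPDATE "table" SET r = random();

# Fetch 10 rows starting at a random position in the index.
approx_rows = 1000000  # placeholder: substitute a real or estimated count
start = random.randrange(max(approx_rows - 10, 1))
random_rows = Table.query.order_by(Table.r).offset(start).limit(10).all()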

  • hi mVChr... thanks for the answer... I would have thought order_by needs to look at all rows? – carl Dec 25 '16 at 15:44
  • just to update my previous comment... I implemented it and it is not faster than my previous solutions... I added my exact implementation above – carl Dec 25 '16 at 16:34
  • apologies, your approach is faster... It turned out that the filter steps in my call were the reason it did not get faster... I will accept your answer – carl Dec 25 '16 at 16:42
  • @carl Because it's indexed, the order of that column is known and doesn't need to be sorted at query time. – mVChr Dec 25 '16 at 18:45