Get random record set with Django, what is affecting the performance？

Question

It said that

Record.objects.order_by('?')[:n]

have performance issues, and recommend doing something like this: (here)

sample = random.sample(xrange(Record.objects.count()),n)
result = [Record.objects.all()[i] for i in sample]

Since that, why not do it directly like this:

result = random.sample(Record.objects.all(),n)

I have no idea about when these code running what is django actually doing in background. Please tell me the one-line-code at last is more efficient or not? why?

================Edit 2013-5-12 23:21 UCT+8 ========================

I spent my whole afternoon to do this test.
My computer : CPU Intel i5-3210M RAM 8G System : Win8.1 pro x64 Wampserver2.4-x64 (with apache2.4.4 mysql5.6.12 php5.4.12) Python2.7.5 Django1.4.6

What I did was:

Create an app.
build a simple model with a index and a CharField content, then Syncdb.
Create 3 views can get a random set with 20 records in 3 different ways above, and output the time used.
Modify settings.py that Django can output log into console.
Insert rows into table, untill the number of the rows is what I want.
Visit the 3 views, note the SQL Query statement, SQL time, and the total time
repeat 5, 6 in different number of rows in the table.(10k, 200k, 1m, 5m)

This is views.py:

def test1(request):  
    start = datetime.datetime.now()  
    result = Record.objects.order_by('?')[:20]  
    l = list(result) # Queryset是惰性的，强制将Queryset转为list  
    end = datetime.datetime.now()  
    return HttpResponse("time: <br/> %s" % (end-start).microseconds/1000))  

def test2(request):  
    start = datetime.datetime.now()  
    sample = random.sample(xrange(Record.objects.count()),20)  
    result = [Record.objects.all()[i] for i in sample]  
    l = list(result)  
    end = datetime.datetime.now()  
    return HttpResponse("time: <br/> %s" % (end-start)  

def test3(request):  
    start = datetime.datetime.now()  
    result = random.sample(Record.objects.all(),20)  
    l = list(result)  
    end = datetime.datetime.now()  
    return HttpResponse("time: <br/> %s" % (end-start)

As @Yeo said,result = random.sample(Record.objects.all(),n) is crap. I won't talk about that.
But interestingly, Record.objects.order_by('?')[:n] always better then others, especially the table smaller then 1m rows. Here is the data: test data

and the charts: SQL cost total cost

So, what's happened? In the last test, 5,195,536 rows in tatget table, result = random.sample(Record.objects.all(),n) actually did ths:

(22.275) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` 
FROM `randomrecords_record` ORDER BY RAND() LIMIT 20; args=()

Every one is right. And it used 22 seconds. And

sample = random.sample(xrange(Record.objects.count()),n)
result = [Record.objects.all()[i] for i in sample]

actually did ths:

(1.393) SELECT COUNT(*) FROM `randomrecords_record`; args=()
(3.201) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` 
FROM `randomrecords_record` LIMIT 1 OFFSET 4997880; args=()
...20 lines

As you see, get one row, cost 3 seconds. I find that the larger index, the more time needed.
But... why?

My think is:
If there is some way can speed up the large index query,

sample = random.sample(xrange(Record.objects.count()),n)
result = [Record.objects.all()[i] for i in sample]

should be the best. Except(!) the table is smaller then 1m rows.

Hey, @shx2 I have tested, but got more confused. – CuGBabyBeaR Dec 05 '13 at 15:56 — CuGBabyBeaR, Dec 05 '13 at 15:56

score 2 · Answer 1 · edited Aug 27 '15 at 05:37

The problem with .order_by(?) is that under the hood it does ORDER BY RAND() (or equivalent, depending on DB) which basically has to create a random number for each row and do the sorting. This is a heavy operation and requires lots of time.

On the other hand doing Record.objects.all() forces your app to download all objects and then you choose from it. It is not that heavy on the database side (it will be faster then sorting) but it is heavy on network and memory. Thus it can kill your performance as well.

So that's the tradeoff.

Now this is a lot better:

sample = random.sample(xrange(Record.objects.count()),n)
result = [Record.objects.all()[i] for i in sample]

simply because it avoids all the problems mentioned above (note that Record.objects.all()[i] gets translated to SELECT * FROM table LIMIT 1 OFFSET i, depending on DB).

However it may still be inefficient since .count might be slow (as usual: depends on DB).

something strange with brackets `result = [..()[..])..]` - one is lost — akaRem, Dec 05 '13 at 10:34

Yeo · Answer 2 · 2013-12-05T11:48:45.947

1

Record.objects.count() gets translated into very light SQL Query.

SELECT COUNT(*) FROM TABLE

Record.objects.all()[0] is also translated into a very light SQL Query.

SELECT * FROM TABLE LIMIT 1

Record.objects.all() usually the results get slice off to increase the performance

SELECT * FROM table LIMIT 20;  // or something similar

list(Record.objects.all()) will query all the data and put it into a list data structure.

SELECT * FROM TABLE

Thus, any time you convert a Queryset into a list, that's where the expensive happened

In your example, random.sample() will convert into a list. (If I'm not wrong).

Thus when you do result = random.sample(Record.objects.all(),n) it will do the Full Queryset and convert into a list and then random pick the list.

Just imagine if you have millions of records. Are you going to query and store it into a list with millions element? or would you rather query one by one

edited Dec 05 '13 at 11:48

answered Dec 05 '13 at 06:48

Yeo

11,416
6
63
90

Performance of `.count` depends on DB. It is known to be **very** slow in Postgres. – freakish Dec 05 '13 at 06:53
I did some test this afternoon, You are right. But it seems that the first simple way is not so bad in mysql, especially in a small table. – CuGBabyBeaR Dec 05 '13 at 15:50
@CuGBabyBeaR To solve your problem, Certain question to be asked are... are you using MySQL too on your server? and another different question would be Is your server currently having a huge dataset or estimated to be growing very quickly? (check the number carefully) make sure you're not fall into the [premature optimization](http://c2.com/cgi/wiki?PrematureOptimization). – Yeo Dec 05 '13 at 16:25
@Yeo You make a good point. I'll handle it later, when i actually need it. – CuGBabyBeaR Dec 05 '13 at 17:03

Get random record set with Django, what is affecting the performance？

2 Answers2