4

I am writing a Django application that deals with financial data processing. I have to load a large amount of data (more than 1,000,000 records) from a MySQL table and convert the records to JSON data in a Django view, as follows:

trades = MtgoxTrade.objects.all()
data = []
for trade in trades:
    js = dict()
    js['time'] = trade.time
    js['price'] = trade.price
    js['amount'] = trade.amount
    js['type'] = trade.type
    data.append(js)
return data

The problem is that the for loop is very slow (it takes more than 9 seconds for 200,000 records). Is there any efficient way to convert DB records to JSON format data in Python?

Updated: I have run the code from Mike Housky's answer in my environment (ActivePython 2.7, Win7), with create_data changed to pull from the database, and got the following results:

def create_data(n):
    from api.models import MtgoxTrade
    result = MtgoxTrade.objects.all()

    return result


  Build ............ 0.330999851227
  For loop ......... 7.98400020599
  List Comp. ....... 0.457000017166
  Ratio ............ 0.0572394796312
  For loop 2 ....... 0.381999969482
  Ratio ............ 0.047845686326

You will find that the for loop takes about 8 seconds! And if I comment out the for loop, then the list comprehension takes roughly the same time:

Times:
  Build ............ 0.343000173569
  List Comp. ....... 7.57099986076
  For loop 2 ....... 0.375999927521

My new question is whether the for loop touches the database. But I did not see any DB access in the logs. So strange!

Simon Wang

6 Answers

3

Here are several tips/things to try.

Since you eventually need to make a JSON string from the queryset, use Django's built-in serializers:

from django.core import serializers

data = serializers.serialize("json", 
                             MtgoxTrade.objects.all(), 
                             fields=('time','price','amount','type'))

You can make serialization faster by using the ujson or simplejson modules; see the SERIALIZATION_MODULES setting.
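For example, here is a minimal sketch of wiring ujson in through SERIALIZATION_MODULES (the "fastjson" format name and the api.fast_json module path are only illustrative choices, not something your project already has):

# settings.py
SERIALIZATION_MODULES = {
    "fastjson": "api.fast_json",  # any importable module defining Serializer/Deserializer
}

# api/fast_json.py -- rough sketch
import ujson  # third-party: pip install ujson

from django.core.serializers.python import Serializer as PythonSerializer
from django.core.serializers.json import Deserializer  # reuse the stock JSON deserializer

class Serializer(PythonSerializer):
    """Let PythonSerializer collect plain dicts, then dump them all at once with ujson."""
    internal_use_only = False

    def getvalue(self):
        # self.objects is the list of dicts accumulated by PythonSerializer.
        # Caveat: ujson knows nothing about DjangoJSONEncoder, so Decimal and
        # datetime fields may not come out exactly like the stock "json" format.
        return ujson.dumps(self.objects)

After that, serializers.serialize("fastjson", MtgoxTrade.objects.all(), fields=('time','price','amount','type')) should behave like the snippet above, just with a faster dump step.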

Also, instead of getting all the field values from the record, be explicit and get only what you need to serialize:

MtgoxTrade.objects.all().values('time','price','amount','type')

Also, you may want to use the iterator() method of a queryset:

...For a QuerySet which returns a large number of objects that you only need to access once, this can result in better performance and a significant reduction in memory...
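For example (a rough sketch using the field names from the question; default=str is just a blunt way to cope with Decimal/datetime values):

import json

# iterator() skips the queryset's result cache and model-instance creation;
# the list() call still materializes the rows once so json.dumps can see them.
rows = MtgoxTrade.objects.values('time', 'price', 'amount', 'type').iterator()
data = json.dumps(list(rows), default=str)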

Also, you can split your huge queryset into batches, see: Batch querysets.
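A minimal slicing-based sketch of that idea (the batch size is an arbitrary choice):

batch_size = 100000
qs = MtgoxTrade.objects.values('time', 'price', 'amount', 'type')
data = []
for start in range(0, MtgoxTrade.objects.count(), batch_size):
    # each slice is evaluated as its own LIMIT/OFFSET query
    data.extend(qs[start:start + batch_size])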

alecxe
  • Sadly this doesn't seem to work when you're using a `GeometryField`. Just faced [one such situation today](http://stackoverflow.com/questions/18903652/get-django-geometryfields-coordinates-with-values). – Bibhas Debnath Sep 20 '13 at 12:48
  • I have used Django serializers, but there seems to be no performance boost either. And my MtgoxTrade table only has the above 4 fields. – Simon Wang Sep 20 '13 at 12:57
  • "trades = MtgoxTrade.objects.all()" only takes less than 1 second, but the for loop takes more than 8 seconds! – Simon Wang Sep 20 '13 at 13:08
  • Yes. "MtgoxTrade.objects.all().values('time','price','amount','type')" does boost performance, by about 20%! – Simon Wang Sep 21 '13 at 01:47
2

You can use a list comprehension, as that avoids many dict() and append() calls:

trades = MtgoxTrade.objects.all()
data = [{'time': trade.time, 'price': trade.price, 'amount': trade.amount, 'type': trade.type}
        for trade in trades]
return data

Function calls are expensive in Python so you should aim to avoid them in slow loops.
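You can measure that yourself with a quick, Django-free check, for example:

import timeit

setup = "items = list(range(200000))"

loop = """
data = []
for x in items:
    d = dict()
    d['v'] = x
    data.append(d)
"""

comp = "data = [{'v': x} for x in items]"

print("for loop:  %.3fs" % timeit.timeit(loop, setup, number=10))
print("list comp: %.3fs" % timeit.timeit(comp, setup, number=10))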

Simeon Visser
  • A list comprehension does not give you any significant speedup here. –  Sep 20 '13 at 11:38
  • Maybe this is implementation-dependent. On Win7, CPython 2.7.3 and 3.3.0 I see improvements of 25% or more for list comprehension over the for loop. I'll post the code in an answer. – Mike Housky Sep 20 '13 at 14:34
1

This answer is in support of Simeon Visser's observation. I ran the following code:

import gc, random, time
if "xrange" not in dir(__builtins__):
    xrange = range

class DataObject(object):
    def __init__(self, time, price, amount, type):
        self.time = time
        self.price = price
        self.amount = amount
        self.type = type

def create_data(n):
    result = []
    for index in xrange(n):
        s = str(index)
        result.append(DataObject("T"+s, "P"+s, "A"+s, "ty"+s))
    return result

def convert1(trades):
    data = []
    for trade in trades:
        js = dict()
        js['time'] = trade.time
        js['price'] = trade.price
        js['amount'] = trade.amount
        js['type'] = trade.type
        data.append(js)
    return data

def convert2(trades):
    data = [{'time': trade.time, 'price': trade.price, 'amount': trade.amount, 'type': trade.type}
        for trade in trades]
    return data

def convert3(trades):
    ndata = len(trades)
    data = ndata*[None]
    for index in xrange(ndata):
        t = trades[index]
        js = dict()
        js['time'] = t.time
        js['price'] = t.price
        js['amount'] = t.amount
        js['type'] = t.type
        # BUG: js is never stored back into data[index], so data stays a list
        # of None values -- see the note below the timings.
        #js = {"time" : t.time, "price" : t.price, "amount" : t.amount, "type" : t.type}
    return data

def main(n=1000000):

    t0s = time.time()
    trades = create_data(n)
    t0f = time.time()
    t0 = t0f - t0s

    gc.disable()

    t1s = time.time()
    jtrades1 = convert1(trades)
    t1f = time.time()
    t1 = t1f - t1s

    t2s = time.time()
    jtrades2 = convert2(trades)
    t2f = time.time()
    t2 = t2f - t2s

    t3s = time.time()
    jtrades3 = convert3(trades)
    t3f = time.time()
    t3 = t3f - t3s

    gc.enable()

    print ("Times:")
    print ("  Build ............ " + str(t0))
    print ("  For loop ......... " + str(t1))
    print ("  List Comp. ....... " + str(t2))
    print ("  Ratio ............ " + str(t2/t1))
    print ("  For loop 2 ....... " + str(t3))
    print ("  Ratio ............ " + str(t3/t1))

main()

Results on Win7, Core 2 Duo 3.0 GHz, with Python 2.7.3:

Times:
  Build ............ 2.95600008965
  For loop ......... 0.699999809265
  List Comp. ....... 0.512000083923
  Ratio ............ 0.731428890618
  For loop 2 ....... 0.609999895096
  Ratio ............ 0.871428659011

Python 3.3.0:

Times:
  Build ............ 3.4320058822631836
  For loop ......... 1.0200011730194092
  List Comp. ....... 0.7500009536743164
  Ratio ............ 0.7352942070195492
  For loop 2 ....... 0.9500019550323486
  Ratio ............ 0.9313733946208623

Those vary a bit, even with GC disabled (much more variance with GC enabled, but about the same results). The third conversion timing shows that a fair-sized chunk of the saved time comes from not calling .append() a million times.

Ignore the "For loop 2" times. This version has a bug (js is built but never stored back into data[index]) and I am out of time to fix it for now.

Mike Housky
0

First you have to check whether the performance loss happens while fetching the data from the database or inside the loop.
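For instance, something along these lines separates the two (the import path is only a guess based on the question's update):

import time
from api.models import MtgoxTrade  # assumed from the question

t0 = time.time()
trades = list(MtgoxTrade.objects.all())  # list() forces the database query here
t1 = time.time()

data = [{'time': t.time, 'price': t.price, 'amount': t.amount, 'type': t.type}
        for t in trades]
t2 = time.time()

print("fetch:   %.3f s" % (t1 - t0))
print("convert: %.3f s" % (t2 - t1))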

There is no real option that will give you a significant speedup; using a list comprehension, as noted above, won't either.

However, there is a huge performance difference between Python 2 and 3.

A simple benchmark like the following showed me that the for loop is roughly 2.5 times faster with Python 3.3:

import time

ts = time.time()
data = list()
for i in range(1000000):
    d = {}
    d['a'] = 1
    d['b'] = 2
    d['c'] = 3
    d['d'] = 4
    d['a'] = 5
    data.append(d)

print(time.time() - ts)



/opt/python-3.3.0/bin/python3 foo2.py 
0.5906929969787598

python2.6 foo2.py 
1.74390792847

python2.7 foo2.py 
0.673550128937

You will also note that there is a significant performance difference between Python 2.6 and 2.7.

  • I use Python 2.7; your sample code runs in less than 1 second, while my loop runs for more than 8 seconds with only 200,000 records? – Simon Wang Sep 20 '13 at 12:45
0

I think it's worth trying a raw query against the database, because a Model adds a lot of extra boilerplate around fields (I believe fields are properties) and, as mentioned previously, function calls are expensive. See the documentation: there is an example at the bottom of the page that uses dictfetchall, which seems like what you are after.
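Roughly like this (the table name api_mtgoxtrade is only a guess based on Django's default <app>_<model> naming; dictfetchall is the helper from the documentation):

from django.db import connection

def dictfetchall(cursor):
    """Return all rows from a cursor as a list of dicts (helper from the Django docs)."""
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]

def load_trades():
    cursor = connection.cursor()
    cursor.execute("SELECT time, price, amount, type FROM api_mtgoxtrade")
    return dictfetchall(cursor)  # wrap in json.dumps(..., default=str) if you need a JSON string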

hitripekac
0

You might want to look into the values() method. It will return an iterable of dicts instead of model objects, so you don't have to create a lot of intermediate data structures. Your code could be reduced to:

return MtgoxTrade.objects.values('time', 'price', 'amount', 'type')
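If the view has to return an actual JSON string, something like this should do it (default=str is just a blunt fallback for Decimal/datetime values):

import json

data = json.dumps(list(MtgoxTrade.objects.values('time', 'price', 'amount', 'type')),
                  default=str)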
Joseph Sheedy