
I'm making a simple web app that pulls down a lot of data (currently over 64,000 fields across two tables, and rising) from a MySQL database using Flask-SQLAlchemy. I only ran one instance and didn't notice any problems until I ran out of RAM and the whole thing ground to a halt.

I realize I might have to rethink how this entire thing works, since I can't keep adding RAM (it's a small virtual machine, so I'm up from 512 to 768 MB, but still)... but surely Python should release the memory after dealing with the request? Indeed, when I run the same Flask app on my Windows machine it eats up RAM (though only half as much!) and releases it when the request is done. Not so on the Debian machine, which runs many (supposedly) tiny apps. As far as I know the library versions on both machines are the same, the Debian machine even has a later Python version than the Windows one, and both connect to the same database. How do I proceed?

from flask import Flask, request, jsonify
from flask_sqlalchemy import SQLAlchemy

import re
from datetime import datetime

app = Flask(__name__)
app.config.from_pyfile('settings.cfg')
db = SQLAlchemy(app)

class Reports(db.Model):
    __tablename__ = 'reports'

    id          = db.Column(db.Integer, primary_key=True)
    ip          = db.Column(db.Integer)
    date        = db.Column(db.DateTime)
    sid         = db.Column(db.Integer)
    version     = db.Column(db.Integer)
    itemname    = db.Column(db.String(25))
    group       = db.Column(db.Integer)
    pclass      = db.Column(db.String(15))
    ltime       = db.Column(db.Integer)
    rlen        = db.Column(db.Integer)
    total       = db.Column(db.Integer)

    def __init__(self, pd):
        self.date = datetime.utcnow()
        self.sid = pd["sid"]
        self.version = pd["version"]
        self.itemname = pd["itemname"]
        self.group = pd["group"]
        self.pclass = pd["pclass"]
        self.ltime = pd["ltime"]
        self.rlen = pd["rlen"]
        self.total = pd["total"]


class Perfdata(db.Model):
    __tablename__ = 'perfdata'

    reportid    = db.Column(db.Integer, db.ForeignKey('reports.id'), primary_key=True)
    l70 = db.Column(db.Integer)
    l65 = db.Column(db.Integer)
    l60 = db.Column(db.Integer)
    l55 = db.Column(db.Integer)
    l50 = db.Column(db.Integer)
    l45 = db.Column(db.Integer)
    l40 = db.Column(db.Integer)
    l35 = db.Column(db.Integer)
    l30 = db.Column(db.Integer)

    def __init__(self, reportid, pd):
        self.reportid = reportid
        self.l70 = pd["l70"]
        self.l65 = pd["l65"]
        self.l60 = pd["l60"]
        self.l55 = pd["l55"]
        self.l50 = pd["l50"]
        self.l45 = pd["l45"]
        self.l40 = pd["l40"]
        self.l35 = pd["l35"]
        self.l30 = pd["l30"]

    def buildlist(self):
        # Return the level counts in order, l70 down to l30.
        return [self.l70, self.l65, self.l60, self.l55, self.l50,
                self.l45, self.l40, self.l35, self.l30]


@app.route('/ps', methods=['GET'])
def perfget():

    response = []

    # Join on the foreign key so each report is paired with its own perf row
    # (without the join this query produces a cartesian product).
    for report, perf in db.session.query(Reports, Perfdata).join(
            Perfdata, Perfdata.reportid == Reports.id).all():

        response.append("")

        response.append("%s %s %s %s %s %s %s %s" % (report.version,
                                                     report.sid,
                                                     report.itemname,
                                                     report.group,
                                                     report.pclass,
                                                     report.ltime,
                                                     report.rlen,
                                                     report.total))

        response.append("%s %s %s %s %s %s %s %s %s" % (perf.l70,
                                                     perf.l65,
                                                     perf.l60,
                                                     perf.l55,
                                                     perf.l50,
                                                     perf.l45,
                                                     perf.l40,
                                                     perf.l35,
                                                     perf.l30))

    return '<br>\n'.join(response)


if __name__ == '__main__':
    app.run()
Neil Albarran
  • So, you have 64,000 table entries at around 80 bytes per record. That's about 5MB in itself, but the way it's stored in the Python code could quite easily double the bytes per record, so 10MB for the data on its own. You probably have more than one copy of all the data while the response is being built, so I wouldn't be surprised if each instance of your Python code uses some 30-40MB. So what exactly do you want us to do about that? You may want to grab only a portion of the records at a time, using "limit"? – Mats Petersson Jan 22 '13 at 13:54
  • I want it to be released back to the OS after processing. – Neil Albarran Jan 22 '13 at 15:00
  • Python doesn't have a clear story as far as releasing memory back to the OS once it's taken (see Martelli [here](http://stackoverflow.com/a/1316799/34549) and otherwise google for "free lists"), so your best bet is not to use that memory in the first place. The MySQL drivers will buffer all rows from a SELECT before returning them, and the SQLAlchemy "Query" object in the ORM also processes results fully by default before returning them (there are options to change this, with caveats; see the streaming sketch after these comments). Reading smaller result sets is the best general approach. – zzzeek Jan 23 '13 at 00:56
  • So Python on Linux not releasing memory is normal behaviour? I would have to switch to a Windows server if that is the case. – Neil Albarran Jan 23 '13 at 12:15
  • No, what I think people are suggesting is that you re-think your approach to displaying and querying the data - there's a reason why pretty much everyone else returns a handful of results per page and requires the user to change pages. If you start thinking about doing your own memory management, or using the libraries in a pathological way, you will end up with considerably less hair. Perhaps you should describe your use case so people can suggest alternatives and best practices. – MerseyViking Jan 29 '13 at 14:43
  • The use case is compiling a big list of data. I could do this with a cron job instead, saving the results to a static file, but that doesn't solve my underlying problem: surely this is not how Python should work. – Neil Albarran Jan 29 '13 at 15:41
  • So you're creating an API rather than serving web pages. Then you'll have to do some tests, but I would recommend writing your server code to have a "GetCapabilities" type request that sends back the maximum number of records it can serve in one request - and this can be set in a server-side config file. Then have a URL parameter that the client can specify the range with, and make them do several requests to get the entire dataset (see the pagination sketch after these comments). This is fairly normal behaviour for a server such as this. Don't try to "fix" Python - there's probably a good reason you see this behaviour. – MerseyViking Jan 29 '13 at 19:44
  • As for the MySQL driver buffering all the rows, you can prevent this by telling MySQL to stream the results (see the sketch just below). – Gewthen Feb 10 '16 at 01:43
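To make the comments above concrete, here is a minimal sketch (not from the original post) of the streaming approach zzzeek and Gewthen describe: a server-side cursor (MySQLdb's SSCursor, passed via connect_args) keeps the driver from buffering the whole result set, and yield_per() makes the ORM build objects in batches. The connection URL, the batch size of 100, and the process() helper are all placeholders:

import MySQLdb.cursors
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# Server-side cursor: MySQL streams rows to the client instead of
# buffering the entire result set before returning the first row.
engine = create_engine(
    "mysql://user:password@localhost/mydb",  # placeholder URL
    connect_args={"cursorclass": MySQLdb.cursors.SSCursor},
)
session = sessionmaker(bind=engine)()

# yield_per() materializes ORM objects in chunks of 100 rather than
# loading all 64,000+ rows at once (caveat: don't combine it with
# eager loading).
query = (session.query(Reports, Perfdata)
         .join(Perfdata, Perfdata.reportid == Reports.id)
         .yield_per(100))

for report, perf in query:
    process(report, perf)  # placeholder per-row handler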

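And a sketch of the paging approach MerseyViking suggests, replacing the /ps handler from the question: the client passes offset and count parameters, the server caps the count, and a "capabilities" endpoint advertises that cap. The endpoint name and the 1000-record maximum are made up for illustration:

MAX_RECORDS_PER_REQUEST = 1000  # would come from a server-side config file

@app.route('/ps/capabilities', methods=['GET'])
def capabilities():
    # "GetCapabilities"-style request: tell clients how many records
    # they may ask for in one go.
    return jsonify(max_records=MAX_RECORDS_PER_REQUEST)

@app.route('/ps', methods=['GET'])
def perfget():
    offset = request.args.get('offset', 0, type=int)
    count = min(request.args.get('count', 100, type=int),
                MAX_RECORDS_PER_REQUEST)

    response = []
    rows = (db.session.query(Reports, Perfdata)
            .join(Perfdata, Perfdata.reportid == Reports.id)
            .order_by(Reports.id)  # stable ordering across requests
            .offset(offset)
            .limit(count))
    for report, perf in rows:
        # Format each pair as in the question (abbreviated here).
        response.append("%s %s %s" % (report.sid, report.itemname, perf.l70))
    return '<br>\n'.join(response)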
1 Answer


Python may not know when to free the memory, so you can help it figure things out:

import gc

# Explicitly trigger a full garbage-collection pass.
gc.collect()
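For example, you could trigger a collection at the end of every request with Flask's standard teardown_request hook (whether the freed memory actually goes back to the OS depends on the allocator, as the comments above explain):

import gc

@app.teardown_request
def collect_garbage(exc):
    # Run a full garbage-collection pass once the request is finished.
    gc.collect()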
TTT