I'm running a conversion script that commits large amounts of data to a db using Django's ORM. I use manual commit to speed up the process. I have hundreds of files to commit, and each file will create more than a million objects.
I'm using Windows 7 64-bit. I noticed the Python process keeps growing until it consumes more than 800 MB, and that's only for the first file!
The script loops over the records in a text file, reusing the same variables and not accumulating any lists or tuples.
I read here that this is a general problem for Python (and perhaps for any program), but I was hoping Django or Python has some explicit way to reduce the process size (there's a rough sketch of one idea I had after the code overview below)...
Here's an overview of the code:
import sys, os
sys.path.append(r'D:\MyProject')
os.environ['DJANGO_SETTINGS_MODULE'] = 'my_project.settings'
from django.core.management import setup_environ
from convert_to_db import settings
from convert_to_db.convert.models import Model1, Model2, Model3
setup_environ(settings)
from django.db import transaction

@transaction.commit_manually  # transactions are committed explicitly below
def process_file(filename):
    data_file = open(filename, 'r')
    model1, created = Model1.objects.get_or_create([some condition])
    if created:
        model1.save()
    input_row_i = 0  # row counter
    while 1:
        line = data_file.readline()
        if line == '':
            break
        input_row_i += 1
        if not (input_row_i % 5000):  # commit every 5000 rows
            transaction.commit()
        line = line[:-1]  # remove \n
        elements = line.split(',')
        d0 = elements[0]
        d1 = elements[1]
        d2 = elements[2]
        model2, created = Model2.objects.get_or_create([some condition])
        if created:
            model2.save()
        model3 = Model3(d0=d0, d1=d1, d2=d2)
        model3.save()
    data_file.close()
    transaction.commit()
# Some code that calls process_file() per file
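One idea I'm considering, though I haven't verified it addresses the real cause: while DEBUG is True, Django keeps every executed query in django.db.connection.queries, so clearing that log (and forcing a garbage-collection pass) next to the periodic commit might bound the growth. A rough sketch of what I mean (reset_queries() and gc.collect() here are my guesses, not something I've confirmed helps):

import gc
from django import db

# ... inside the per-record loop, next to the periodic commit ...
if not (input_row_i % 5000):
    transaction.commit()
    db.reset_queries()  # drop the query log Django keeps while DEBUG=True
    gc.collect()        # ask Python to release unreferenced objects

Is something like that the right direction, or is there a better, explicit way to keep the process size down?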