
I'm using Django adaptors to upload a simple CSV. It seems to work perfectly when I'm importing 100 or 200 contacts, but when I try to upload a 165kb file with 5000 contacts, it never completes. I let it keep trying, and when I came back after an hour it was still running.

What's wrong with this? There is no way a 165kb file should take over an hour to import with Django adaptors. Is there something wrong with the code?

def process(self):
    self.date_start_processing = timezone.now()
    try:
        # Try and import CSV
        ContactCSVModel.import_data(data=self.filepath, extra_fields=[
            {'value': self.group_id, 'position': 5},
            {'value': self.uploaded_by.id, 'position': 6}])

        self._mark_processed(self.num_records)
    except Exception as e:
        self._mark_failed(unicode(e))

CsvModel

class ContactCSVModel(CsvModel):

    first_name = CharField()
    last_name = CharField()
    company = CharField()
    mobile = CharField()
    group = DjangoModelField(Group)
    contact_owner = DjangoModelField(User)


    class Meta:
        delimiter = "^"
        dbModel = Contact
        update = {'keys': ["mobile", "group"]}
Prometheus
  • Don't know this app. Is there a way to set a batch size so it processes 500 items at a time? Is there any logging happening while it's running? Have you tried just cutting it into 2 files (2500 each) and seeing what happens? Do you have an activity monitor app to check the memory and processor usage while the script is running? – François Constant Apr 15 '13 at 00:50
  • What's the database backend? – Aya Apr 17 '13 at 15:30
  • @Spike Might be all the disk flushes if you're bulk loading with InnoDB. Have you tried doing a `SET GLOBAL innodb_flush_log_at_trx_commit=2;`? – Aya Apr 18 '13 at 08:34
  • Not sure if that's the issue. **group** is a reserved word, you should avoid it in your code (or at least define a different column name for it). – François Constant Apr 18 '13 at 07:18
  • group is not a reserved word in python. – Yossi Apr 18 '13 at 07:24
  • Do you have `DEBUG` set to `False`? – Daniel Hepper Apr 18 '13 at 11:59

4 Answers


Split up your larger task into smaller pieces.

Step 1 - Just read a CSV file

Both ContactCSVModel.import_from_filename() and ContactCSVModel.import_from_file() return the CSV lines. Disable the interaction with your Django model (drop dbModel from Meta, as below) to skip the database entirely. This should speed up the task considerably, and printing the imported data lets you verify that the file is read correctly. This step should definitely work!

CSVModel

class ContactCSVModel(CsvModel):

    first_name = CharField()
    last_name = CharField()
    company = CharField()
    mobile = CharField()
    group = DjangoModelField(Group)
    contact_owner = DjangoModelField(User)


    class Meta:
        delimiter = "^"

Your code

def process(self):
    self.date_start_processing = timezone.now()
    try:
        # Try and import CSV
        lines = ContactCSVModel.import_data(data=self.filepath, extra_fields=[
            {'value': self.group_id, 'position': 5},
            {'value': self.uploaded_by.id, 'position': 6}])
        print lines  # or use logging

        self._mark_processed(self.num_records)
    except Exception as e:
        self._mark_failed(unicode(e))

Step 2 - Enable Django model interaction BUT disable the check for existing items in the DB

Disable it because, with this feature enabled, every line in the CSV triggers a DB query to check for an existing item according to your natural key specification (I have read the source code). You probably know that all lines in your CSV are unique contacts anyway.

This will help if your problem is slow DB queries during the import, but it does not really help if the import consumes too much memory.

class ContactCSVModel(CsvModel):

    first_name = CharField()
    last_name = CharField()
    company = CharField()
    mobile = CharField()
    group = DjangoModelField(Group)
    contact_owner = DjangoModelField(User)


    class Meta:
        delimiter = "^"
        dbModel = Contact

Step 3 - Import equally sized chunks of CSV

Use the CSVModel and enable interaction with the Contact model, but provide smaller iterables to ContactCSVModel.import_data(). I set the chunk size to 500; change it to your needs. The code sample below is there to give you the idea - you need to adapt it a bit to fit into your existing code. This will help if memory consumption is the problem.

import csv

# Use the same delimiter as ContactCSVModel.Meta so rows are split correctly
reader = csv.reader(open(self.filepath, 'rb'), delimiter='^')

def gen_chunks(reader, chunksize=100):
    """
    Chunk generator. Take a CSV `reader` and yield
    `chunksize` sized slices.
    """
    chunk = []
    for i, line in enumerate(reader):
        if i % chunksize == 0 and i > 0:
            yield chunk
            del chunk[:]
        chunk.append(line)
    yield chunk

for chunk in gen_chunks(reader, chunksize=500):
    ContactCSVModel.import_data(data=chunk, extra_fields=[
        {'value': self.group_id, 'position': 5},
        {'value': self.uploaded_by.id, 'position': 6}])
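
Folded into your existing method, it could look roughly like this (a sketch reusing gen_chunks from above; whether import_data accepts rows that csv.reader has already split depends on your django-adaptors version, so verify on a small file first):

def process(self):
    self.date_start_processing = timezone.now()
    try:
        # Feed the importer 500 lines at a time instead of the whole file
        reader = csv.reader(open(self.filepath, 'rb'), delimiter='^')
        for chunk in gen_chunks(reader, chunksize=500):
            ContactCSVModel.import_data(data=chunk, extra_fields=[
                {'value': self.group_id, 'position': 5},
                {'value': self.uploaded_by.id, 'position': 6}])
        self._mark_processed(self.num_records)
    except Exception as e:
        self._mark_failed(unicode(e))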

Step 4 - Target large memory consumption and slow operation

Because django-adaptors holds all Contact model instances in memory during the import, and is slow because it issues many single commits instead of one bulk insert operation, it is not well suited for larger files.

You are somewhat tied to django-adaptors - you can't switch to bulk inserts while you rely on this Django package. Check the memory consumption under Linux with top or htop, and on Windows with Task Manager. If the process eats too much memory and the OS starts swapping, switch to another Django add-on with more efficient memory usage and bulk inserts as an option - there are plenty of them for CSV imports.

Another hint is to use the csv module for reading and your knowledge of Django models for interacting with the database. This is not really a challenge for you - just try it with isolated tasks of your big picture and put them together once they work - good luck.
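
For illustration, here is a rough sketch of that plain csv + ORM route (this is not django-adaptors code; the column order and the group_id / contact_owner_id field names on Contact are assumptions based on your CSV model, and this path skips the update-existing behaviour that Meta.update gave you):

import csv
from django.utils import timezone

def process(self):
    self.date_start_processing = timezone.now()
    try:
        contacts = []
        with open(self.filepath, 'rb') as csvfile:
            for row in csv.reader(csvfile, delimiter='^'):
                contacts.append(Contact(
                    first_name=row[0],
                    last_name=row[1],
                    company=row[2],
                    mobile=row[3],
                    group_id=self.group_id,                # assumes a FK named "group"
                    contact_owner_id=self.uploaded_by.id,  # assumes a FK named "contact_owner"
                ))
                if len(contacts) >= 500:
                    Contact.objects.bulk_create(contacts)  # one INSERT per batch
                    contacts = []
        if contacts:
            Contact.objects.bulk_create(contacts)          # remaining partial batch
        self._mark_processed(self.num_records)
    except Exception as e:
        self._mark_failed(unicode(e))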

Sascha Gottfried
  • wow, thanks for this, a lot to process so I'll take a look. really well explained again thanks – Prometheus Apr 19 '13 at 08:36
  • Indeed, django-adaptors is holding the whole file in memory because that's not using a generator to read lines. I planned to have a look at that soonish. – trez May 08 '13 at 06:53

The first thing to try is to pass an iterable to the import_data function:

ContactCSVModel.import_data(open(self.filepath), extra_fields=[
                {'value': self.group_id, 'position': 5},
                {'value': self.uploaded_by.id, 'position': 6}])

The second thing to try is to use import_from_filename:

ContactCSVModel.import_from_filename(self.filepath, extra_fields=[
                {'value': self.group_id, 'position': 5},
                {'value': self.uploaded_by.id, 'position': 6}])

If this doesn't help, try to figure out where it is hanging. You can do it manually by reducing the size of your csv file, or you can put a mock on csv.reader, or you can mock CsvImporter.process_line and, instead of processing lines, print them out to see where it stops. Let me know if you need help with mocking.
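
For example, you could wrap process_line so it prints each line before delegating to the real implementation - that shows exactly where the import stalls. A sketch (the import path and signature of CsvImporter.process_line are assumptions about the installed django-adaptors version):

from mock import patch                    # `unittest.mock` on Python 3
from adaptor.model import CsvImporter     # assumed location of CsvImporter

_original_process_line = CsvImporter.process_line

def traced_process_line(self, *args, **kwargs):
    print args                            # shows which line is being processed
    return _original_process_line(self, *args, **kwargs)

with patch.object(CsvImporter, 'process_line', traced_process_line):
    ContactCSVModel.import_data(open(self.filepath), extra_fields=[
        {'value': self.group_id, 'position': 5},
        {'value': self.uploaded_by.id, 'position': 6}])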

Also, this issue can be related.

alecxe

I would first check there are no data errors in the CSV, e.g. a column with erroneous escape characters or incorrect data types - perhaps the DB cannot accept null values in some columns.

Whilst it is hanging, can you manually check whether the DB is populating? Either through the command-line MySQL prompt or Workbench? If it is, then auto-commit is turned on and you should be able to see which row it is hanging on - then check that record in the CSV.
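
If you would rather stay inside Django than open a MySQL prompt, a quick way to watch whether rows are arriving is a loop in manage.py shell (a sketch; replace yourapp with the app that defines Contact):

# Run in `python manage.py shell` while the import is going
import time
from yourapp.models import Contact   # adjust to your app

while True:
    print Contact.objects.count()     # should keep climbing if rows are being committed
    time.sleep(5)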

However, if auto-commit is turned off (I don't know what Django does by default, or how your DB is configured), then it is possible you are overflowing the transaction buffer. There should be a way to manually flush/commit the transaction in stages to get around this.

Simon Todd
  • Yes, I can see it's committing, but very slowly - like one row every 1-2 mins; on small files this is around 1-2 seconds! – Prometheus Apr 18 '13 at 16:09

I don't know much about django-adaptors, but some things that have helped me when database imports are slow are to use the @transaction.commit_manually() decorator on a method, or to use the Model.objects.bulk_create() method. For you, it looks like the commit_manually approach might help, but bulk_create wouldn't, because you aren't actually controlling the creation process.
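
For reference, a minimal sketch of the commit_manually route (this is the old Django <= 1.5 transaction API the decorator belongs to - later versions replaced it with transaction.atomic(); the 500-row interval and the helper name are illustrative):

from django.db import transaction

@transaction.commit_manually
def import_contacts(rows):
    try:
        for i, row in enumerate(rows):
            Contact.objects.create(**row)
            if (i + 1) % 500 == 0:
                transaction.commit()      # commit in batches instead of per row
        transaction.commit()              # commit the final partial batch
    except Exception:
        transaction.rollback()
        raise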

mattg