26

I'm trying to bulk insert into a MySQL db for a very large dataset and would love to use django's bulk_create while ignoring duplicate errors.

Sample model:

class MyModel(models.Model):
    my_id=models.IntegerField(primary_key=True)
    start_time = models.DateTimeField()
    duration = models.IntegerField()
    ......
    description = models.CharField(max_length=250)

so far I have the following code (generic for all my models, I pass in a Model_instance() and [list of bulk_create objects]):

def insert_many(model, my_objects):
    # list of ids where pk is unique
    in_db_ids = model.__class__.objects.values_list(model.__class__._meta.pk.name)
    if not in_db_ids:
        # nothing exists, save time and bulk_create
        model.__class__.objects.bulk_create(my_objects)
    else:
        in_db_ids_list = [elem[0] for elem in in_db_ids]

        to_insert=[]
        for elem in my_objects:
            if not elem.pk in in_db_ids_list:
                to_insert.append(elem)
        if to_insert:
            model.__class__.objects.bulk_create(to_insert)

Is there a way in django of doing this in order to avoid duplicates? mimicking MySQL's insert ignore would be great. If I simply use bulk_create (very fast), I get an error if there's a primary key duplicate and the insertion stops.

tr33hous
  • 1,622
  • 1
  • 15
  • 26
  • 2
    Have a look at [Django bulk_create with ignore rows that cause IntegrityError?][1] or [MySQL INSERT … ON DUPLICATE KEY UPDATE with django 1.4 for bulk insert][2] [1]: http://stackoverflow.com/questions/12451053/django-bulk-create-with-ignore-rows-that-cause-integrityerror [2]: http://stackoverflow.com/questions/12916880/mysql-insert-on-duplicate-key-update-with-django-1-4-for-bulk-insert – Roman Labunsky Aug 20 '13 at 12:44

2 Answers2

19

The ignore_conflicts parameter was added into bulk_create(Django 2.2)

and you can also find it in https://github.com/django/django/search?q=ignore_conflicts&unscoped_q=ignore_conflicts

Lord Elrond
  • 13,430
  • 7
  • 40
  • 80
gaozhidf
  • 2,621
  • 1
  • 22
  • 17
6

This function will do it.
Note: this will work only if you have unique pk and don't have anything else unique.

def insert_many(model, my_objects):
    # list of ids where pk is unique
    in_db_ids = model.__class__.objects.values_list(model.__class__._meta.pk.name)
    if not in_db_ids:
        # nothing exists, save time and bulk_create
        model.__class__.objects.bulk_create(my_objects)
    else:
        in_db_ids_list = [elem[0] for elem in in_db_ids]

        to_insert = []
        for elem in my_objects:
            if elem.pk not in in_db_ids_list and elem.pk not in to_insert:
                to_insert.append(elem)
        if to_insert:
            model.__class__.objects.bulk_create(to_insert)

How to use insert_many(MyModel(), list_of_myModels_defined_but_not_saved)

Artem Bernatskyi
  • 4,185
  • 2
  • 26
  • 35
Will Farley
  • 670
  • 1
  • 8
  • 19
  • 3
    This solution is not concurrency safe. It calculates the non-inserted rows in Python code and inserts them, but if the database is changed and one of those rows is added after the values_list() call, the bulk_create() will still raise an integrity error. – sirdodger Oct 19 '19 at 06:39