
I have been using pandas in Python, and I usually write a DataFrame to my database table as shown below. I am now migrating to Django; how can I write the same DataFrame to a table through a model called MyModel? Assistance is really appreciated.

    # Original pandas code
    from sqlalchemy import create_engine

    engine = create_engine('postgresql://myuser:mypassword@localhost:5432/mydb', echo=False)
    mydataframe.to_sql('mytable', engine, if_exists='append', index=True)

3 Answers


I'm just going through the same exercise at the moment. The approach I've taken is to create a list of new objects from the DataFrame and then bulk create them:

bulk_create(objs, batch_size=None)

This method inserts the provided list of objects into the database in an efficient manner (generally only 1 query, no matter how many objects there are).

An example might look like this:

# Not able to iterate directly over the DataFrame
df_records = df.to_dict('records')

model_instances = [MyModel(
    field_1=record['field_1'],
    field_2=record['field_2'],
) for record in df_records]

MyModel.objects.bulk_create(model_instances)
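
If the DataFrame is large, you can also pass bulk_create's batch_size argument so the insert is split into several smaller queries. A minimal sketch; the value 1000 is an arbitrary assumption to tune for your database:

# Insert in chunks of 1,000 rows so no single query gets too large;
# the batch size here is an assumed value -- tune it for your setup
MyModel.objects.bulk_create(model_instances, batch_size=1000)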
  • Be aware of limitations, for instance: _"If the model’s primary key is an AutoField it does not retrieve and set the primary key attribute, as save() does, unless the database backend supports it (currently PostgreSQL). "_ – lmblanes May 29 '19 at 06:52
  • But again you are looping `for record in df_records` – Nitin Raturi Jun 05 '22 at 14:33
  • I think the whole point of the question is to avoid the for loop and calling the `MyModel` constructor – Courvoisier Jul 18 '22 at 09:31
  • FYI with Pandas 1.5.2, in my particular case, calling `df.to_dict(orient="records")` and iterating over its output turned out to be faster than calling `df.iterrows()` (~x1.9 slower) or `df.itertuples(index=False)` (~x1.1 slower). I used `%%timeit` for my tests. Of course, that probably depends on the DataFrame shape and content, but that gives you an order of magnitude. – scūriolus Dec 08 '22 at 09:05

Use your own pandas code alongside a Django model that is mapped to the same SQL table

I am not aware of any explicit support for writing a pandas DataFrame to a Django model. However, in a Django app you can still use your own code to read from or write to the database, in addition to using the ORM (e.g. through your Django model).

And given that you most likely have data in the database previously written by pandas' to_sql, you can keep using the same database and the same pandas code, and simply create a Django model that can access that table.

For example, if your pandas code was writing to the SQL table mytable, simply create a model like this:

from django.db import models

class MyModel(models.Model):
    class Meta:
        db_table = 'mytable'  # This tells Django which SQL table to map to
        managed = False       # Use this if the table already exists
                              # and doesn't need to be managed by Django

    field_1 = ...  # declare fields matching the table's columns
    field_2 = ...

Now you can use this model from Django simultaneously with your existing pandas code (possibly in a single Django app).
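
For instance, reads can go through the ORM while writes keep using pandas. A rough sketch, assuming the placeholder fields field_1 and field_2 from the model above exist on the table:

import pandas as pd

# Read rows through the Django ORM, backed by the same 'mytable'
rows = MyModel.objects.filter(field_1='some_value')

# Or pull selected columns straight into a DataFrame
df = pd.DataFrame(list(MyModel.objects.values('field_1', 'field_2')))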

Django database settings

To get the same DB credentials into the pandas SQL functions, simply read the fields from Django settings, e.g.:

from django.conf import settings
from sqlalchemy import create_engine

user = settings.DATABASES['default']['USER']
password = settings.DATABASES['default']['PASSWORD']
database_name = settings.DATABASES['default']['NAME']
# Fall back to the usual PostgreSQL defaults when HOST/PORT are left empty
host = settings.DATABASES['default']['HOST'] or 'localhost'
port = settings.DATABASES['default']['PORT'] or '5432'

database_url = 'postgresql://{user}:{password}@{host}:{port}/{database_name}'.format(
    user=user,
    password=password,
    host=host,
    port=port,
    database_name=database_name,
)

engine = create_engine(database_url, echo=False)
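
With that engine in hand, the pandas call from the question works unchanged, e.g.:

# Same pandas write as before, now using the connection details
# taken from Django's settings
mydataframe.to_sql('mytable', engine, if_exists='append', index=True)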

The alternative is not recommended as it's inefficient

I don't really see a way besides reading the DataFrame row by row, creating a model instance, and saving it, which is really slow. You might get away with some batch insert operation, but why bother, since pandas' to_sql already does that for us. And reading Django querysets into a pandas DataFrame is just inefficient when pandas can do that faster for us too.

# Doing it like this is slow
for index, row in df.iterrows():
    model = MyModel()
    model.field_1 = row['field_1']
    model.save()
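
By contrast, reading the table back into pandas through the same engine is a single call, e.g.:

import pandas as pd

# Let pandas read the table directly instead of building
# a DataFrame out of ORM objects row by row
df = pd.read_sql('SELECT * FROM mytable', engine)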
  • Thank you Bakkal, yes I have been using pandas in Python directly and through Qt. I was getting stuck at setting up the create_engine() values. I will try to set the values as picked from the settings file. – Avagut Dec 23 '15 at 09:44
  • Awesome! Exactly what I needed! Thank you so much! – Avagut Dec 23 '15 at 10:25
  • I think there is a typo, database+name, it should be database_name – Shh Mar 09 '16 at 05:59

My solution using pickle and optionally zlib for compression

import pickle
# optional, for compression
# import zlib

from django.db import models

class SuperModel(models.Model):
    DFToStore = models.BinaryField(default=None, null=True, blank=True)

    def save(self, *args, **kwargs):
        # Pickle the DataFrame into bytes before it hits the database
        if not isinstance(self.DFToStore, bytes):
            self.DFToStore = pickle.dumps(self.DFToStore)
            # optional, with compression:
            # self.DFToStore = zlib.compress(pickle.dumps(self.DFToStore))
        super(SuperModel, self).save(*args, **kwargs)

    def get_DFToStore(self):
        # Unpickle the stored bytes back into a DataFrame when reading
        if isinstance(self.DFToStore, bytes):
            return pickle.loads(self.DFToStore)
            # optional, with compression:
            # return pickle.loads(zlib.decompress(self.DFToStore))
        return self.DFToStore
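
A minimal usage sketch, assuming the SuperModel above; the DataFrame contents are made up for illustration:

import pandas as pd

# save() pickles the DataFrame into the BinaryField transparently
obj = SuperModel(DFToStore=pd.DataFrame({'a': [1, 2], 'b': [3, 4]}))
obj.save()

# get_DFToStore() unpickles the stored bytes back into a DataFrame
restored = SuperModel.objects.get(pk=obj.pk).get_DFToStore()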