
My problem is that Django inserts entries far too slowly (I didn't time it exactly, but it took more than 5 minutes) for 100k rows parsed from a CSV file with Pandas. I parse the CSV file and then save those objects to PostgreSQL through Django. This will run as a daily cron job, with CSV files that differ in most of their entries (some can be duplicates of the previous days', or the owner could be the same).

I haven't tried raw queries, but I don't think they would help much, and I am honestly stuck at this point. Apart from some iteration tweaks and switching from an iterator to a generator, I have not been able to improve the insertion time.
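For context, the insertion presumably happens row by row, roughly like the sketch below (the CSV column names, the file name, and the pre-fetched source, trend_type, and trend_owner objects are assumptions); each iteration is a separate round trip to PostgreSQL, which is where the 5+ minutes go:

import pandas as pd

df = pd.read_csv("trends.csv")  # hypothetical filename

for row in df.itertuples(index=False):
    # per-row lookup-or-insert of the owner
    owner, _ = Owner.objects.get_or_create(
        link=row.link, name=row.name,
        trend_type=trend_type, trend_owner=trend_owner,
    )
    # one INSERT statement per row
    TrendData.objects.create(
        owner=owner, views=row.views, views_u=row.views_u,
        likes=row.likes, shares=row.shares,
        interaction_rate=row.interaction_rate, mean_age=row.mean_age,
        source=source, date_trend=row.date_trend,
    )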

class TrendType(models.Model):
    """ Описывает тип отчета (посты, видео, субъекты)"""
    TREND_TYPE = Choices('video', 'posts', 'owners') ## << mnemonic

    title = models.CharField(max_length=50)
    mnemonic = models.CharField(choices=TREND_TYPE, max_length=30)


class TrendSource(models.Model):
    """ Источник отчета (файла) """
    trend_type = models.ForeignKey(TrendType, on_delete=models.CASCADE)
    load_date = models.DateTimeField()
    filename = models.CharField(max_length=100)


class TrendOwner(models.Model):
    """ Владелец данных (группа, юзер, и т.п.)"""
    TREND_OWNERS = Choices('group', 'user', '_')

    title = models.CharField(max_length=50)
    mnemonic = models.CharField(choices=TREND_OWNERS, max_length=30)


class Owner(models.Model):
    """ Данные о владельце """
    link = models.CharField(max_length=255)
    name = models.CharField(max_length=255)
    trend_type = models.ForeignKey(TrendType, on_delete=models.CASCADE)
    trend_owner = models.ForeignKey(TrendOwner, on_delete=models.CASCADE)


class TrendData(models.Model):
    """ Модель упаковка всех данных """
    owner = models.ForeignKey(Owner, on_delete=models.CASCADE)
    views = models.IntegerField()
    views_u = models.IntegerField()
    likes = models.IntegerField()
    shares = models.IntegerField()
    interaction_rate = models.FloatField()
    mean_age = models.IntegerField()
    source = models.ForeignKey(TrendSource, on_delete=models.CASCADE)
    date_trend = models.DateTimeField() # << take it as a date

Basically, I would love a good solution for 'fast' insertion into the database. Is that even possible given these models?

  • What about writing directly from Pandas to the database: https://stackoverflow.com/questions/23103962/how-to-write-dataframe-to-postgres-table ? Writing data through an ORM (the Django query API in this case) is always slow. Pandas is optimized for this kind of operation. Search for `ETL` (extract, transform, load); see the sketch after these comments. – dani herrera Apr 06 '19 at 16:58
  • @daniherrera I will definitely try this one out! – nexla Apr 06 '19 at 17:06
  • OK, let me know if it works, and post it as an answer to help other Stack Overflow users. – dani herrera Apr 06 '19 at 17:41
  • @daniherrera In the end we went with plain raw queries, but I have tested the Pandas ETL approach and it honestly works much better than the Django ORM; it is indeed a good method. – nexla Apr 09 '19 at 10:03
  • What about bulk insert (Pavel Kovtun's answer)? – dani herrera Apr 09 '19 at 11:15
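A minimal sketch of that Pandas-to-Postgres path, assuming a SQLAlchemy connection string and assuming the DataFrame columns already match the Django-generated table (including the owner_id and source_id foreign-key columns):

import pandas as pd
from sqlalchemy import create_engine

# Connection string and table name are assumptions; the table name must
# match the one Django generated (app label + lowercased model name).
engine = create_engine("postgresql://user:password@localhost:5432/mydb")

df = pd.read_csv("trends.csv")
df.to_sql("myapp_trenddata", engine,
          if_exists="append", index=False,
          chunksize=10_000,   # send rows in chunks
          method="multi")     # multi-row INSERT statements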

2 Answers

0
  1. Maybe you don't need an ORM here? You could implement a simple wrapper around plain SQL queries.
  2. Use bulk reads/writes: bulk_create() in the Django ORM, or the equivalent in your wrapper (see the sketch after the link below).

Check https://docs.djangoproject.com/en/2.2/topics/db/optimization/
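A sketch of point 2 using the question's models, assuming the CSV column names and a pre-fetched source object; the key change is batching 100k INSERTs into a handful of multi-row statements:

import pandas as pd

df = pd.read_csv("trends.csv")  # hypothetical filename

# Resolve owners once up front (one SELECT) instead of once per row.
owners = {o.link: o for o in Owner.objects.all()}

rows = [
    TrendData(
        owner=owners[r.link],
        views=r.views, views_u=r.views_u, likes=r.likes,
        shares=r.shares, interaction_rate=r.interaction_rate,
        mean_age=r.mean_age, source=source, date_trend=r.date_trend,
    )
    for r in df.itertuples(index=False)
]
# One multi-row INSERT per 5000 objects instead of one INSERT per row.
TrendData.objects.bulk_create(rows, batch_size=5000)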

Pavel Kovtun

The problem is not with Django but rather with PostgreSQL itself. My suggestion would be to change your backend. PostgreSQL is good at UPDATEs, but there are databases better suited to heavy INSERT loads; compare PostgreSQL vs TimescaleDB. However, I don't think there is a Django ORM backend for TimescaleDB.

My suggestion would be to use Redis. Its primary use is as an in-memory cache, but you can configure it to persist your data too. There is also a Python ORM for Redis called ROM; a batched-write sketch follows below.
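A minimal sketch of batched writes using plain redis-py rather than the ROM ORM (the key layout, the stringifying of field values, and the file name are all assumptions):

import pandas as pd
import redis

r = redis.Redis(host="localhost", port=6379)

df = pd.read_csv("trends.csv")
pipe = r.pipeline(transaction=False)  # batch commands, no MULTI/EXEC
for i, row in enumerate(df.itertuples(index=False)):
    # One Redis hash per trend record; stringify values because Redis
    # stores only flat strings/numbers.
    fields = {k: str(v) for k, v in row._asdict().items()}
    pipe.hset(f"trend:{i}", mapping=fields)
    if i % 10_000 == 9_999:
        pipe.execute()  # flush in batches to bound client memory
pipe.execute()  # flush the remainder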

Mazel Tov