  • I have a CSV file listing items that I need to store within the database.

  • I need to check which items are not already stored, and save the missing ones to the database.

  • There are 2-5 million rows.


The model is Django's User model.

I have a CSV file of this form:

Item_ID, Surname, Policy_number, Sex, Title, Start_date

This is the code:

import csv

reader = csv.reader(open('items.csv', newline=''))

for index, row in enumerate(reader):
    # row[2] is the policy number, which is used as the username
    if User.objects.filter(username=row[2]).count():
        continue
    try:
        user = User(username=row[2], last_name=row[1], password='*')
        user.save()
    except Exception as e:
        print(e)
    del user
    del row
    del index

Any method you would recommend?

RadiantHex
  • Why the `del`? That's handled automatically. There's rarely a good reason for that. What's your model definition? Specifically, what index options have you specified on the `username` attribute? – S.Lott Dec 13 '10 at 17:58

4 Answers


Depends on the situation. If the CSV data can be converted to a model, something like this should do:

  • load the csv data
  • for each row:
    • check if a model for it exists
    • if not:
      • create a new model based on the data and save it.

Edit:
I think a batch check for existence will be hard. A batch save of the models would be quicker, but depending on the model complexity I think it's safer to just do it per model.
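
That said, one way to approximate a batched existence check (my sketch, not part of the original answer; the file name, column index, and chunk size are assumptions taken from the question) is to look up a whole chunk of usernames at once with username__in:

import csv
from itertools import islice

from django.contrib.auth.models import User

def missing_rows(path, chunk_size=1000):
    with open(path, newline='') as f:
        reader = csv.reader(f)
        while True:
            # read the next chunk of rows from the CSV
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            # one query per chunk: which of these usernames already exist?
            wanted = {row[2] for row in chunk}
            existing = set(
                User.objects.filter(username__in=wanted)
                            .values_list('username', flat=True)
            )
            for row in chunk:
                if row[2] not in existing:
                    yield row

Each missing row can then be saved individually, as suggested above, or collected and saved in batches.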

Lucas Moeskops
  • Thanks for reply :) I have already done that: memory usage 1.1Gb after 600k rows and quite slow too :) – RadiantHex Dec 13 '10 at 14:48
  • @RadiantHex: If it's slow, you have to ask a different question. You'll need to provide the actual model and the actual code, and you'll need to be sure your lookups are simple "exact" match and you have indexes. Accept this answer, Search for "Django slow query", and -- if necessary -- open a new question. – S.Lott Dec 13 '10 at 14:57
  • @S.Lott: Thanks! :) I have updated the question with my code! – RadiantHex Dec 13 '10 at 15:22
  • Is it truly slow or does it just take a while because you're just processing a lot of data? There is a difference. – Andrew Sledge Dec 13 '10 at 15:30

Try this. The `count()` is going to be VERY expensive.

for index, row in enumerate(reader):
    try:
        User.objects.get(username=row[2])
    except User.DoesNotExist:
        user = User(username=row[2], last_name=row[1], password='*')
        user.save()
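
As an aside (my addition, not part of this answer): on Django 1.2 and later, QuerySet.exists() issues a LIMIT 1 query instead of fetching the whole row, which is slightly cheaper than letting get() raise DoesNotExist:

for index, row in enumerate(reader):
    # exists() stops at the first matching row rather than loading it
    if not User.objects.filter(username=row[2]).exists():
        User(username=row[2], last_name=row[1], password='*').save()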
S.Lott

You will want to load the CSV file, then go through each row using the get_or_create method to check whether the object exists; if it doesn't, it will be created for you. If you show us your models.py code we may be able to help you more specifically.
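
For reference, a minimal sketch of that pattern (the file name and column indices are assumptions based on the CSV layout in the question):

import csv

from django.contrib.auth.models import User

reader = csv.reader(open('items.csv', newline=''))

for row in reader:
    # get_or_create returns (object, created); it only inserts a new
    # User when no row with this username exists yet
    user, created = User.objects.get_or_create(
        username=row[2],
        defaults={'last_name': row[1], 'password': '*'},
    )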

TheLifeOfSteve

If the set of existing usernames fits in memory, this might be a good optimisation.

import csv

reader = csv.reader(open('items.csv', newline=''))

# values_list(..., flat=True) gives plain strings; a set makes the
# membership test O(1) per row
usernames = set(User.objects.values_list('username', flat=True))

for index, row in enumerate(reader):
    if row[2] in usernames:
        continue
    User.objects.create(username=row[2], last_name=row[1], password='*')

If there really is a memory problem, you might take a look at, for example, this existing answer: Question about batch save objects in Django
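
For completeness, a rough sketch of what a batched save could look like with bulk_create (added in Django 1.4, so newer than this answer; it reuses the reader and usernames from the snippet above, and the chunk size of 500 is arbitrary):

new_users = []
for row in reader:
    if row[2] in usernames:
        continue
    new_users.append(User(username=row[2], last_name=row[1], password='*'))
    if len(new_users) >= 500:
        # flush every 500 new rows so the list never grows unbounded
        User.objects.bulk_create(new_users)
        new_users = []
if new_users:
    User.objects.bulk_create(new_users)

Note that bulk_create bypasses the model's save() method and signals, which is fine for plain inserts like this but worth keeping in mind.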

Lucas Moeskops