8

I'm using Python in order to save the data row by row... but this is extremely slow!

The CSV contains 70million lines, and with my script I can just store 1thousand a second.


This is what my script looks like

reader = csv.reader(open('test_results.csv', 'r'))
for row in reader:
    TestResult(type=row[0], name=row[1], result=row[2]).save()

I reckon that for testing I might have to consider MySQL or PostgreSQL.

Any idea or tips? This is the first time I deal with such massive volumes of data. :)

RadiantHex
  • 24,907
  • 47
  • 148
  • 244
  • sqlite is slower than mysql. mongoDB usually is faster for insertion, even though it has other limitations – nonopolarity Oct 18 '10 at 11:47
  • What is `TestResult`? You haven't shown any actual calls to sqlite3, so it's a bit difficult to comment on the code as it stands. What is the baseline reading time if you skip the `save()` and then if you replace the TestResult construction with `pass`? – Marcelo Cantos Oct 18 '10 at 11:49
  • Testresult is a Django ORM's Model subclass – RadiantHex Oct 18 '10 at 11:53
  • What does TestResult.save() do? Does it open a text file, save a row, and close the file again? Or something else obviously slow? – hughdbrown Oct 18 '10 at 15:07

2 Answers2

4

For MySQL imports:

mysqlimport [options] db_name textfile1 [textfile2 ...]

For SQLite3 imports:

ref How to import load a .sql or .csv file into SQLite?

Community
  • 1
  • 1
Andrew Sledge
  • 10,163
  • 2
  • 29
  • 30
3

I don't know if this will make a big enough difference, but since you're dealing with the Django ORM I can suggest the following:

  1. Ensure that DEBUG is False in your Django settings file, since otherwise you're storing every single query in memory.
  2. Put your logic in a main function, and wrap that in the django.db.transactions.commit_on_success decorator. That will prevent each row from needing it's own transaction, which will substantially speed up the process.
  3. If you know that all of the rows in the file do not exist in the database, add force_insert=True to your call to the save() method. This will halve the number of calls to sqlite needed.

These suggestions will probably make an even bigger difference if you do find yourself using a client-server DBMS.