
My issue refers to this old post ("SQLite transaction for CSV importing") about importing a data set into SQLite in chunks using a transaction:

```python
import csv, sqlite3, time

def chunks(data, rows=10000):
    for i in range(0, len(data), rows):
        yield data[i:i+rows]

if __name__ == "__main__":
    t = time.time()

    con = sqlite3.connect('test.db')
    cur = con.cursor()

    cur.execute("DROP TABLE IF EXISTS sensor;")
    cur.execute("CREATE TABLE sensor(key INT, reading REAL);")

    filename = 'dummy.csv'
    reader = csv.reader(open(filename, "r"))
    divdata = chunks(list(reader))

    for chunk in divdata:
        cur.execute('BEGIN TRANSACTION')
        for col1, col2 in chunk:
            cur.execute('INSERT INTO sensor (key, reading) VALUES (?, ?)', (col1, col2))
        con.commit()
```

Since csv.reader reads the whole file into memory and the data is only then split up by the chunks function, I am looking for a solution that reads the file in chunks (of, say, 10k rows) and inserts each chunk into the SQLite table as above until the whole file has been inserted. How do I modify the above?

  • Why do you need to write it in chunks? Also are you aware that [sqlite has csv import already?](https://www.sqlite.org/cli.html#csv) – Colonel Thirty Two Jul 08 '15 at 14:06
  • Yea, I know that Sqlite can import csv files. In my case I have a very huge data set (>10 Gb) that can be best processed by Sqlite only in chunks. The code above suits my case, if only I knew a way not to make Python read the whole file in memory. – boh Jul 08 '15 at 15:37
  • `list(reader)` reads the entire thing into a list. Don't do that. Instead, iterate `for i, row in enumerate(reader)` and if `i == chunksize` commit and open a new transaction. – Colonel Thirty Two Jul 08 '15 at 15:39
  • I'm having trouble inserting the chunks into the created table at each iteration: ... `chunksize = 1024 with open("C:/...", "rt") as f: for i, row in enumerate(f): if i == chunksize: cur.execute('BEGIN TRANSACTION') # how do I insert the chunks into the table?` – boh Jul 09 '15 at 10:51

1 Answer

```python
divdata = chunks(list(reader))
```

list(reader) will iterate through the entire CSV file and store the results in a list. You don't want to do that, because the CSV file is huge.

Also, you don't want to run `cur.execute("BEGIN TRANSACTION;")`; the `sqlite3` module does this for you.
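As a quick check (a minimal sketch, assuming Python 3's built-in `sqlite3` module with its default `isolation_level`; the in-memory database and table here are just for illustration), you can see the implicit transaction the module opens before a DML statement:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE sensor(key INT, reading REAL);")

cur.execute("INSERT INTO sensor VALUES (?, ?)", (1, 0.5))
print(con.in_transaction)  # True: sqlite3 issued an implicit BEGIN before the INSERT

con.commit()
print(con.in_transaction)  # False: the commit closed that transaction
```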

Keep a counter while you're iterating through the CSV reader. Check the counter and use it to occasionally commit.

```python
for counter, row in enumerate(reader):
    if counter % CHUNK_SIZE == 0:
        con.commit()
    cur.execute("INSERT INTO ...")

con.commit()
```

(Note: I don't think it's a good idea to chunk transactions like this. Maybe it helps, but you'd have to profile it first.)
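Putting the two points together, here is a minimal sketch of a streaming version of the question's script. It assumes the same `test.db`, `sensor` table and a two-column `dummy.csv` as in the question; `CHUNK_SIZE` is an arbitrary illustrative value:

```python
import csv
import sqlite3

CHUNK_SIZE = 10000  # rows per commit; tune for your workload

con = sqlite3.connect('test.db')
cur = con.cursor()
cur.execute("DROP TABLE IF EXISTS sensor;")
cur.execute("CREATE TABLE sensor(key INT, reading REAL);")

with open('dummy.csv', 'r', newline='') as f:
    reader = csv.reader(f)  # iterates lazily, one row at a time
    for counter, (col1, col2) in enumerate(reader):
        cur.execute("INSERT INTO sensor (key, reading) VALUES (?, ?)", (col1, col2))
        # commit every CHUNK_SIZE rows so only one row is ever held in memory
        if (counter + 1) % CHUNK_SIZE == 0:
            con.commit()

con.commit()  # commit the final partial chunk
con.close()
```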

  • a couple of questions: 1. can you first explain the note to that snippet of code and what profiling is?; 2. how long would it take for the code above to run completely on a file of 11 Gb (after 4 hours I had to kill it)? – boh Jul 12 '15 at 11:51