
I have a large CSV file containing 15 columns and approximately 1 million rows. I want to parse the data into TinyDB. The code I use is below:

import csv
from tinydb import TinyDB

db = TinyDB('db.monitor')
table = db.table('Current')

i=0

datafile = open('newData.csv', 'rb')
data=csv.reader(datafile, delimiter = ';')

for row in data:
    table.insert({'WT_ID': row[0], 'time': row[1], 'MeanCurrent': row[2],
                  'VapourPressure': row[3], 'MeanVoltage': row[4],
                  'Temperature': row[5], 'Humidity': row[6],
                  'BarPressure': row[7], 'RPM': row[8],
                  'WindSector': row[9], 'WindSpeed': row[10],
                  'AirDensity': row[12], 'VoltageDC': row[13],
                  'PowerSec': row[14], 'FurlingAngle': row[15]})
    i=i+1
    print i

However, it really takes forever. I set the i variable to track the progress, and while it runs fast for the first lines, it has now been running for more than an hour and has parsed only about 10,000 lines, at a pace of roughly one row per second.

I couldn't find anything similar, so any help would be appreciated.

Thank you

– tzoukritzou
  • this should probably be moved to http://codereview.stackexchange.com/ - and maybe you should check out `insert_multiple()` – Mike Scotty Dec 23 '16 at 13:59
  • I had the same issue: inserts became really slow (like 1/sec) after 300 inserts. Installing ujson and using `insert_multiple` sped up my inserts by orders of magnitude; not measured exactly, but from about 50 seconds to 1 or 2. So using `insert_multiple` is the key here! – klaas Feb 17 '18 at 22:41
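
For reference, a minimal sketch of the `insert_multiple()` batching suggested in the comments, assuming the same ';'-delimited file and column layout as the code above (the batch size of 10,000 is an arbitrary choice):

import csv
from tinydb import TinyDB

db = TinyDB('db.monitor')
table = db.table('Current')

fields = ['WT_ID', 'time', 'MeanCurrent', 'VapourPressure', 'MeanVoltage',
          'Temperature', 'Humidity', 'BarPressure', 'RPM', 'WindSector',
          'WindSpeed', 'AirDensity', 'VoltageDC', 'PowerSec', 'FurlingAngle']
# the original loop skips column 11, so map the remaining indices explicitly
indices = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15]

with open('newData.csv') as datafile:
    reader = csv.reader(datafile, delimiter=';')
    batch = []
    for row in reader:
        batch.append({name: row[i] for name, i in zip(fields, indices)})
        if len(batch) >= 10000:
            # one insert_multiple() call per batch instead of one insert()
            # per row, so the JSON file is rewritten far less often
            table.insert_multiple(batch)
            batch = []
    if batch:
        table.insert_multiple(batch)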

6 Answers


Is TinyDB the best choice? You seem to need a transactional database, and TinyDB is document-oriented. On top of that, from the docs' "Why not use TinyDB" section:

If you need advanced features or high performance, TinyDB is the wrong database for you

Your process runs really slowly because you are accumulating data in RAM. As a workaround, you could split your CSV into smaller chunks and feed them to your script one at a time. That way, memory can be cleared between iterations.

TinyDB is simply not able to manage this amount of information.
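
A minimal sketch of that splitting step, assuming a plain line-oriented split is good enough for this file (the chunk size and file names are illustrative):

from itertools import islice

CHUNK_ROWS = 100000

with open('newData.csv') as src:
    part = 0
    while True:
        # read the next CHUNK_ROWS lines; islice returns fewer (or none) at EOF
        chunk = list(islice(src, CHUNK_ROWS))
        if not chunk:
            break
        with open('newData_part{}.csv'.format(part), 'w') as dst:
            dst.writelines(chunk)
        part += 1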

– iFlo

I had a similar problem this week; the solution was to use a CachingMiddleware:

import csv
from tinydb import TinyDB
from tinydb.storages import JSONStorage
from tinydb.middlewares import CachingMiddleware

db = TinyDB('db.monitor', storage=CachingMiddleware(JSONStorage))
table = db.table('Current')

i = 0

datafile = open('newData.csv', 'rb')
data = csv.reader(datafile, delimiter=';')

for row in data:
    table.insert({'WT_ID': row[0], 'time': row[1], 'MeanCurrent': row[2],
                  'VapourPressure': row[3], 'MeanVoltage': row[4],
                  'Temperature': row[5], 'Humidity': row[6],
                  'BarPressure': row[7], 'RPM': row[8],
                  'WindSector': row[9], 'WindSpeed': row[10],
                  'AirDensity': row[12], 'VoltageDC': row[13],
                  'PowerSec': row[14], 'FurlingAngle': row[15]})
    i = i + 1
    print i

db.close()  # flush the remaining write cache to disk

– piterdias

I have never used TinyDB, but you could try the two options below:

  1. Pandas to DB

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html

Pandas can read CSV data in bulk, which is much faster than reading the file line by line.

  2. Read the CSV file in chunks instead of line by line, and then create one insert statement per chunk instead of an insert statement for each line (a sketch combining this with option 1 follows below).
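
A minimal sketch combining the two ideas (chunked reading plus one bulk insert per chunk), assuming an SQLite target as in the to_sql documentation linked above; the file name, table name and chunk size are illustrative:

import sqlite3

import pandas as pd

conn = sqlite3.connect('monitor.sqlite')

# read_csv with chunksize yields one DataFrame per chunk instead of the
# whole file at once; add header=None and a names=[...] list if the CSV
# has no header row
for chunk in pd.read_csv('newData.csv', sep=';', chunksize=100000):
    # one bulk INSERT per chunk instead of one statement per line
    chunk.to_sql('Current', conn, if_exists='append', index=False)

conn.close()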

– Shijo

TinyDB is going to be slow, as it loads everything into RAM. I'm not sure of any good alternatives, though you may be interested in this SQLite-backed document store library I wrote a while back:

https://github.com/skorokithakis/goatfish

You can create indexes on some fields, and it turns everything into SQL queries, so performance is pretty decent.

– Stavros Korokithakis

It is much faster to build the database file by hand if you are parsing the data from a CSV file anyway. I would suggest something like the following:

import json

counter = 0
# csv_file: path to the source CSV (tab-delimited here; adjust the split
# character if your file uses ';')
with open(csv_file) as fh, open('test.db', 'w') as db:
    keys = None
    db.write('{"_default": {')
    spacer = ''
    for line in fh:
        if not keys:
            # This is the header row
            keys = line[:-1].replace(" ", "_").split('\t')
        else:
            counter += 1
            # These are the data rows
            values = line[:-1].split('\t')
            db.write('{}"{}":{}'.format(spacer, counter, json.dumps(dict(zip(keys, values)))))
            spacer = ','
        if counter and counter % 1000 == 0:
            print('counter: {:10d}'.format(counter), end='\r')
    db.write('}}')
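
Assuming the layout written above matches TinyDB's default JSON storage (a top-level "_default" table mapping document ids to records), the generated file can then be opened as a regular database, e.g.:

from tinydb import TinyDB

db = TinyDB('test.db')   # the default table now holds the imported rows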

     
– rbf22

Use Pandas to write the JSON file, which will be a lot quicker:

generated_rules.to_json('./Data/rules.json', orient='index')

Then edit the generated JSON file and wrap its contents in a table name (or "_default"), e.g.

Before

{"0": {...}, "1":{...}

After

{rules:{"0": {...}, "1":{...}}

Then just open this JSON file with TinyDB, as in the sketch below.
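
A minimal sketch of this approach with the question's CSV; the file names and the 'rules' table name are illustrative, and read_csv may need header=None plus a names=[...] list if the file has no header row:

import json

import pandas as pd
from tinydb import TinyDB

df = pd.read_csv('newData.csv', sep=';')
df.to_json('rules.json', orient='index')

# wrap the generated records under a table name so TinyDB can open the file
with open('rules.json') as fh:
    records = json.load(fh)

with open('rules_db.json', 'w') as fh:
    json.dump({'rules': records}, fh)

# the wrapped file can now be read back as a TinyDB database
db = TinyDB('rules_db.json')
table = db.table('rules')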

– Sahil Shaikh