I'm trying to create a single database file (regularly updated, occasionally partially recreated, occasionally queried) that is over 200 GB, so relatively large in my view. There are about 16k tables, ranging in size from a few KB to ~1 GB, with 2-21 columns each. The largest table has nearly 15 million rows.
The script I wrote goes through the input files one by one, doing a bunch of processing and regex to extract usable data. It regularly sends a batch (0.5-1 GB) to be written to sqlite3, with a separate executemany statement for each table the data is inserted into. There are no commit or create-table statements between these executemany calls, so I believe it all falls under a single transaction.
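As a sanity check on that belief: Python's sqlite3 connection exposes an in_transaction attribute, so something like this (just a sketch) should confirm the batch really is one open transaction before the commit:

# after the last executemany of a batch, before conn.commit():
print(conn.in_transaction)  # True -> an uncommitted transaction is still open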
Initially the script worked fast enough for my purposes, but it slows down dramatically as it nears completion. That's unfortunate, because in normal use I'll need to slow it down further anyway to keep memory use manageable on my laptop.
I did some quick benchmarking, comparing inserting identical sample data into an empty database versus into the 200 GB database. The latter was ~3 times slower to execute the insert statements (the relative speed of the commit was even worse, but in absolute terms it's insignificant). Aside from that there was no significant difference between the two runs.
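The benchmark was nothing fancier than wall-clock timing around the insert and commit calls; a minimal sketch of how I compared them (db_path, sample_rows, and the table here are placeholders, not my real data):

import sqlite3
import time

conn = sqlite3.connect(db_path)  # run once against an empty file, once against the 200 GB file
c = conn.cursor()

t0 = time.perf_counter()
c.executemany('INSERT INTO "test_table" VALUES (?,?,?)', sample_rows)
t1 = time.perf_counter()
conn.commit()
t2 = time.perf_counter()

print('insert: {:.2f}s  commit: {:.2f}s'.format(t1 - t0, t2 - t1))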
When I researched this topic before, it mostly returned results about indexes slowing down inserts on large tables. The answer seemed to be that inserts on tables without an index should stay at more or less the same speed regardless of size; since I don't need to run numerous queries against this database, I didn't create any indexes. I even double-checked and ran a check for indexes, which, if I have it right, should exclude that as a cause:
c.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
print(c.fetchone())  # returned None
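(A fuller check along the same lines would list every index, including the sqlite_autoindex entries SQLite creates automatically for PRIMARY KEY/UNIQUE constraints; a sketch:)

c.execute("SELECT name, tbl_name FROM sqlite_master WHERE type = 'index'")
for name, tbl in c.fetchall():
    print(name, tbl)  # prints nothing if there are truly no indexes of any kind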
The other issue that cropped up in my research was transactions, but I don't see how that could be a problem only when writing to a large database, given it's the same script and the same data being written.
Abbreviated relevant code:
# process predefined objects and files, retrieve data in a batch -
# all fine, no slowdown on the full database
conn = sqlite3.connect(db_path)
c = conn.cursor()

# list of tuples: (table name "subject-item", subject, item)
table_breakdown = [(tup[0] + '-' + tup[1], tup[0], tup[1]) for tup in all_tup]

targeted_create_tables = functools.partial(create_tables, c)  # creates a new table if needed
                                                              # for new subject/item combos
list(map(targeted_create_tables, table_breakdown))  # no slowdown on the full database

targeted_insert_data = functools.partial(insert_data, c)  # inserts data for a specific
                                                          # subject/item combo
list(map(targeted_insert_data, table_breakdown))  # (3+)x slower

conn.commit()  # significant relative slowdown, but insignificant in absolute terms
conn.close()
And the relevant insert function:
def insert_data(c, tup):
    global collector   # dict of lists of row tuples for each subject/item combo
    global sql_length  # predefined dict translating an item into the
                       # right-length '(?,?,?...)' placeholder string
    tbl_name = tup[0]
    subject = tup[1]
    item = tup[2]
    subject_data = collector[subject][item]
    if subject_data:
        statement = 'INSERT INTO "{0}" VALUES {1}'.format(tbl_name, sql_length[item])
        c.executemany(statement, subject_data)  # massively slower: about 80% of
                                                # inserts are > twice as slow
        collector[subject][item] = []  # clear the batch (rebinding the local
                                       # subject_data name alone wouldn't do it)
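(To make sql_length concrete: for a six-column item the stored string would be '(?,?,?,?,?,?)', so with a made-up subject name the formatted statement comes out as, e.g.:)

INSERT INTO "subject1-WINDFOR" VALUES (?,?,?,?,?,?)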
EDIT: table-create function, per CL's request. I'm aware this is inefficient (checking whether a table name exists this way takes roughly as long as creating the table), but it's not significant to the slowdown.
def create_tables(c, tup):
    global collector
    global title  # dict of column schemes keyed by item
    tbl_name = tup[0]
    subject = tup[1]
    item = tup[2]
    subject_data = collector[subject][item]
    if subject_data:
        c.execute("SELECT name FROM sqlite_master WHERE name = ? AND type = 'table'", (tbl_name,))
        if c.fetchone() is None:
            c.execute('CREATE TABLE "{0}" {1}'.format(tbl_name, title[item]))
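(For what it's worth, SQLite can fold the existence check into the statement itself with an IF NOT EXISTS clause; a sketch of that variant, assuming the same collector/title globals:)

def create_tables(c, tup):
    tbl_name, subject, item = tup
    if collector[subject][item]:
        c.execute('CREATE TABLE IF NOT EXISTS "{0}" {1}'.format(tbl_name, title[item]))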
There are, all told, 65 different column schemes in the title dict, but this is an example of what they look like:

title['WINDFOR'] = '(TIMESTAMP TEXT, SP INTEGER, SD TEXT, PUBLISHED TEXT, WIND_CAP NUMERIC, WIND_FOR NUMERIC)'
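So with a made-up subject for illustration, the resulting statement for that item would be:

CREATE TABLE "subject1-WINDFOR" (TIMESTAMP TEXT, SP INTEGER, SD TEXT, PUBLISHED TEXT, WIND_CAP NUMERIC, WIND_FOR NUMERIC)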
Anyone got any ideas about where to look or what could cause this issue? I apologize if I've left out important information or missed something horribly basic; I've come into this topic area completely cold.