
I'm trying to create a (single) database file (that will be regularly updated, occasionally partially recreated, and occasionally queried) that is over 200 GB, so relatively large in my view. There are about 16k tables, ranging in size from a few KB to ~1 GB, each with 2-21 columns. The longest table has nearly 15 million rows.

The script I wrote goes through the input files one by one, doing a bunch of processing and regex to get usable data. It regularly sends a batch (0.5-1 GB) to be written to sqlite3, with one separate executemany statement for each table that data is inserted into. There are no commit or CREATE TABLE statements etc. in between these execute statements, so I believe it all falls under a single transaction.

Initially the script worked fast enough for my purposes, but it slows down dramatically over time as it nears completion, which is unfortunate given that I will need to slow it down further anyway to keep memory use manageable on my laptop.

I did some quick benchmarking, comparing inserting identical sample data into an empty database versus inserting into the 200 GB database. The latter test was ~3 times slower to execute the insert statements (the relative slowdown of the commit was even worse, but in absolute terms it's insignificant); aside from that there was no significant difference between the two runs.
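A simplified sketch of the kind of timing comparison I mean (the bench table, the sample data, and the file names here are placeholders, not my real schema):

import sqlite3, time

def time_batch(db_path, rows):
    # time one executemany batch and the following commit against db_path
    conn = sqlite3.connect(db_path)
    c = conn.cursor()
    c.execute('CREATE TABLE IF NOT EXISTS "bench" (TIMESTAMP TEXT, VALUE NUMERIC)')
    start = time.perf_counter()
    c.executemany('INSERT INTO "bench" VALUES (?,?)', rows)
    insert_time = time.perf_counter() - start
    start = time.perf_counter()
    conn.commit()
    commit_time = time.perf_counter() - start
    conn.close()
    return insert_time, commit_time

sample = [(str(i), i * 0.5) for i in range(100000)]
print(time_batch('empty.db', sample))     # fresh file
print(time_batch('existing.db', sample))  # the ~200 GB database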

When I researched this topic before, it mostly returned results about indexes slowing down inserts on large tables. The consensus seemed to be that inserts on tables without an index should stay at more or less the same speed regardless of size; since I don't need to run numerous queries against this database, I didn't make any indexes. I even double-checked and ran a check for indexes, which, if I have it right, should exclude that as a cause:

c.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
print(c.fetchone())  # returned None
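For completeness, a check along these lines should also catch any automatic indexes SQLite creates for PRIMARY KEY/UNIQUE constraints (this is a sketch rather than what I originally ran):

# list every index SQLite knows about, including automatic ones
c.execute("SELECT name, tbl_name FROM sqlite_master WHERE type = 'index'")
print(c.fetchall())  # [] here

# the same information per table, via PRAGMA
c.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
for (tbl,) in c.fetchall():
    c.execute('PRAGMA index_list("{0}")'.format(tbl))
    assert c.fetchall() == []  # no table has any index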

The other issue that cropped up was transactions, but I don't see how that could be the problem when the same script is writing the same data and the slowdown only appears against the large database.
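If it matters, the transaction boundaries could be made explicit instead of relying on the sqlite3 module's implicit handling; a minimal sketch of that idea, where batches stands in for my 0.5-1 GB chunks (it is not a variable in my actual script):

conn = sqlite3.connect(db_path, isolation_level=None)  # autocommit mode: no implicit transactions
c = conn.cursor()

for batch in batches:  # each batch is a list of (table_name, placeholder_string, rows)
    c.execute('BEGIN')
    for table_name, placeholders, rows in batch:
        c.executemany('INSERT INTO "{0}" VALUES {1}'.format(table_name, placeholders), rows)
    c.execute('COMMIT')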

abbreviated relevant code:

import sqlite3
import functools

# process predefined objects and files, retrieve data in a batch -
# all fine, no slowdown on the full database

conn = sqlite3.connect(db_path)
c = conn.cursor()

# list of tuples: (table name "subject-item", subject, item)
table_breakdown = [(tup[0] + '-' + tup[1], tup[0], tup[1]) for tup in all_tup]

# creates a new table if needed for new subject/item combinations
targeted_create_tables = functools.partial(create_tables, c)
list(map(targeted_create_tables, table_breakdown))  # no slowdown on the full database

# inserts data for a specific subject/item combination
targeted_insert_data = functools.partial(insert_data, c)
list(map(targeted_insert_data, table_breakdown))  # (3+)x slower

conn.commit()  # significant relative slowdown, but insignificant in absolute terms
conn.close()

and relevant insert function:

def insert_data(c, tup):
    global collector   # nested dict of lists of data tuples for each subject/item combo
    global sql_length  # predefined dict translating an item into the right-length "(?,?,...)" string
    tbl_name, subject, item = tup
    subject_data = collector[subject][item]
    if subject_data:
        statement = 'INSERT INTO "{0}" VALUES {1}'.format(tbl_name, sql_length[item])
        c.executemany(statement, subject_data)  # massively slower: about 80% of inserts are > twice as slow
        collector[subject][item] = []  # clear the batch for this combo

EDIT: table creation function, per CL.'s request. I'm aware that this is inefficient (it takes roughly the same time to check whether a table name exists this way as it does to create the table), but it's not a significant contributor to the slowdown.

def create_tables(c, tup):
    global collector
    global title  # dict of column schemes keyed by item
    tbl_name, subject, item = tup
    subject_data = collector[subject][item]

    if subject_data:
        c.execute("SELECT * FROM sqlite_master WHERE name = '{0}' AND type = 'table'".format(tbl_name))
        if c.fetchone() is None:
            c.execute('CREATE TABLE "{0}" {1}'.format(tbl_name, title[item]))
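(For what it's worth, the existence check could presumably be folded into the statement itself with CREATE TABLE IF NOT EXISTS; a sketch of that simplification:)

def create_tables(c, tup):
    # same behaviour as above, without the separate sqlite_master lookup
    global collector
    global title
    tbl_name, subject, item = tup
    if collector[subject][item]:
        c.execute('CREATE TABLE IF NOT EXISTS "{0}" {1}'.format(tbl_name, title[item]))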

There are, all told, 65 different column schemes in the title dict, but this is an example of what they look like:

title['WINDFOR'] = '(TIMESTAMP TEXT, SP INTEGER, SD TEXT, PUBLISHED TEXT, WIND_CAP NUMERIC, WIND_FOR NUMERIC)'

Does anyone have any ideas about where to look or what could cause this issue? I apologize if I've left out important information or missed something horribly basic; I've come into this topic area completely cold.

Eladriol
  • Show the table definition. – CL. Nov 11 '15 at 20:49
  • What are you planning on doing with a sqlite table with 15 million rows and no index? It's going to be basically impossible to query at all. – Falmarri Nov 11 '15 at 20:51
  • Sorry @CL, could you expand on what you need/mean by table definition? – Eladriol Nov 11 '15 at 20:52
  • @Falmarri Absolutely impossible to query, or just pretty slow? I'm not going to be querying most tables (especially that one; it's pretty much an "it's there if I need it" table) regularly or frequently, and I was under the impression that it was faster to not use indexes and then add them after doing the inserts – Eladriol Nov 11 '15 at 20:56
  • The CREATE TABLE statement. – CL. Nov 11 '15 at 20:58
  • @Eladriol I can't imagine that it will be possible to query unless you're just iterating over the entire table in the order that it's stored. Yes if you add indexes after the insertions you'll be able to query it. But 200 gigs is still very, very big for a sqlite db. Why are you using sqlite and not a full db server? – Falmarri Nov 11 '15 at 21:00
  • @CL I've added the function; there are 65-ish different schemes, but I've given an example of one and they all look like that (with differing lengths and data types) – Eladriol Nov 11 '15 at 21:12
  • @Falmarri That specific table is a very simple one; it's just a timestamp/value pair every 15 seconds. Because I was told to use SQLite3; before that I was using a file system and CSVs. I really came into this cold, so I have no idea about best practice. I had heard, though, that SQLite3 could handle multi-terabyte databases. – Eladriol Nov 11 '15 at 21:15

2 Answers


Appending rows to the end of a table is the fastest way to insert data (and you are not playing games with the rowid, so you are indeed appending to the end).

However, you are not using a single table but 16k tables, so the overhead for managing the table structure is multiplied.

Try increasing the cache size. But the most promising change would be to use fewer tables.
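For example, the cache size can be set right after connecting; a sketch, with the value a placeholder to be measured against your own workload (a negative value is interpreted as KiB):

conn = sqlite3.connect(db_path)
conn.execute('PRAGMA cache_size = -200000')  # roughly 200 MB of page cache; negative means KiB
# trading durability for speed is also possible, at the risk of a corrupt file after a crash:
# conn.execute('PRAGMA synchronous = OFF')

Using fewer tables would be a schema change rather than a code tweak, e.g. collapsing the per-subject/item tables into one table with subject and item columns.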

CL.
  • thanks for the suggestions, I'll try them out with some iterations over the coming days and see if it helps. Do you know of any guidelines on how to set the cache size, i.e. is there a rule of thumb related to system memory/table size etc.? – Eladriol Nov 12 '15 at 00:03
  • The guideline is "measure". – CL. Nov 12 '15 at 08:28
  • initial tests seem to suggest upping the cache size does significantly help the speed of the writing (roughly a factor of 2 going from the default to 10,000). Annoyingly I didn't implement any logging when I first ran the script in full to create the database, so it's difficult to fully benchmark; I'll need to run some more tests. Are there any downsides to increasing the cache size, apart from increasing SQLite's ability to consume system resources? For example, if I were to increase it to 100,000, would that have negative consequences? – Eladriol Nov 13 '15 at 10:34
  • I should specify that this script is running on an individual work laptop with 4 GB of RAM, not a third-party phone or anything like that – Eladriol Nov 13 '15 at 11:01
  • Increasing SQLite's cache size will implicitly reduce the amount of memory used by the OS for its file cache. – CL. Nov 13 '15 at 11:59
  • Sorry to ask so many questions, but what are the implications of that? If I have it right, that would make the computer less responsive to other applications, but would it also have negative effects on the Python/sqlite3 script hogging the cache, due to some dependence on the OS using its file cache? To put it another way, would it ever be possible to allocate too high a cache limit to SQLite, to the point where it has negative consequences? – Eladriol Nov 13 '15 at 14:35
  • Sorry it took so long to accept; I wasn't able to test for a while due to getting sidetracked by a bug caused by faulty RAM... Anyway, your suggestion of fewer tables worked very well, around 2-3 times faster. It turns out cache size didn't affect things so much, but turning PRAGMA synchronous off was also very helpful (it's not the end of the world if the database gets corrupted and I have to recreate it). Many thanks for your help. – Eladriol Dec 15 '15 at 17:24

It makes sense to me that the time to INSERT increases as a function of the database size. The operating system itself may be slower when opening/closing/writing to larger files. An index could slow things down much more of course, but that doesn't mean that there would be no slowdown at all without an index.

Galax