I have an SQLite database of the following format:
---------------------------
| id  | Type0 | Type1 | ...
---------------------------
| id0 | 0-0   | 0-1   | ...
| id1 | 0-1   | 1-1   | ...
---------------------------
- Currently I have O(10^8) rows.
- Hence each time I create a new column I need to update O(10^8) values.
- I do the following inside a long for loop:
value = f'{i}-{j}'
id_column = 'id'
id_value = f'id{i}'
update_sql = """UPDATE {} SET {} = '{}' WHERE {} = '{}'""".format(
    table_name, column_name, value, id_column, id_value)
cur.execute(update_sql)
- This takes around 30 min.
- How can I take advantage of the generic nature of the entries to initialize new columns faster?
- I can generate an np.array of shape (10000000, 1) in less than 1s, and it has an ordering. Why can't I generate a column, with entries keyed to their location in the table, in a similar amount of time?
- I read Improve INSERT-per-second performance of SQLite? and transactions seem a good way to speed up updates; however, I don't know how to structure them for sqlite3.
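A minimal sketch of how a transaction can be structured in sqlite3 for this kind of bulk update: the key point is that sqlite3 opens a transaction implicitly on the first write, so it suffices to not commit inside the loop and to let executemany reuse one prepared statement. Table and column names ("data", "Type5") are hypothetical stand-ins for your schema.

```python
import sqlite3

# Hypothetical schema standing in for the real table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE data (id TEXT PRIMARY KEY, Type5 TEXT)")
cur.executemany("INSERT INTO data (id) VALUES (?)",
                [(f"id{i}",) for i in range(1000)])

j = 5
# One parameterized statement, bound once per row, all inside a single
# implicit transaction -- no per-row commit, no string formatting.
rows = [(f"{i}-{j}", f"id{i}") for i in range(1000)]
cur.executemany("UPDATE data SET Type5 = ? WHERE id = ?", rows)
conn.commit()  # single commit -> one fsync instead of one per row
```

Committing once at the end (or wrapping the loop in `with conn:`) is what avoids the per-statement journal flush that dominates the 30 min runtime.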
UPDATE 1: The following takes only 7s:
a = range(0,10000000)
b = [0]
c = [f'{x}-{y}' for x in a for y in b]
- The following takes only 2.84s, where jit is the numba decorator:
@jit(nopython=True)
def go_fast(j):
    a = np.linspace(0, 10000000, 10000001)
    c = []
    for i in range(0, len(a)):
        c.append((a[i], j))
    return c
- If creating such objects at runtime is so fast, why can't I get similar performance from SQL?
Looking for something like this now:
listing = fast_list.go_fast(j)
update_sql = "# Some statement to update the column to entries of listing"
cur.execute(update_sql)
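There is no single SQL statement that consumes a Python list directly, but executemany is the closest equivalent: it binds each element of the list into one prepared UPDATE. A sketch, keyed on SQLite's implicit rowid (which starts at 1) and using hypothetical names "data" and "Type5":

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE data (id TEXT, Type5 TEXT)")
cur.executemany("INSERT INTO data (id) VALUES (?)",
                [(f"id{i}",) for i in range(1000)])

j = 5
# The generated "listing": one (value, rowid) pair per row. executemany
# replays a single prepared statement over the whole list.
listing = [(f"{i}-{j}", i + 1) for i in range(1000)]
cur.executemany("UPDATE data SET Type5 = ? WHERE rowid = ?", listing)
conn.commit()
```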
- The following method takes only 185s, so an order of magnitude better.
- id2 is the row number
UPDATE table_name SET column_name = id2+j
This is much better, but still not very fast: it would take ~50h to do this 1000 times. I will probably also have to look into pairing functions more closely.
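Since the entries are always of the form 'i-j' with i the row number, the pairing can be done in SQL itself with the || concatenation operator, so the whole column becomes one statement and one table scan, with no Python loop at all. A sketch with hypothetical names (rowid is SQLite's implicit 1-based row number):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE data (id TEXT, Type5 TEXT)")
cur.executemany("INSERT INTO data (id) VALUES (?)",
                [(f"id{i}",) for i in range(1000)])

j = 5
# || coerces its operands to TEXT, so (rowid - 1) || '-' || j builds
# the 'i-j' string for every row in a single full-table UPDATE.
cur.execute("UPDATE data SET Type5 = (rowid - 1) || '-' || ?", (j,))
conn.commit()
```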
Fiddling with PRAGMA modes gets me down to 88.32s:
cur.execute('PRAGMA journal_mode = WAL')
- This brings it down to 32.33s:
cur.execute('PRAGMA cache_size = 3000')
cur.execute('PRAGMA mmap_size = 256000000')
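For reference, the pragmas above combined into one setup block, plus `synchronous = NORMAL`, a further common speed/durability trade-off under WAL that I have not timed here (an assumption, not one of the measurements above; the filename is a placeholder):

```python
import sqlite3

conn = sqlite3.connect("example.db")  # hypothetical database file
cur = conn.cursor()

# WAL turns many small rollback-journal writes into sequential log appends.
cur.execute("PRAGMA journal_mode = WAL")
# Positive cache_size is a page count (negative values would mean KiB).
cur.execute("PRAGMA cache_size = 3000")
# Memory-map up to 256 MB of the database file instead of read() calls.
cur.execute("PRAGMA mmap_size = 256000000")
# Under WAL, NORMAL defers fsync to checkpoints; faster, but less durable
# on power loss -- untested assumption here, not from the timings above.
cur.execute("PRAGMA synchronous = NORMAL")
```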
- Is there any reasonable way I can make this even faster?