
I had assumed that the time cost of an insertion is logarithmic in the number of records, but my test on SQLite 3.22 (the "SQLite insertion cost" chart) seems to show that it is linear. Note that both the X and Y axes are in log scale.

The size(k) column is the number of rows I inserted in each test, in thousands. I ran 3 tests. Journal and synchronous are off, locking_mode is exclusive, and all operations are wrapped in a single transaction.
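Roughly, each test run is set up like this (a minimal sketch of the settings above, not the exact test script):

PRAGMA journal_mode = OFF;
PRAGMA synchronous = OFF;
PRAGMA locking_mode = EXCLUSIVE;

BEGIN;
-- create the tables/indexes and run all inserts here
COMMIT;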

time1

create table t1 (id primary key, name text);
create index nameIdx on t1(name);
-- for i = [1:<size>]
--   insert into t1 values(i, 'foo' || i);
create table t2 (id primary key, value int);
-- for i = [1:<size>]
--   insert into t2 values(i, i);
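(The same rows could also be generated in a single statement with a recursive CTE, roughly like the sketch below, where 100000 stands in for <size>; this is not what I actually ran, just an equivalent in plain SQLite.)

WITH RECURSIVE cnt(i) AS (
  SELECT 1 UNION ALL SELECT i + 1 FROM cnt WHERE i < 100000
)
INSERT INTO t1 SELECT i, 'foo' || i FROM cnt;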

time2

create table t1 (id primary key, name text);
-- for i = [1:<size>]
--   insert into t1 values(i, 'foo' || i);
create index nameIdx on t1(name);
create table t2 (id primary key, value int);
-- for i = [1:<size>]
--   insert into t2 values(i, i);

time3

create table t1 (id primary key, name text, value int);
-- for i = [1:<size>]
--   insert into t1 values(i, 'foo' || i, 0);
create index nameIdx on t1(name);
-- for i = [1:<size>]
--   update t1 set value=0 where id=<i>;

All three test cases have similar costs, and they look linear. I had also expected case 3 to be faster, because an update should not need to rebalance the tree or add new records, but case 3 is actually a little slower...

Are these results expected? Maybe my input data is too small to show the logarithmic complexity?

Joe C

1 Answer


SQLite optimizes inserts at the end of the table. sqlite3BtreeInsert() in btree.c has this comment:

/* If the cursor is currently on the last row and we are appending a
** new row onto the end, set the "loc" to avoid an unnecessary
** btreeMoveto() call */

To get worse run time, try inserting the rows in random order, or at least inserting a very large value first.
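For example, something like this sketch (the large id is arbitrary; it just has to be bigger than any id the loop will insert):

-- make later inserts land before the last row instead of appending
INSERT INTO t1 VALUES (1000000000, 'zzz');
-- then run the 1..<size> insert loop as before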

Anyway, the run time is dominated by disk I/O, and the non-leaf pages are most likely to be cached. Use more transactions (so that all pages need to be flushed to disk), or use an in-memory database.
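For example (sketch), commit in smaller batches instead of one big transaction:

BEGIN;
-- insert the first batch of rows
COMMIT;
BEGIN;
-- insert the next batch of rows
COMMIT;
-- ...and so on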

CL.
  • When inserting ordered data, splits/rebalancing happen at non-leaf nodes; when inserting random data, they happen more often at leaf nodes. Because disk I/O dominates, random data is more likely to show the log complexity. Is that the right intuition? – Joe C Feb 14 '18 at 18:23
  • First, measure how it actually behaves. – CL. Feb 14 '18 at 18:28