
I'm trying to use a Python script to parse the Wikipedia archives. (Yeah, I know.) Of course:

  • Wikipedia XML: 45.95 GB
  • Available memory: 16 GB

This precludes loading the file into memory, and spilling into virtual memory isn't going to fare much better. So in order to work with the data, I decided to parse the necessary information into a SQLite database. For the XML parsing I used the ElementTree library, which performs quite well: I confirmed that running ONLY the XML parsing (with the database calls commented out), it runs in linear time, with no slowdown as it traverses the file.
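
For reference, the streaming parse is along these lines (a simplified sketch, not my exact code; the dump namespace and the iter_titles() helper are illustrative):

import xml.etree.ElementTree as ET

# Namespace used by MediaWiki dump files; the version suffix varies by dump.
NS = '{http://www.mediawiki.org/xml/export-0.8/}'

def iter_titles(xml_path):
    # iterparse() streams the file instead of building the whole tree in memory.
    for event, elem in ET.iterparse(xml_path, events=('end',)):
        if elem.tag == NS + 'page':
            title = elem.findtext(NS + 'title')
            if title is not None:
                yield title
            elem.clear()  # release the finished page element so memory stays roughly flat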

The problem comes with trying to insert MILLIONS of rows into the SQLite database (one per Wikipedia article). The simple version of the table that I'm using for testing is as follows:

CREATE TABLE articles(
    id INTEGER NOT NULL PRIMARY KEY,
    title TEXT NOT NULL UNIQUE ON CONFLICT IGNORE);

So I just have the id and a text field during this initial phase. When I start adding rows via:

INSERT OR IGNORE INTO articles(title) VALUES(?1);

it performs well at first, but around 8 million rows in it begins to slow down dramatically, by an order of magnitude or more.

Some detail is of course needed. I'm using cur.executemany() with a single cursor created before the insert statements, and each call passes a batch of about 100,000 rows. I don't call db.commit() until ALL of the million+ rows have been inserted; according to what I've read, executemany() shouldn't commit the transaction before db.commit() is called, as long as only INSERT statements are involved.
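
In outline, the loading loop looks something like this (a simplified sketch of the approach just described, with a hypothetical load_titles() helper):

import itertools
import sqlite3

def load_titles(db_path, titles, batch_size=100000):
    db = sqlite3.connect(db_path)
    cur = db.cursor()  # single cursor created before the inserts
    cur.execute("CREATE TABLE IF NOT EXISTS articles("
                "id INTEGER NOT NULL PRIMARY KEY, "
                "title TEXT NOT NULL UNIQUE ON CONFLICT IGNORE);")
    titles = iter(titles)
    while True:
        batch = list(itertools.islice(titles, batch_size))
        if not batch:
            break
        # All batches run in the same transaction; nothing is committed yet.
        cur.executemany("INSERT OR IGNORE INTO articles(title) VALUES(?1);",
                        ((t,) for t in batch))
    db.commit()  # single commit after ALL rows are in
    db.close()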

The source XML being read and the database being written are on two separate disks, and I've also tried creating the database in memory, but I see the slowdown regardless. I also tried the isolation_level=None option, adding the BEGIN TRANSACTION and COMMIT TRANSACTION calls myself at the beginning and end (so the entire parse sequence is one transaction), but it still doesn't help.
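
The explicit-transaction variant was roughly this (again a sketch; load_in_one_transaction() is a hypothetical wrapper around the same batches):

import sqlite3

def load_in_one_transaction(db_path, batches):
    # batches: an iterable of title lists, e.g. the ~100,000-row chunks from the parser.
    db = sqlite3.connect(db_path, isolation_level=None)  # no implicit transaction handling
    cur = db.cursor()
    cur.execute("BEGIN TRANSACTION;")
    for batch in batches:
        cur.executemany("INSERT OR IGNORE INTO articles(title) VALUES(?1);",
                        ((title,) for title in batch))
    cur.execute("COMMIT TRANSACTION;")
    db.close()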

Some other questions on this site suggest that indexing is the problem. I don't have any indexes on the table. I did try removing the UNIQUE constraint and limiting it to just id INTEGER PRIMARY KEY and title TEXT NOT NULL, but that had no effect either.

What's the best way to perform these kinds of insertions in SQLite for large data sets? Of course, this simple query is just the first of many; there will be other, more complex queries involving foreign keys (ids of articles in this table) as well as insert statements with embedded selects (selecting an id from the articles table during an insert). These are bound to hit the same problem, only exacerbated by a large margin: where the articles table has fewer than 15 million rows, the other tables will probably have over a billion. So these performance issues are even more concerning.
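
To make that concrete, the later tables and queries will have roughly this shape (the links table, its columns, and add_link() are hypothetical, purely for illustration):

import sqlite3

db = sqlite3.connect("articles.db")
cur = db.cursor()

# Hypothetical table of links between articles, keyed by articles.id.
cur.execute("CREATE TABLE IF NOT EXISTS links("
            "from_id INTEGER NOT NULL REFERENCES articles(id), "
            "to_id   INTEGER NOT NULL REFERENCES articles(id));")

def add_link(source_title, target_title):
    # Insert with an embedded SELECT that resolves both titles to ids.
    cur.execute("INSERT INTO links(from_id, to_id) "
                "SELECT a.id, b.id FROM articles AS a, articles AS b "
                "WHERE a.title = ? AND b.title = ?;",
                (source_title, target_title))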

user4157124
jstm88
  • Most likely SQLite is spilling over onto disk because the changes cannot be held in memory. See http://www.sqlite.org/tempfiles.html and http://www.sqlite.org/atomiccommit.html for how SQLite interacts with non-volatile storage. – Martijn Pieters Nov 14 '13 at 09:48
  • You could try and see if [WAL mode](http://www.sqlite.org/wal.html) helps improve performance. I've not used it, but at first glance I would certainly investigate the option. – Martijn Pieters Nov 14 '13 at 09:50
  • When you say "spilling over onto the disk" I assume you're talking about the transaction journal. In that case, would it be better to `COMMIT TRANSACTION` every 100k rows to make sure the uncommitted transactions never get large enough to be written to disk? I tried that and I don't think it changed anything. I also may try WAL mode but I want to try one thing at a time to figure out the true culprit. – jstm88 Nov 14 '13 at 10:08
  • TBH my remarks are guesses at best. If per transaction a lot of the already written pages need to be altered too then the same 'too many pages to fit in memory' behaviour would be present. But I haven't studied the SQLite memory model in enough detail. – Martijn Pieters Nov 14 '13 at 10:34
  • Well, I tried committing every 100k rows with no change. Then I tried WAL mode, which made it slower overall (2x slower for the first couple million rows) and the slowdown started happening at about 3M rows instead of closer to 8M. By my guess, since I'm only inserting, the 100k row batches shouldn't be changing significant portions of the already written parts, and the final database with all 1M+ rows is only about 1GB so with 16GB of memory I don't see the problem. – jstm88 Nov 14 '13 at 10:53
  • Sorry, then I am out of ideas for now; I'd be studying the SQLite documentation in more depth and see if I could monitor disk and memory usage over time, as well as anything else I could glean from SQLite debug options to figure out what is going on, from here on out. – Martijn Pieters Nov 14 '13 at 10:56
  • Try increasing the [cache size](http://www.sqlite.org/pragma.html#pragma_cache_size). – CL. Nov 14 '13 at 12:19
  • CL. - your comment kind of got hidden down there. That's what worked (you should add it as an answer so I can accept it) - turns out SQLite just had too small of a cache for my needs. When it grew to the point where the table was too large to fit in the cache entirely, it looks like it kept swapping parts of the table in and out repeatedly to check the primary key. I thought the primary key would be a bit "smarter" than that (since it's so common and it seems so simple, you'd think it's just a case of i++), but apparently not. ;) – jstm88 Nov 14 '13 at 15:17
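
A minimal sketch of the cache-size change suggested in CL.'s comment (the value below is only an example and should be sized to the available memory):

import sqlite3

db = sqlite3.connect("articles.db")
# cache_size is per connection, so set it before the bulk inserts. A positive value
# is a page count; with the default 1024-byte pages, 2,000,000 pages is about 2 GB.
# (Recent SQLite versions also accept a negative value meaning size in KiB.)
db.execute("PRAGMA cache_size = 2000000;")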

1 Answer


One "invisible" thing happening on insertion is updating a table's indices (and checking index-related constraints such as UNIQUE). Since you're ignoring UNIQUE violations anyway, you may find it useful to disable the indices on the table while you're loading the table, and if you really need them, build the indices once after the loading is complete.

But also beware that SQLite's lightning speed for small data comes from certain implicit assumptions that get increasingly violated when you're processing big data. It may not be an appropriate tool for your current problem on your current hardware.

dig
  • I was under the [impression](http://stackoverflow.com/questions/788568/sqlite3-disabling-primary-key-index-while-inserting) that the primary key index was disabled during transactions and was only updated at the end of each transaction, thus why I tried to do it in a single one. I also disabled the `UNIQUE` constraint but that didn't improve the performance at all. I'm running a test right now using no id column (just text) and seeing how it works. – jstm88 Nov 14 '13 at 12:55
  • It shouldn't help, because internally, SQLite is using an invisible `integer primary key` anyway. Anyway, if the bottleneck is not indexing, then page cache churn (on either SQLite's level, OS memory manager's level, or even the processor's cache's level) is a likely cause for the dramatic drop in performance. – dig Nov 14 '13 at 13:12
  • Well the primary key column was the culprit, at least in part... it works very quickly without it. Unfortunately it forces me to create the table, then create a new table and move the data into it to give it a primary key again... I'm going to experiment with the cache sizes to see if I can get it to work without dropping the primary key. – jstm88 Nov 14 '13 at 14:37