
I have a Python program that uses the "threading" module. Once every second, my program starts a new thread that fetches some data from the web, and stores this data to my hard drive. I would like to use sqlite3 to store these results, but I can't get it to work. The issue seems to be about the following line:

conn = sqlite3.connect("mydatabase.db")
  • If I put this line of code inside each thread, I get an OperationalError telling me that the database file is locked. I guess this means that another thread has mydatabase.db open through a sqlite3 connection and has locked it.
  • If I put this line of code in the main program and pass the connection object (conn) to each thread, I get a ProgrammingError, saying that SQLite objects created in a thread can only be used in that same thread.

Previously I was storing all my results in CSV files, and did not have any of these file-locking issues. Hopefully this will be possible with sqlite. Any ideas?

RexE
  • I'd like to note that more recent versions of Python include newer versions of sqlite3 that should fix this problem. – hopti May 01 '12 at 17:45
  • @RyanFugger do you know what the earliest version that supports this is? I'm using 2.7 – notbad.jpeg Jun 25 '13 at 22:34
  • @RyanFugger AFAIK there's no pre-built version that contains a newer version of SQLite3 that has that fixed. You can build one yourself, though. – shezi Nov 08 '13 at 13:36

15 Answers

204

Contrary to popular belief, newer versions of sqlite3 do support access from multiple threads.

This can be enabled via the optional keyword argument check_same_thread:

sqlite3.connect(":memory:", check_same_thread=False)
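For illustration, here is a minimal sketch of sharing one connection this way (the table and names are invented for the example), with writes serialized by a lock as the Python docs advise:

```python
import sqlite3
import threading

# One connection shared by several threads, allowed via check_same_thread=False.
# Writes are serialized with a lock to avoid corrupting the database.
conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE results (thread_name TEXT, value INTEGER)")
write_lock = threading.Lock()

def worker():
    name = threading.current_thread().name
    for i in range(10):
        with write_lock:  # only one thread writes at a time
            conn.execute("INSERT INTO results VALUES (?, ?)", (name, i))
            conn.commit()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

count = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
print(count)  # 4 threads x 10 rows = 40
```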
Jeremiah Rose
  • I've encountered unpredictable exceptions and even Python crashes with this option (Python 2.7 on Windows 32). – reclosedev May 04 '12 at 11:19
  • According to the [docs](http://www.sqlite.org/threadsafe.html), in multi-threaded mode no single database connection can be used in multiple threads. There is also a serialized mode. – Casebash May 13 '13 at 00:40
  • Can someone link to the specific part of the documentation here? – Igor Medeiros Sep 27 '13 at 03:15
  • Never mind, just found it: [http://sqlite.org/compile.html#threadsafe](http://sqlite.org/compile.html#threadsafe) – Igor Medeiros Sep 27 '13 at 03:17
  • @reclosedev can you tell more about your exceptions? – FrEaKmAn Sep 15 '14 at 16:40
  • @FrEaKmAn, sorry, it was a long time ago, and not a :memory: database. After that I didn't share an sqlite connection between multiple threads. – reclosedev Sep 15 '14 at 18:33
  • @FrEaKmAn, I've encountered this, with the Python process core-dumping on multi-threaded access. The behavior was unpredictable, and no exception was logged. If I remember correctly, this was true for both reads and writes. This is the one thing I've seen actually crash Python so far :D. I've not tried this with sqlite compiled in threadsafe mode, but at the time I did not have the liberty of recompiling the system's default sqlite. I ended up doing something akin to what Eric suggested and disabled thread compatibility. – verboze Feb 25 '15 at 17:29
  • @verboze You don't need to alter the current sqlite3 module. Get a copy, put it in the proper folder in your project, then import from that instead. On the other hand, the ability to skip thread checks should be granted by the app that creates the connection, not by new threads asking for access :C – m3nda Jun 21 '16 at 07:29
  • In Python 3.6, check_same_thread seems to be OK so far on win32 with tons of threads. – Erik Aronesty Feb 07 '18 at 22:28
  • **WARNING**: The Python docs have [this](https://docs.python.org/3/library/sqlite3.html#sqlite3.connect) to say about the `check_same_thread` option: "When using multiple threads with the same connection writing operations should be serialized by the user to avoid data corruption." So yes, you _can_ use SQLite with multiple threads as long as your code ensures that only one thread can write to the database at any given time. If it doesn't, you might corrupt your database. – Ajedi32 Sep 02 '20 at 16:07
  • @Ajedi32 what about writing and reading simultaneously? The documentation is not clear on this one and I could interpret it either way. Do you happen to have experience using Python's sqlite3 module with 1 writing thread and one or more simultaneous reading threads? Will it crash or work fine? – blubberdiblub Mar 31 '21 at 09:30
51

You can use the producer-consumer pattern. For example, you can create a queue that is shared between threads. The thread that fetches data from the web enqueues this data in the shared queue, and another thread that owns the database connection dequeues the data and passes it to the database.
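A minimal sketch of that pattern using the standard library (the table name and payloads are invented for the example):

```python
import queue
import sqlite3
import threading

data_queue = queue.Queue()
SENTINEL = object()   # tells the writer thread to stop
row_counts = []       # lets the main thread observe the result

def writer():
    # The only thread that ever touches the database connection.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE fetched (payload TEXT)")
    while True:
        item = data_queue.get()
        if item is SENTINEL:
            break
        conn.execute("INSERT INTO fetched VALUES (?)", (item,))
        conn.commit()
    row_counts.append(
        conn.execute("SELECT COUNT(*) FROM fetched").fetchone()[0]
    )
    conn.close()

def producer(i):
    # Stands in for a thread that fetches data from the web.
    data_queue.put("result %d" % i)

w = threading.Thread(target=writer)
w.start()
producers = [threading.Thread(target=producer, args=(i,)) for i in range(5)]
for p in producers:
    p.start()
for p in producers:
    p.join()
data_queue.put(SENTINEL)  # all producers are done; tell the writer to finish
w.join()
print(row_counts[0])  # 5 rows, one per producer
```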

Evgeny Lazin
  • FWIW: Later versions of sqlite claim you can share connections and objects across threads (except cursors), but I've found otherwise in actual practice. – Richard Levasseur Feb 14 '09 at 02:22
  • [Here](http://code.activestate.com/recipes/526618/) is an example of what Evgeny Lazin mentioned above. – dugres May 03 '09 at 08:21
  • Hiding your database behind a shared queue is a really bad solution to this question because SQL in general and SQLite specifically *already* have built-in locking mechanisms, which are probably much more refined than anything you can build ad-hoc by yourself. – shezi Nov 08 '13 at 13:35
  • You need to read the question; at that moment there were no built-in locking mechanisms. Many contemporary embedded databases lack this mechanism for performance reasons (for example: LevelDB). – Evgeny Lazin Nov 15 '13 at 14:39
19

The following was found on mail.python.org.pipermail.1239789:

I have found the solution. I don't know why the Python documentation doesn't have a single word about this option. So we have to add a new keyword argument to the connection function, and then we are able to create cursors from it in different threads. So use:

sqlite3.connect(":memory:", check_same_thread=False)

This works out perfectly for me. Of course, from now on I need to take care of safe multithreaded access to the db. Anyway, thanks to everyone for trying to help.

Robert Krolik
  • (With the GIL, there really isn't much in the way of true multithreaded access to the db anyway, that I've seen.) – Erik Aronesty Feb 07 '18 at 22:29
  • **WARNING**: The Python docs have [this](https://docs.python.org/3/library/sqlite3.html#sqlite3.connect) to say about the `check_same_thread` option: "When using multiple threads with the same connection writing operations should be serialized by the user to avoid data corruption." So yes, you _can_ use SQLite with multiple threads as long as your code ensures that only one thread can write to the database at any given time. If it doesn't, you might corrupt your database. – Ajedi32 Sep 02 '20 at 16:07
16

Switch to multiprocessing. It is much better, scales well, can go beyond multiple cores to multiple CPUs, and the interface is the same as Python's threading module.

Or, as Ali suggested, just use SQLAlchemy's thread pooling mechanism. It will handle everything for you automatically and has many extra features, just to quote some of them:

  1. SQLAlchemy includes dialects for SQLite, Postgres, MySQL, Oracle, MS-SQL, Firebird, MaxDB, MS Access, Sybase and Informix; IBM has also released a DB2 driver. So you don't have to rewrite your application if you decide to move away from SQLite.
  2. The Unit Of Work system, a central part of SQLAlchemy's Object Relational Mapper (ORM), organizes pending create/insert/update/delete operations into queues and flushes them all in one batch. To accomplish this it performs a topological "dependency sort" of all modified items in the queue so as to honor foreign key constraints, and groups redundant statements together where they can sometimes be batched even further. This produces the maximum efficiency and transaction safety, and minimizes chances of deadlocks.
nosklo
12

You shouldn't be using threads at all for this. This is a trivial task for twisted, and using it would likely take you significantly further anyway.

Use only one thread, and have the completion of the request trigger an event to do the write.

twisted will take care of the scheduling, callbacks, etc... for you. It'll hand you the entire result as a string, or you can run it through a stream-processor (I have a twitter API and a friendfeed API that both fire off events to callers as results are still being downloaded).

Depending on what you're doing with your data, you could just dump the full result into sqlite as it's complete, cook it and dump it, or cook it while it's being read and dump it at the end.

I have a very simple application that does something close to what you're wanting on github. I call it pfetch (parallel fetch). It grabs various pages on a schedule, streams the results to a file, and optionally runs a script upon successful completion of each one. It also does some fancy stuff like conditional GETs, but still could be a good base for whatever you're doing.

Dustin
8

Or, if you are lazy like me, you can use SQLAlchemy. It will handle the threading for you (using thread-local storage and some connection pooling), and the way it does so is even configurable.

As an added bonus, if/when you realise/decide that using SQLite for any concurrent application is going to be a disaster, you won't have to change your code to use MySQL, Postgres, or anything else. You can just switch over.

Ali Afshar
4

You need to call session.close() after every transaction to the database, so that the same cursor is only ever used within the same thread; sharing one cursor across multiple threads is what causes this error.

anoopknr
3

Use threading.Lock() to serialize access to the shared connection.

Alexandr
2

I could not find any benchmarks in any of the above answers so I wrote a test to benchmark everything.

I tried three approaches:

  1. Reading and writing sequentially from the SQLite database
  2. Using a ThreadPoolExecutor to read/write
  3. Using a ProcessPoolExecutor to read/write

The results and takeaways from the benchmark are as follows:

  1. Sequential reads/sequential writes work the best
  2. If you must process in parallel, use the ProcessPoolExecutor to read in parallel
  3. Do not perform writes using either the ThreadPoolExecutor or the ProcessPoolExecutor, as you will run into database-locked errors and will have to retry inserting the chunk

You can find the code and complete solution for the benchmarks in my SO answer HERE. Hope that helps!

PirateApp
1

I would take a look at the y_serial Python module for data persistence: http://yserial.sourceforge.net

It handles deadlock issues surrounding a single SQLite database. If the demand for concurrency gets heavy, one can easily set up the Farm class of many databases to diffuse the load over stochastic time.

Hope this helps your project... it should be simple enough to implement in 10 minutes.

code43
1

Scrapy seems like a potential answer to my question. Its home page describes my exact task. (Though I'm not sure how stable the code is yet.)

RexE
0

I like Evgeny's answer - Queues are generally the best way to implement inter-thread communication. For completeness, here are some other options:

  • Close the DB connection when the spawned threads have finished using it. This would fix your OperationalError, but opening and closing connections like this is generally a No-No, due to performance overhead.
  • Don't use child threads. If the once-per-second task is reasonably lightweight, you could get away with doing the fetch and store, then sleeping until the right moment. This is undesirable because the fetch and store operations could take more than one second, and you lose the benefit of the multiplexed resources you have with a multi-threaded approach.
James Brady
0

You need to design the concurrency for your program. SQLite has clear limitations and you need to obey them; see the FAQ (and also the following question).

iny
0

Please consider checking the value of THREADSAFE in the pragma_compile_options of your SQLite installation. For instance, with

SELECT * FROM pragma_compile_options;

If THREADSAFE is equal to 1, then your SQLite installation is threadsafe, and all you have to do to avoid the threading exception is create the Python connection with check_same_thread set to False. In your case, this means

conn = sqlite3.connect("mydatabase.db", check_same_thread=False)

That's explained in some detail in Python, SQLite, and thread safety
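A quick way to run that check from Python itself, against whatever SQLite library the sqlite3 module is linked to:

```python
import sqlite3

# Query the compile-time options of the linked SQLite library; one of the
# reported entries carries the THREADSAFE setting.
conn = sqlite3.connect(":memory:")
options = dict(
    opt.split("=", 1) if "=" in opt else (opt, "")
    for (opt,) in conn.execute("PRAGMA compile_options")
)
print(options.get("THREADSAFE"))  # typically "1" (serialized mode)
conn.close()
```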

Hilton Fernandes
-1

The most likely reason you get errors with locked databases is that you must issue

conn.commit()

after finishing a database operation. If you do not, your database will be write-locked and stay that way. The other threads that are waiting to write will time out after a while (the default is 5 seconds; see http://docs.python.org/2/library/sqlite3.html#sqlite3.connect for details).

An example of correct, concurrent insertion would be this:

import threading, sqlite3
class InsertionThread(threading.Thread):

    def __init__(self, number):
        super(InsertionThread, self).__init__()
        self.number = number

    def run(self):
        conn = sqlite3.connect('yourdb.db', timeout=5)
        conn.execute('CREATE TABLE IF NOT EXISTS threadcount (threadnum, count);')
        conn.commit()

        for i in range(1000):
            conn.execute("INSERT INTO threadcount VALUES (?, ?);", (self.number, i))
            conn.commit()

# create as many of these as you wish
# but be careful to set the timeout value appropriately: thread switching in
# python takes some time
for i in range(2):
    t = InsertionThread(i)
    t.start()

If you like SQLite, have other tools that work with SQLite databases, want to replace CSV files with SQLite db files, or must do something rare like inter-platform IPC, then SQLite is a great tool, very fitting for the purpose. Don't let yourself be pressured into using a different solution if it doesn't feel right!

shezi