13

I'd like a single dict (key/value) database to be accessible from multiple Python scripts running at the same time.

If script1.py updates d[2839], then script2.py should see the modified value when querying d[2839] a few seconds later.

What would be a Pythonic solution for this?

Note: I'm on Windows, and the dict should hold at most 1M items (keys and values are both integers).

Basj
  • 3
    Nice question. I'm looking forward to a proper solution too. Meanwhile, have a look at multiprocessing [Manager](https://docs.python.org/3.6/library/multiprocessing.html#managers) (only suitable if you can start both scripts from a single entry point; see the sketch after these comments). Also as a last resort option you can run redis in Docker. – 9dogs Jan 23 '18 at 20:07
  • @9dogs Yes a solution for this would be very useful in many situations :) Note: I don't start all scripts from the same entry point / not using `multiprocessing`. I'm looking for a more lightweight solution than running redis in Docker, but I'll remember this option in last resort! – Basj Jan 23 '18 at 20:12
  • Wait, why would SQLite not work? – juanpa.arrivillaga Jan 23 '18 at 20:16
  • @juanpa.arrivillaga I read that SQLite is not made for concurrent write / read from multiple programs at the same time. Is it not true anymore? – Basj Jan 23 '18 at 20:19
  • @Basj. See: [sqlite3 concurrent access](https://stackoverflow.com/q/4060772/984421). It should be adequate for your requirements. – ekhumoro Jan 23 '18 at 20:39
  • @ekhumoro Interesting topic! I started a bounty [there](https://stackoverflow.com/questions/4060772/sqlite3-concurrent-access), because imho, the answers are not totally precise (they say more or less "it's ok" without addressing the really problematic cases: two write operations at the same time, etc.) – Basj Jan 23 '18 at 20:54
  • @Basj how many processes would access the shared data concurrently? – georgexsh Jan 26 '18 at 10:12
  • @georgexsh 4 processes (let's say < 10 in general). I would like to be 100,00000% sure that no data is lost in the very rare case where 2 write operations happen at the same time and the 2nd one overwrites the first at the same time + 1µs. – Basj Jan 26 '18 at 10:14
  • @juanpa.arrivillaga [this is the reason](https://stackoverflow.com/questions/48459493/sqlite-concurrency-the-2nd-process-doesnt-get-db-updates) why I thought SQLite would not work. Any idea about how to solve it? – Basj Jan 26 '18 at 10:34
  • @Basj moved the answer here https://stackoverflow.com/questions/48459493/sqlite-concurrency-the-2nd-process-doesnt-get-db-updates/48469535, so SQLite would be OK in the simplest scenario (1 writer and multiple readers). BUT if you need multiple writers with highly concurrent writes, everything could be trickier – Oleg Kuralenko Jan 26 '18 at 21:10
  • How about trying etcd on windows? https://github.com/coreos/etcd/blob/master/Documentation/op-guide/supported-platform.md – Tarun Lalwani Jan 27 '18 at 05:39
  • Redis would be a superb, fast solution for you - it is a very fast, in-memory data structure server http://redis.io – Mark Setchell Jan 27 '18 at 08:45
  • And that's why we have databases: common / shared / distributed / concurrent access. – Deepak Verma Feb 01 '18 at 11:50
  • Redis is "not officially supported" but the MS OpenTech port works great. It's probably the easiest and best tool for what you need. – joshua Feb 02 '18 at 06:05

8 Answers

11

Most embedded datastores other than SQLite don't have optimizations for concurrent access. I was also curious about SQLite's concurrent performance, so I did a benchmark:

import time
import sqlite3
import os
import random
import sys
import multiprocessing


class Store():

    def __init__(self, filename='kv.db'):
        self.conn = sqlite3.connect(filename, timeout=60)
        self.conn.execute('pragma journal_mode=wal')
        self.conn.execute('create table if not exists "kv" (key integer primary key, value integer) without rowid')
        self.conn.commit()

    def get(self, key):
        row = self.conn.execute('select value from "kv" where key=?', (key,)).fetchone()
        if row:
            return row[0]

    def set(self, key, value):
        self.conn.execute('replace into "kv" (key, value) values (?,?)', (key, value))
        self.conn.commit()


def worker(n):
    d = [random.randint(0, 1<<31) for _ in range(n)]
    s = Store()
    for i in d:
        s.set(i, i)
    random.shuffle(d)
    for i in d:
        s.get(i)


def test(c):
    n = 5000
    start = time.time()
    ps = []
    for _ in range(c):
        p = multiprocessing.Process(target=worker, args=(n,))
        p.start()
        ps.append(p)
    while any(p.is_alive() for p in ps):
        time.sleep(0.01)
    cost = time.time() - start
    print(f'{c:<10d}\t{cost:<7.2f}\t{n/cost:<20.2f}\t{n*c/cost:<14.2f}')


def main():
    print(f'concurrency\ttime(s)\tper process TPS(r/s)\ttotal TPS(r/s)')
    for c in range(1, 9):
        test(c)


if __name__ == '__main__':
    main()

Result on my 4-core macOS box, SSD volume:

concurrency time(s) per process TPS(r/s)    total TPS(r/s)
1           0.65    7638.43                 7638.43
2           1.30    3854.69                 7709.38
3           1.83    2729.32                 8187.97
4           2.43    2055.25                 8221.01
5           3.07    1629.35                 8146.74
6           3.87    1290.63                 7743.78
7           4.80    1041.73                 7292.13
8           5.37    931.27                  7450.15

Result on an 8-core Windows Server 2012 cloud server, SSD volume:

concurrency     time(s) per process TPS(r/s)    total TPS(r/s)
1               4.12    1212.14                 1212.14
2               7.87    634.93                  1269.87
3               14.06   355.56                  1066.69
4               15.84   315.59                  1262.35
5               20.19   247.68                  1238.41
6               24.52   203.96                  1223.73
7               29.94   167.02                  1169.12
8               34.98   142.92                  1143.39

It turns out that overall throughput is consistent regardless of concurrency, and that SQLite is slower on Windows than on macOS. I hope this is helpful.


As the SQLite write lock is database-wide, in order to get more TPS you could partition the data across multiple database files:

class MultiDBStore():

    def __init__(self, buckets=5):
        self.buckets = buckets
        self.conns = []
        for n in range(buckets):
            conn = sqlite3.connect(f'kv_{n}.db', timeout=60)
            conn.execute('pragma journal_mode=wal')
            conn.execute('create table if not exists "kv" (key integer primary key, value integer) without rowid')
            conn.commit()
            self.conns.append(conn)

    def _get_conn(self, key):
        assert isinstance(key, int)
        return self.conns[key % self.buckets]

    def get(self, key):
        row = self._get_conn(key).execute('select value from "kv" where key=?', (key,)).fetchone()
        if row:
            return row[0]

    def set(self, key, value):
        conn = self._get_conn(key)
        conn.execute('replace into "kv" (key, value) values (?,?)', (key, value))
        conn.commit()

Result on my Mac with 20 partitions:

concurrency time(s) per process TPS(r/s)    total TPS(r/s)
1           2.07    4837.17                 4837.17
2           2.51    3980.58                 7961.17
3           3.28    3047.68                 9143.03
4           4.02    2486.76                 9947.04
5           4.44    2249.94                 11249.71
6           4.76    2101.26                 12607.58
7           5.25    1903.69                 13325.82
8           5.71    1752.46                 14019.70

Total TPS is higher than with a single database file.

georgexsh
  • Thank you very much. Because `SQLite supports an unlimited number of simultaneous readers, but it will only allow one writer at any instant in time.` ([reference here](https://stackoverflow.com/a/48512801/1422096)), isn't there a need to add something to the code to allow multiple writers, and to be sure that write operations happening at the same time don't overwrite each other? – Basj Jan 30 '18 at 09:11
  • @Basj I added a database partition implementation. – georgexsh Jan 30 '18 at 17:18
  • Thanks @georgexsh. Isn't there a way to share *the same DB file* among multiple processes, and all of them can write/read? (maybe using a lock?). (I forgot to mention this in the question but it was the idea: share a dict with multiple scripts, and many of them can write in the DB, without partitionning). – Basj Jan 30 '18 at 17:26
  • @Basj IIRC no embedded database engine supports concurrent updates on the same database file, because there is no central server process to coordinate. What you should be concerned with is the throughput the DB engine can provide versus your workload requirements, and that is what this benchmark is for. – georgexsh Jan 31 '18 at 09:07
  • We can imagine @georgexsh that if we are in the rare case where exactly 2 write operations want to happen at the exact same time, then the 2nd one has to wait for the 1st one to finish (with a lock system). How to do this with `sqlite3`? – Basj Jan 31 '18 at 09:16
  • @Basj SQLite engine could [handle this well by itself](https://www.sqlite.org/wal.html). – georgexsh Jan 31 '18 at 09:18
  • I don't think so @georgexsh, I see `Writers merely append new content to the end of the WAL file. Because writers do nothing that would interfere with the actions of readers, writers and readers can run at the same time. However, **since there is only one WAL file, there can only be one writer at a time**.` – Basj Jan 31 '18 at 09:46
  • ACID @georgexsh, and if possible with simple code ;) (sounds impossible, but with Python we are used to that hehe!). Maybe just a parameter that enables locking? – Basj Jan 31 '18 at 10:37
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/164297/discussion-between-georgexsh-and-basj). – georgexsh Jan 31 '18 at 20:29
  • @basj no idea why you worry about ACID compliance, that's the most important feature of SQLite; is there anything unclear to you about [its doc](https://www.sqlite.org/wal.html)? – georgexsh Jan 31 '18 at 20:41
  • What is unclear @georgexsh is that it mentions `SQLite supports an unlimited number of simultaneous readers, but it will only allow one writer at any instant in time`, so there is a need for a solution that works for multiple writers. A bad, but working solution could be `if file_currently_being_written_by_another_writer(): time.sleep(1); retry_to_write()`. It's unclear if this is done automatically by `sqlite3` or if I should do it manually. – Basj Jan 31 '18 at 20:51
  • 1
    @Basj ACID compliance is what SQLite is made for, the last thing you should worry about... anyway, you could check `sqlite3WalBeginWriteTransaction()` in [wal.c](https://www.sqlite.org/src/artifact/5a3f464edd64596f): the writer will wait for `WAL_WRITE_LOCK` at the beginning of a transaction. I guess that's the "locking" part you want to see, but the complete mechanism is more complex than that. – georgexsh Feb 01 '18 at 09:19
  • 2
    @Basj - the "only allow one writer at any instant" is _because sqlite handles concurrency for you_. If one process is writing, another process trying to write won't fail, it will block. You don't need to do anything for concurrent writes to be fine. [This document](https://sqlite.org/lockingv3.html#writing) describes the internals; [this one about the python library](https://docs.python.org/2/library/sqlite3.html#sqlite3.connect) should reassure you. – Nathan Vērzemnieks Feb 01 '18 at 17:44
  • Ohhhhhhhh then it's big news @NathanVērzemnieks. Maybe I misunderstood `only allow one writer at any instant`. I understood this sentence (thus I posted a few questions on SO about that) as *"Don't try to use several scripts writing the same database, it will fail! Data will be lost! We only allow one writer at any instant!"* whereas you say it is more "SQLite will do everything for you *to ensure that only one writer writes at the same time*, so that it will work". Is that right? – Basj Feb 01 '18 at 19:29
  • 1
    @Basj yes, sqlite will do it for you internally. – georgexsh Feb 02 '18 at 01:25
  • @Basj is there anything I could improve? – georgexsh Feb 05 '18 at 20:45
4

Before there was Redis, there was Memcached (which works on Windows). Here is a tutorial: https://realpython.com/blog/python/python-memcache-efficient-caching/
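
A rough sketch of the idea with the pymemcache client (pip install pymemcache), assuming a memcached server is already listening on localhost:11211; memcached keys are strings/bytes, so the integer keys are converted:

from pymemcache.client.base import Client

client = Client(('localhost', 11211))

def set_item(key, value):
    client.set(str(key), str(value))   # keys/values must be str or bytes

def get_item(key):
    raw = client.get(str(key))         # bytes, or None if the key is missing
    return int(raw) if raw is not None else None

set_item(2839, 123)    # script1.py writes
print(get_item(2839))  # script2.py reads 123

Keep in mind that memcached is a cache, so entries can be evicted under memory pressure; it covers the sharing, not durable storage.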

Back2Basics
  • The title gave me a hint as to the author's intention, "Share a dict with other python scripts", which doesn't need a datastore to do that. While a database was mentioned several times, storing the data might have been a side benefit rather than a requirement. – Back2Basics Feb 03 '18 at 08:37
2

I'd consider 2 options; both are embedded databases:

SQLite

As answered here and here, it should be fine.

BerkeleyDB

link

Berkeley DB (BDB) is a software library intended to provide a high-performance embedded database for key/value data

It has been designed exactly for your purpose:

BDB can support thousands of simultaneous threads of control or concurrent processes manipulating databases as large as 256 terabytes, on a wide variety of operating systems including most Unix-like and Windows systems, and real-time operating systems.

It is robust and has been around for years, if not decades.
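
As a rough sketch of how this could look with the bsddb3 bindings (pip install bsddb3); the environment directory and file names are made up, and a "concurrent data store" environment is opened so that several processes can share the same file (one writer at a time, many readers):

import os
from bsddb3 import db

os.makedirs('bdb_env', exist_ok=True)

env = db.DBEnv()
env.open('bdb_env', db.DB_CREATE | db.DB_INIT_CDB | db.DB_INIT_MPOOL)

d = db.DB(env)
d.open('kv.db', None, db.DB_HASH, db.DB_CREATE)

# keys and values are bytes, so the integers are encoded as strings
d.put(str(2839).encode(), str(123).encode())
value = d.get(str(2839).encode())
print(int(value) if value is not None else None)

d.close()
env.close()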

Bringing up Redis, Memcached or whatever other full-fledged socket-based server that requires sysops involvement is, IMO, unnecessary overhead for the task of exchanging data between 2 scripts located on the same box.

Oleg Kuralenko
2

You can use a Python dictionary for this purpose.

Create a generic class or script named G that initializes a dictionary. G runs script1.py and script2.py and passes the dictionary to both script files; in Python, dictionaries are passed by reference by default. This way a single dictionary is used to store the data, both scripts can modify its values, and changes are visible in both scripts. I assume script1.py and script2.py are class-based. This doesn't guarantee persistence of the data; for persistence you can store the data in a database every x interval.

Example

script1.py

class SCRIPT1:

    def __init__(self, dictionary):
        self.dictionary = dictionary
        self.dictionary.update({"a":"a"})
        print("SCRIPT1 : ", self.dictionary)

    def update(self):
        self.dictionary.update({"c":"c"})          

script2.py

class SCRIPT2:
    def __init__(self, dictionary):
        self.dictionary = dictionary
        self.dictionary.update({"b":"b"})
        print("SCRIPT 2 : " , self.dictionary)

main_script.py

import script1
import script2

x = {}

obj1 = script1.SCRIPT1(x) # output: SCRIPT1 :  {'a': 'a'}
obj2 = script2.SCRIPT2(x) # output: SCRIPT 2 :  {'a': 'a', 'b': 'b'}
obj1.update()
print("SCRIPT 1 dict: ", obj1.dictionary) # output: SCRIPT 1 dict:  {'c': 'c', 'a': 'a', 'b': 'b'}

print("SCRIPT 2 dict: ", obj2.dictionary) # output: SCRIPT 2 dict:  {'c': 'c', 'a': 'a', 'b': 'b'}

Also create an empty `__init__.py` file in the directory where you will run the scripts.

Another option is:

Redis
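
A rough sketch with the redis-py client (pip install redis), assuming a Redis server (for example the Windows port mentioned in the comments) is listening on the default port:

import redis

r = redis.Redis(host='localhost', port=6379, db=0)

r.set(2839, 123)     # script1.py writes; redis-py converts the ints for you
value = r.get(2839)  # script2.py reads; redis returns bytes
print(int(value) if value is not None else None)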

Irtiza
  • Thank you for your answer, please post a code example to make it more understandable. – Basj Feb 01 '18 at 14:52
  • 1
    What OP meant by "multiple Python scripts running at the same time" was "with multiple interpreters instances". – Darkonaut Feb 01 '18 at 22:46
  • @Darkonaut I don't understand your question, kindly explain it a little more – Irtiza Feb 02 '18 at 06:30
  • It was not a question, but your answer does not do what the OP asked for. Everything in your code runs sequentially on one interpreter instance (one process). It's just code from two other scripts being imported, not running other processes, much less sharing something between processes. – Darkonaut Feb 02 '18 at 16:04
0

You could use a document-based database manager. Maybe it is too heavy for your use case, but concurrent access is typically one of the reasons database management systems and the APIs to connect to them exist.

I have used MongoDB with Python and it works fine. The Python API documentation is quite good, and each document (element of the database) is a dictionary that can be loaded into Python as such.
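
A rough sketch of that with pymongo (pip install pymongo), assuming MongoDB is running locally; the database and collection names are made up, and each key/value pair is stored as one document with the key as _id:

from pymongo import MongoClient

kv = MongoClient('localhost', 27017)['shared']['kv']

def set_item(key, value):
    # upsert one document per key; _id doubles as the dict key
    kv.replace_one({'_id': key}, {'_id': key, 'value': value}, upsert=True)

def get_item(key):
    doc = kv.find_one({'_id': key})
    return doc['value'] if doc else None

set_item(2839, 123)    # script1.py writes
print(get_item(2839))  # script2.py reads 123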

Martín Gómez
0

I would use a pub/sub WebSocket framework, like Autobahn/Python, with one script as a "server" that handles all the file communication, but depending on the scale this could be overkill.

0

CodernityDB could be worth exploring, using the server version.

http://labs.codernity.com/codernitydb/

Server version: http://labs.codernity.com/codernitydb/server.html

Marcin
0

It sounds like what you really need is a database of some kind.

If Redis won't work on Windows, then I would look at MongoDB.

https://docs.mongodb.com/manual/tutorial/install-mongodb-on-windows/

MongoDB works great with Python and can function similarly to Redis. Here are the install docs for PyMongo: http://api.mongodb.com/python/current/installation.html

Also, many people have brought up SQLite. I think you were concerned that it only allows one writer at a time, but this is not really a problem for you to worry about. I think what it means is that if there are two writers, the second will be blocked until the first is finished. This is probably fine for your situation.
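
A rough illustration of that point, assuming a shared file named shared.db: each script opens its own connection with a timeout, so a writer that finds the database locked simply waits (up to 30 s here) instead of raising an error:

import sqlite3

conn = sqlite3.connect('shared.db', timeout=30)  # wait for locks instead of failing
conn.execute('pragma journal_mode=wal')          # readers and writers don't block each other
conn.execute('create table if not exists kv (key integer primary key, value integer)')
conn.commit()

def set_item(key, value):
    with conn:  # one transaction per write; sqlite serializes concurrent writers
        conn.execute('replace into kv (key, value) values (?, ?)', (key, value))

def get_item(key):
    row = conn.execute('select value from kv where key=?', (key,)).fetchone()
    return row[0] if row else None

set_item(2839, 123)    # in script1.py
print(get_item(2839))  # in script2.py, a few seconds later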

ryati