39

There are many ways to serialize a small dictionary: json.loads/json.dumps, pickle, shelve, ujson, or even sqlite.

But when dealing with possibly 100 GB of data, such modules are no longer an option, since they may rewrite the whole dataset when closing / serializing.

redis is not really an option because it uses a client/server scheme.

Question: Which serverless key:value stores, able to work with 100+ GB of data, are frequently used in Python?

I'm looking for a solution with a standard "Pythonic" d[key] = value syntax:

import mydb
d = mydb.mydb('myfile.db')
d['hello'] = 17          # able to use string or int or float as key
d[183] = [12, 14, 24]    # able to store lists as values (will probably internally jsonify it?)
d.flush()                # easy to flush on disk 

Note: BsdDB (BerkeleyDB) seems to be deprecated. There seems to be a LevelDB binding for Python, but it doesn't seem well-known, and I haven't found a version that is ready to use on Windows. Which ones are the most common?


Linked questions: Use SQLite as a key:value store, Flat file NoSQL solution

Basj
  • 3
    SQLite should work great. Did you have any problems using it? Its the DBMS that is small but the DB itself can be large. See https://stackoverflow.com/questions/14451624/will-sqlite-performance-degrade-if-the-database-size-is-greater-than-2-gigabytes – Himanshu Nov 11 '17 at 01:58
  • @Himanshu It's the fact the usage with SQLite is not as simple as `db[key] = value` or `db.put('key', 'value')`, but uses SQL instead... And I'd like to avoid INSERT into TABLE or SELECT ... for just a simple key:value `db[key] = value` set/get. – Basj Nov 11 '17 at 02:01
  • Can you describe the data more? 100 GB of what? How large is the smallest/median/largest value? How many key/value pairs make up the 100 GB? – John Zwinck Nov 11 '17 at 02:21
  • You may be able to get this working in dask but I've never actually used it, it's on my to-do. Apparently it runs on a single system too. Or you can always have MongoDB - there's nothing stopping you running that on localhost. I'm not sure what your requirement for serverless stems from, you might not have a choice for such large data stores on a single PC. – roganjosh Nov 11 '17 at 03:16
  • @JohnZwinck the keys are always 10 bytes, and the values are strings of length 200 to 1000. It should be able to handle 100 million key/value pairs, for example – Basj Nov 11 '17 at 11:21
  • Thanks for the clarification @Basj. Can you also tell us a bit about the read/write patterns? Will you load all the data then query it repeatedly? Or will the data be updated randomly, or mostly appended to, while reads are ongoing? – John Zwinck Nov 11 '17 at 13:13
  • Yes, mostly appended (90 % of the time) and modified once in a while. – Basj Nov 11 '17 at 13:19
  • +1. You must create a record filter for this. You can define key names as individual folders and save them (the first letter of the key name). You must also write a summary of the contents of the folder in each record to another file (`/A/content.file`). You need to adjust the file size according to I/O speed and performance (if necessary, you can also assign the second letter as a folder (`/A/ab/`)). The file system design depends entirely on the type of data, length, access time, etc. I don't understand why you don't use a database. – dsgdfg Jan 18 '18 at 13:15
  • 2
    Python's [dbm](https://docs.python.org/3/library/dbm.html) module is perfect for this. I just checked and it doesn't load anything into memory and has an interface identical to a dict. I can't post an answer because it's closed, so I decided to post here. – Matthew D. Scholefield Jan 06 '20 at 20:37
  • @MatthewD.Scholefield A detailed answer with `dbm` would be interesting indeed. I voted to reopen, in case other people are interested. PS: I read https://github.com/python/cpython/blob/3.8/Lib/dbm/dumb.py but not sure if rewrites the whole DB if a few key: value pairs are modified... – Basj Apr 29 '20 at 10:47
  • @MatthewD.Scholefield It's reopened, it would be interesting to have your answer! (with some sample code if possible) – Basj Apr 29 '20 at 12:17

7 Answers

42

You can use sqlitedict, which provides a key-value interface to an SQLite database.

The SQLite limits page says the theoretical maximum is 140 TB, depending on page_size and max_page_count. However, the defaults for Python 3.5.2-2ubuntu0~16.04.4 (sqlite3 2.6.0) are page_size=1024 and max_page_count=1073741823. This gives a maximal database size of ~1100 GB, which fits your requirement.
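
If you want to verify these limits on your own build, here is a quick sketch with the stdlib sqlite3 module (the PRAGMA names are standard SQLite; the printed values depend on your build):

import sqlite3

con = sqlite3.connect(':memory:')
page_size = con.execute('PRAGMA page_size').fetchone()[0]
max_page_count = con.execute('PRAGMA max_page_count').fetchone()[0]
print('max DB size ~', page_size * max_page_count / 2**30, 'GiB')
con.close()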

You can use the package like:

from sqlitedict import SqliteDict

mydict = SqliteDict('./my_db.sqlite', autocommit=True)
mydict['some_key'] = any_picklable_object
print(mydict['some_key'])
for key, value in mydict.items():
    print(key, value)
print(len(mydict))
mydict.close()
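
For bulk loading it may be worth turning off autocommit and committing in batches instead. A sketch, assuming SqliteDict's commit() method and an arbitrary key scheme and batch size:

from sqlitedict import SqliteDict

mydict = SqliteDict('./my_db.sqlite')      # autocommit is off by default
for i in range(100000):
    mydict['key_%d' % i] = {'value': i}
    if i % 10000 == 9999:
        mydict.commit()                    # write the batch in one transaction
mydict.commit()
mydict.close()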

Update

About memory usage: SQLite doesn't need your dataset to fit in RAM. By default it caches up to cache_size pages, which is barely 2 MiB (for the same Python build as above). Here's a script you can use to check it with your data. Before running it:

pip install lipsum psutil matplotlib psrecord sqlitedict

sqlitedct.py

#!/usr/bin/env python3

import os
import random
from contextlib import closing

import lipsum
from sqlitedict import SqliteDict

def main():
    with closing(SqliteDict('./my_db.sqlite', autocommit=True)) as d:
        for _ in range(100000):
            v = lipsum.generate_paragraphs(2)[0:random.randint(200, 1000)]
            d[os.urandom(10)] = v

if __name__ == '__main__':
    main()

Run it like ./sqlitedct.py & psrecord --plot=plot.png --interval=0.1 $!. In my case it produces this chart:

(chart: CPU and memory usage of the script over time)

And database file:

$ du -h my_db.sqlite 
84M my_db.sqlite
saaj
  • Very nice benchmark, thank you @saaj! Being curious: what does `with closing(...) as ...:` do? – Basj Jan 18 '18 at 10:21
  • About the graph CPU y-axis between 0 and 200%, average 150%, is the y-axis correct? – Basj Jan 18 '18 at 10:22
  • @Basj 1) [`contextlib.closing`](https://docs.python.org/3/library/contextlib.html#contextlib.closing). 2) I think it is, as `sqlite3` creates own thread which releases GIL when operating in `_sqlite3` binary. So it gets over 100%. – saaj Jan 18 '18 at 12:25
  • Excellent answer, I updated mine accordingly https://stackoverflow.com/a/48298904/140837 – amirouche Feb 05 '18 at 03:19
  • 1
    There's also [diskcache](http://www.grantjenks.com/docs/diskcache/) which is pure-Python, requires no server, [fast](http://www.grantjenks.com/docs/diskcache/cache-benchmarks.html) and also built atop SQLite. The largest known diskcache database is 75GB. – GrantJ Nov 09 '18 at 22:09
10

LMDB (Lightning Memory-Mapped Database) is a very fast key-value store which has Python bindings and can handle huge database files easily.

There is also the lmdbm wrapper which offers the Pythonic d[key] = value syntax.

By default it only supports byte values, but it can easily be extended to use a serializer (json, msgpack, pickle) for other kinds of values.

import json
from lmdbm import Lmdb

class JsonLmdb(Lmdb):
  def _pre_key(self, value):
    return value.encode("utf-8")
  def _post_key(self, value):
    return value.decode("utf-8")
  def _pre_value(self, value):
    return json.dumps(value).encode("utf-8")
  def _post_value(self, value):
    return json.loads(value.decode("utf-8"))

with JsonLmdb.open("test.db", "c") as db:
  db["key"] = {"some": "object"}
  obj = db["key"]
  print(obj["some"])  # prints "object"

Some benchmarks. Batched inserts (1000 items each) were used for lmdbm and sqlitedict; write performance suffers a lot with non-batched inserts because each insert opens a new transaction by default (a sketch of such batched inserts follows the tables below). dbm refers to the stdlib dbm.dumb. Tested on Win 7, Python 3.8, SSD.

continuous writes in seconds

| items | lmdbm | pysos |sqlitedict|   dbm   |
|------:|------:|------:|---------:|--------:|
|     10| 0.0000| 0.0000|   0.01600|  0.01600|
|    100| 0.0000| 0.0000|   0.01600|  0.09300|
|   1000| 0.0320| 0.0460|   0.21900|  0.84200|
|  10000| 0.1560| 2.6210|   2.09100|  8.42400|
| 100000| 1.5130| 4.9140|  20.71700| 86.86200|
|1000000|18.1430|48.0950| 208.88600|878.16000|

random reads in seconds

| items | lmdbm | pysos |sqlitedict|  dbm   |
|------:|------:|------:|---------:|-------:|
|     10| 0.0000|  0.000|    0.0000|  0.0000|
|    100| 0.0000|  0.000|    0.0630|  0.0150|
|   1000| 0.0150|  0.016|    0.4990|  0.1720|
|  10000| 0.1720|  0.250|    4.2430|  1.7470|
| 100000| 1.7470|  3.588|   49.3120| 18.4240|
|1000000|17.8150| 38.454|  516.3170|196.8730|

For the benchmark script see https://github.com/Dobatymo/lmdb-python-dbm/blob/master/benchmark.py
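
For reference, here is a sketch of the batched inserts mentioned above, using the dict-style update() (it assumes the JsonLmdb class defined earlier; the batch size of 1000 matches the benchmark, the key/value scheme is arbitrary):

import itertools

def batches(pairs, size=1000):
    # group an iterable of (key, value) pairs into dicts of at most `size` items
    it = iter(pairs)
    while True:
        chunk = dict(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk

with JsonLmdb.open("test.db", "c") as db:
    pairs = (("key_%d" % i, {"some": "object_%d" % i}) for i in range(100000))
    for chunk in batches(pairs):
        db.update(chunk)  # one transaction per 1000 items instead of one per item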

C. Yduqoli
  • Thank you for your answer. Can you include sample code (with imports etc.) in your answer, so that it's better for future reference and for people who want to try quickly this solution. – Basj Nov 06 '20 at 18:45
  • Can you also include some benchmarking? Thank you for sharing this library! – 0x90 Nov 11 '20 at 02:16
  • I tried a mini benchmark here https://gist.github.com/dagnelies/eae73f7341ef00068c3d27cd488f33bc but performance looked quite disastrous. Did I do something wrong? Inserting 100k small key/value pairs took LMDB whopping 6 minutes!?! ...as a baseline comparison, pysos does it in ~4 seconds. – dagnelies Nov 25 '20 at 08:59
  • 1
    @dagnelies You might want to use .update() for batch inserts, otherwise each insert will be one transaction. This should give you at least 10 times the performance. LMDB is a full featured database with ACID transactions and concurrent access. Also it doesn't store all keys in memory. Writes are thus more expensive. Your benchmark does not concern read performance, memory usage, crash safety or concurrency. When I have time, I will try to add some more comprehensive comparisons. – C. Yduqoli Nov 26 '20 at 06:37
  • 1
    Thanks for this great answer! I am working on a 300 million items one-to-many lookup table. I tried `sqlitedict`, `wiredtiger`, `pysos` and `lmdb`. I can confirm using `Lmdb.update()` is key for a decent write performance. Inserting elements one-by-one, `lmdb` was able to write ~1k items per second, for `sqlitedict` and `pysos` it was ~3k/s and ~80k/s, respectively. With `wiredtiger` I got some tough errors and the support is sparse. Finally, inserting 100k at a time into `lmdb` reaches an impressing 100-500k/s speed. Also it's light on memory, unlike some of the other libraries here. – deeenes Sep 17 '21 at 12:57
7

I would consider HDF5 for this. It has several advantages:

  • Usable from many programming languages.
  • Usable from Python via the excellent h5py package.
  • Battle tested, including with large data sets.
  • Supports variable-length string values.
  • Values are addressable by a filesystem-like "path" (/foo/bar).
  • Values can be arrays (and usually are), but do not have to be.
  • Optional built-in compression.
  • Optional "chunking" to allow writing chunks incrementally.
  • Does not require loading the entire data set into memory at once.

It does have some disadvantages too:

  • Extremely flexible, to the point of making it hard to define a single approach.
  • Complex format, not feasible to use without the official HDF5 C library (but there are many wrappers, e.g. h5py).
  • Baroque C/C++ API (the Python one is not so).
  • Little support for concurrent writers (or writer + readers). Writes might need to lock at a coarse granularity.

You can think of HDF5 as a way to store values (scalars or N-dimensional arrays) inside a hierarchy inside a single file (or indeed multiple such files). The biggest problem with just storing your values in a single disk file would be that you'd overwhelm some filesystems; you can think of HDF5 as a filesystem within a file which won't fall down when you put a million values in one "directory."
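
A minimal sketch of dict-like usage with h5py (the file name and keys are illustrative; keys must be strings because they become HDF5 paths, and re-assigning an existing key requires deleting it first):

import h5py

with h5py.File('myfile.h5', 'a') as d:       # 'a': create the file if needed
    d['hello'] = 17                          # scalar dataset
    d['183'] = [12, 14, 24]                  # list stored as a 1-D array
    d['text/abc'] = 'a variable-length string value'
    d.flush()                                # push pending writes to disk
    print(d['183'][:])                       # read back as a NumPy array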

John Zwinck
  • 4
    HDF5 is not a database, it's a serialization format. It requires loading the whole thing into memory. – amirouche Jan 17 '18 at 10:25
  • 1
    Thank you. Can you include 3 or 4 lines of code showing how to use it, like in the (edited) question? i.e. `import ...` then creation of DB then `d[key] = value` then flush it to disk. – Basj Jan 17 '18 at 11:52
  • 8
    @amirouche: Obviously HDF5 is not a database. The question did not ask for a database. HDF5 does not require loading the whole of anything into memory--you can load slices, "hyperslabs", single arrays or attributes in a hierarchical file, etc. It absolutely does not require loading anything more than you want into memory. Anyway OP's data is on the order of 100 GB, and 100 GB of main memory is easily found on commodity servers and even some desktops these days. – John Zwinck Jan 17 '18 at 12:59
  • 1
    @Basj: Please see h5py's excellent Quick Start tutorial here: http://docs.h5py.org/en/latest/quick.html - it has code for exactly what you want. – John Zwinck Jan 17 '18 at 13:01
  • Would this be optimal for text data? – SantoshGupta7 May 26 '19 at 20:03
6

The shelve module in the standard library does just that:

import shelve
with shelve.open('myfile.db') as d:
    d['hello'] = 17  # Auto serializes any Python object with pickle
    d[str(183)] = [12, 14, 24]  # Keys, however, must be strings
    d.sync()  # Explicitly write to disc (automatically performed on close)

This uses Python's dbm module to save and load data from disk without loading the entire thing into memory.

Example with dbm:

import dbm, json
with dbm.open('myfile2.db', 'c') as d:
    d['hello'] = str(17)
    d[str(183)] = json.dumps([12, 14, 24])
    d.sync()

However, there are two considerations when using shelve:

  • It uses pickle for serialization. What this means is the data is coupled with Python and possibly the python version used to save the data. If this is a concern, the dbm module can be used directly (same interface, but only strings can be used as keys/values).
  • The Windows implementation seems to have bad performance

For these reasons, the following third-party options (copied from here) would be good alternatives:

  • semidbm - Faster cross-platform dbm implementation
  • UnQLite - More feature-filled serverless database
  • More mentioned in the link
Basj
Matthew D. Scholefield
  • 1
    Thanks for your answer! I added a dbm example for future reference, I hope it's ok for you. – Basj Apr 29 '20 at 13:16
5

I know it's an old question, but I wrote something like this long ago:

https://github.com/dagnelies/pysos

It works like a normal Python dict, but has the advantage that it's much more efficient than shelve on Windows and is also cross-platform, unlike shelve, whose data storage differs based on the OS.

To install:

pip install pysos

Usage:

import pysos
db = pysos.Dict('somefile')
db['hello'] = 'persistence!'

EDIT: Performance

Just to give a ballpark figure, here is a mini benchmark (on my windows laptop):

import time

import pysos

t = time.time()

N = 100 * 1000
db = pysos.Dict("test.db")
for i in range(N):
    db["key_" + str(i)] = {"some": "object_" + str(i)}
db.close()

print('PYSOS time:', time.time() - t)
# => PYSOS time: 3.424309253692627

The resulting file was about 3.5 MB. So, very roughly speaking, you can insert about 1 MB of data per second.

EDIT: How it works

It writes every time you set a value, but only the key/value pair. So the cost of adding/updating/deleting an item is always the same, although adding only is "better" because lots of updating/deleting leads to data fragmentation in the file (wasted junk bytes). What is kept in memory is the mapping (key -> location in the file), so you just have to ensure there is enough RAM for all those keys. SSD is also highly recommended. 100 MB is easy and fast. 100 GB like posted originally will be a lot, but doable. Even raw reading/writing 100 GB takes quite some time.
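
As a rough illustration of that design (a toy sketch, not pysos's actual code): an append-only record log on disk, with only a key -> file-offset index kept in RAM:

import json
import os

class TinyLogDict:
    """Toy append-only key/value log: values live on disk, only offsets stay in RAM."""
    def __init__(self, path):
        self._f = open(path, 'a+b')
        self._offsets = {}                       # key -> offset of its latest record
        self._f.seek(0)
        offset = 0
        for line in self._f:                     # rebuild the index on open
            self._offsets[json.loads(line)['k']] = offset
            offset += len(line)

    def __setitem__(self, key, value):
        self._f.seek(0, os.SEEK_END)
        self._offsets[key] = self._f.tell()      # remember where this record starts
        self._f.write(json.dumps({'k': key, 'v': value}).encode() + b'\n')

    def __getitem__(self, key):
        self._f.seek(self._offsets[key])         # jump straight to the record
        return json.loads(self._f.readline())['v']

    def close(self):
        self._f.close()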

dagnelies
  • Nice, I'm going to try it! When is it written to disk? Do we need to do db.flush() every now and then? Also, if we have 100 MB data in it, does resaving to disk imply rewriting all 100 MB or only what has changed since last write? – Basj Jun 21 '20 at 08:57
  • it writes every time you set a value, but only the key/value pair. So the cost of adding/updating/deleting an item is always the same, although adding only is "better" because lots of updating/deleting leads to data fragmentation in the file (wasted junk bytes). What is kept in memory is the mapping (key -> location in the file), so you just have to ensure there is enough RAM for all those keys. SSD is also highly recommended. 100 MB is easy and fast. 100 GB like posted originally will be a lot, but doable. Even raw reading/writing 100 GB takes quite some time. – dagnelies Jun 23 '20 at 07:31
  • Thank you. Could you include this useful information in the answer for future reference @dagnelies? because this is really the important part. – Basj Jun 23 '20 at 07:33
  • How is pysos different from https://docs.python.org/3/library/dbm.html#module-dbm.dumb ? It seems to work the same way (keeping only the mapping in memory, write every time, ...) – C. Yduqoli Nov 18 '20 at 08:36
  • 1
    @C.Yduqoli: if I remember right, pysos had by far better performances than the "dumb" dbm at the time which sucked badly. However, please keep in mind that this was 5 years ago, things might have evolved during this time. It would be nice if you make a small benchmark and post the results. Aside from that pysos also encodes your key/value objects as json and the "db" file is readable, while dbm deals with bytes as key/values encoded in binary files. There are a few more details where they probably differ, but both share the similarity of being a key/value store at their core. – dagnelies Nov 19 '20 at 20:01
3

First, bsddb (or, under its new name, Oracle BerkeleyDB) is not deprecated.

From experience, LevelDB / RocksDB / bsddb are slower than wiredtiger; that's why I recommend wiredtiger.

wiredtiger is the storage engine for mongodb, so it's well tested in production. There is little or no use of wiredtiger in Python outside my AjguDB project; I use wiredtiger (via AjguDB) to store and query wikidata and concept, which is around 80 GB.

Here is an example class that mimics the python2 shelve module. Basically, it's a wiredtiger-backed dictionary where keys can only be strings:

import json

from wiredtiger import wiredtiger_open


WT_NOT_FOUND = -31803


class WTDict:
    """Create a wiredtiger backed dictionary"""

    def __init__(self, path, config='create'):
        self._cnx = wiredtiger_open(path, config)
        self._session = self._cnx.open_session()
        # define key value table
        self._session.create('table:keyvalue', 'key_format=S,value_format=S')
        self._keyvalue = self._session.open_cursor('table:keyvalue')

    def __enter__(self):
        return self

    def close(self):
        self._cnx.close()

    def __exit__(self, *args, **kwargs):
        self.close()

    def _loads(self, value):
        return json.loads(value)

    def _dumps(self, value):
        return json.dumps(value)

    def __getitem__(self, key):
        self._session.begin_transaction()
        self._keyvalue.set_key(key)
        if self._keyvalue.search() == WT_NOT_FOUND:
            raise KeyError()
        out = self._loads(self._keyvalue.get_value())
        self._session.commit_transaction()
        return out

    def __setitem__(self, key, value):
        self._session.begin_transaction()
        self._keyvalue.set_key(key)
        self._keyvalue.set_value(self._dumps(value))
        self._keyvalue.insert()
        self._session.commit_transaction()

Here is the test program adapted from @saaj's answer:

#!/usr/bin/env python3

import os
import random

import lipsum
from wtdict import WTDict


def main():
    with WTDict('wt') as wt:
        for _ in range(100000):
            v = lipsum.generate_paragraphs(2)[0:random.randint(200, 1000)]
            wt[os.urandom(10)] = v

if __name__ == '__main__':
    main()

Using the following command line:

python test-wtdict.py & psrecord --plot=plot.png --interval=0.1 $!

I generated the following diagram:

(chart: wt performance without write-ahead log)

$ du -h wt
60M wt

When write-ahead-log is active:

(chart: wt performance with write-ahead log)

$ du -h wt
260M    wt

This is without performance tuning or compression.

Wiredtiger had no documented limits until recently; the documentation was updated to the following:

WiredTiger supports petabyte tables, records up to 4GB, and record numbers up to 64-bits.

http://source.wiredtiger.com/1.6.4/architecture.html

amirouche
  • Thanks. Can you give an example of code using `wiredtiger`? Is it possible to use it with a simple API like `import wiredtiger` `wt = wiredtiger.wiredtiger('myfile.db')` `wt['hello'] = 17` `wt[183] = [12, 14, 24]` `wt.flush()` i.e. the main requirements are: 1) `wt[key] = value` syntax 2) able to use string or int or float as key 3) able to store lists as values 4) easy to flush on disk – Basj Jan 17 '18 at 11:38
  • Is it for Python 2 or Python 3? – amirouche Jan 17 '18 at 13:20
  • Using the wiredtiger engine is a cool idea. However the Python bindings seem to be maintained poorly. – C. Yduqoli Dec 08 '20 at 07:36
  • @C.Yduqoli what are the python bindings you are referring to? They are maintained by mongodb. – amirouche Dec 11 '20 at 17:57
  • 1
    @amirouche the official ones here https://pypi.org/project/wiredtiger/ There is no windows support for example, although the wiredtiger C code does support Windows. – C. Yduqoli Dec 14 '20 at 00:52
  • @C.Yduqoli I had some trouble with wiredtiger, "fine tuning" is very difficult; it's as if the database is leaking memory. The documentation even recommends as much RAM as stored data. Anyway, with Python you will not be able to take advantage of POSIX threads. I recommend you give the sqlite lsm extension (which can be built standalone as a shared library from the sqlite3 repository) a try, OR foundationdb (even if it's a different beast...) – amirouche Dec 20 '20 at 14:09
3

Another solution worth taking a look at is DiskCache's Index (API docs). It's atomic, thread- and process-safe, and it has transactions (see the features comparison here).
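
For completeness, a small usage sketch (the directory name is arbitrary; Index persists to a SQLite-backed cache in that directory and behaves like a dict, with keys and values pickled):

from diskcache import Index

d = Index('my_index_dir')
d['hello'] = 17
d[183] = [12, 14, 24]   # non-string keys and list values work, since both are pickled
print(d[183])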

Epoc