
I'm running a Python program which uses the shelve module on top of pickle. After running this program sometimes I get one output file as a.data but at other times I get three output files as a.data.bak, a.data.dir and a.data.dat.

Why is that?

Martijn Pieters
Ali_IT
  • It's probably the program doing that. The [shelve](http://hg.python.org/cpython/file/e0c0bcd60033/Lib/shelve.py) module itself doesn't do anything like this. – mata Apr 23 '13 at 16:11
  • 4
    `"As a side-effect, an extension may be added to the filename and more than one file may be created."` [(c)](http://docs.python.org/2/library/shelve.html#shelve.open) This doesn't answer _why_, though. – Lev Levitsky Apr 23 '13 at 17:09

1 Answer


There is quite a bit of indirection here, so follow along carefully.

The shelve module is implemented on top of the dbm module. This module acts as a facade for 3 (*) different specific DBM implementations, and it will pick the first module available when creating a new database, in the following order:

  1. dbm.gnu, Python module for the GNU DBM library; you would use it directly if you needed the extra functionality it offers over the base dbm module (it lets you iterate over the keys in stored order and 'pack' the database to free up space from deleted objects).
  2. dbm.ndbm, a proxy module using one of the ndbm, BSD DB or GNU DBM libraries (chosen when Python is compiled).
  3. dbm.dumb, a pure-python implementation.

It is this range of choices that makes shelve files appear to grow extra extensions on different platforms.
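To see which implementation was picked on a given machine, you can ask `dbm.whichdb()` after creating a shelf. This is only a small sketch: the temporary directory is an assumption of mine so the example cleans up after itself, and the `a.data` base name is borrowed from the question. The file list and the backend name will differ between platforms and Python builds.

```python
import dbm
import os
import shelve
import tempfile

# Create a shelf in a scratch directory (hypothetical location, for illustration).
with tempfile.TemporaryDirectory() as tmp:
    base = os.path.join(tmp, "a.data")

    # shelve delegates to dbm.open(), which picks the first backend available.
    with shelve.open(base) as shelf:
        shelf["answer"] = 42

    # Which files were created depends on the backend that was chosen.
    files = sorted(os.listdir(tmp))

    # whichdb() inspects the files on disk and reports the module name.
    backend = dbm.whichdb(base)

print("files created:", files)
print("backend used:", backend)
```

If the backend reported is `dbm.dumb`, you should see the multi-file layout; with one of the other backends you will typically see fewer files.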

The dbm.dumb module is the one that adds the .bak, .dat and .dir extensions:

"Open a dumbdbm database and return a dumbdbm object. The filename argument is the basename of the database file (without any specific extensions). When a dumbdbm database is created, files with .dat and .dir extensions are created."

The .dir file is moved to .bak as new index dicts are committed for changes made to the data structures (when adding a new key, deleting a key, or by calling .sync() or .close()).
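You can watch this happen by using `dbm.dumb` directly. The sketch below assumes Python 3 and a temporary directory of my choosing; it writes one key, closes the database, then writes a second key in a new session and lists the files after each step. I would expect the `.bak` file to show up only after the second commit, but the exact moment depends on the Python version, so check the printed output rather than taking my word for it.

```python
import dbm.dumb
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    base = os.path.join(tmp, "a.data")

    # First session: creating the database produces the .dat and .dir files.
    db = dbm.dumb.open(base, "c")
    db[b"first"] = b"1"
    db.close()
    after_first = sorted(os.listdir(tmp))

    # Second session: committing a new key rewrites the index; the previous
    # .dir file is kept around as .bak.
    db = dbm.dumb.open(base, "c")
    db[b"second"] = b"2"
    db.close()
    after_second = sorted(os.listdir(tmp))

print("after first close: ", after_first)
print("after second close:", after_second)
```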

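If you see these files, it means that none of the other implementations (dbm.gnu and dbm.ndbm, or the other three options for anydbm in Python 2) was available on your platform when the database was created.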

The other formats may give you other extensions. The dbm.ndbm module (named dbm in Python 2) may use .dir, .pag or .db, depending on which library was used to build that module.


(*) Python 2 had four dbm modules; by default it would use the deprecated dbhash module, which in turn was built on top of the bsddb module. Both were removed in Python 3.

Martijn Pieters
  • Thank you for the answer. just one other thing. Do you know how can I force my program to use dumbdbm not any other modules for databases? – Ali_IT Apr 26 '13 at 22:13
  • 4
    Create two empty files with the `.dir` and `.dat` extensions if they do not yet exist, after which `anydbm` will assume there is already a `dumbdbm` database there and use the `dumbdbm` module. – Martijn Pieters Apr 26 '13 at 23:40
  • This makes sense.. but at the same time it's kind of horrible and "non-pythonic".. I would prefer to have it stored as plain text in a "objects.jsons" kind of file. Anyone who needs performance won't be using this anyways. But these multiple files take away from the clarity. – avloss Jul 02 '18 at 13:53
  • 2
    @avloss: Why is it unpythonic? You get a single API that gives you the same kind of storage across a wide range of platforms. The extensions of the resulting files are an implementation detail. `objects.json` is not going to be as easy to use as `shelve` is. – Martijn Pieters Jul 02 '18 at 17:22
  • Of more concern is the question of how safe it is to delete the .bak file manually, or why dumbdbm does not do that automatically at some point (when the writing process closes the shelve object, for example) or on some method call. Or at least, under what conditions can .bak be safely deleted? – Anatoly Alekseev Apr 04 '21 at 14:16
  • 1
    @AnatolyAlekseev why is that a concern? The file is a copy of the last state of the `.dir` file and can be deleted whenever you feel the storage is still in a 'sane' state. That's not for `dbm.dumb` to decide, as it can't foresee crashes or bugs. Instead, it replaces the `.bak` file each time a new `.dir` file is written. – Martijn Pieters Apr 05 '21 at 12:55
  • @martijn-pieters Thanks, yes I thought so as well. I have 2 processes, one writer and one reader. According to the docs, multiple writers are not supported, but concurrent readers should be ok. But it seems that sometimes just opening the same shelve object for reading leads to losing data. I thought it could be due to some problem with such files (not being synched properly, or being overwritten). Not sure yet, need to watch more. – Anatoly Alekseev Apr 05 '21 at 16:10
  • 1
    @AnatolyAlekseev: note that when you _replace_ the value for a key, the `.dir` file is not synched. Call `.sync()` explicitly in such cases. – Martijn Pieters Apr 06 '21 at 21:32
  • @AnatolyAlekseev: however, for multiple readers, I'd use a sqlite database instead. – Martijn Pieters Apr 06 '21 at 21:32
  • Yes I'm also thinking to replace shelve temp solution with smth like redis in production. – Anatoly Alekseev Apr 07 '21 at 17:10
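
On the question in the comments about forcing the dumb backend: besides pre-creating the empty `.dir` and `.dat` files, one alternative in Python 3 is to open the database with `dbm.dumb` yourself and wrap it in `shelve.Shelf`. This is a sketch of that idea rather than the method suggested above; the temporary directory and the `key` entry are made up for illustration.

```python
import dbm
import dbm.dumb
import os
import shelve
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    base = os.path.join(tmp, "a.data")

    # Open the dumb database explicitly and hand it to Shelf, bypassing
    # the backend selection that shelve.open() would normally perform.
    with shelve.Shelf(dbm.dumb.open(base, "c")) as shelf:
        shelf["key"] = {"value": 1}

    # The files on disk are now recognised as a dbm.dumb database ...
    detected = dbm.whichdb(base)

    # ... so a plain shelve.open() on the same base name reuses that backend.
    with shelve.open(base) as shelf:
        restored = shelf["key"]

print(detected)
print(restored)
```

Once the database exists in the dumb format, later calls to `shelve.open()` keep using it, because the backend is detected from the existing files.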