6

I was thinking that Python's native DBM should be considerably faster than NoSQL databases such as Tokyo Cabinet, MongoDB, etc., since Python DBM has fewer features and options (i.e. it is a simpler system). I tested it with a very simple write/read example:

#!/usr/bin/python
import time
t = time.time()
import anydbm

count = 0
while count < 1000:
    # open the file, write a single value, close
    db = anydbm.open("dbm2", "c")
    db["1"] = "something"
    db.close()
    # re-open the same file read-only and read the value back
    db = anydbm.open("dbm2", "r")
    print "db['1']: ", db["1"]
    print "%.3f" % (time.time() - t)
    db.close()
    count += 1

Read/Write: 1.3s, Read: 0.3s, Write: 1.0s

MongoDB is at least 5 times faster than these values. Is this really the Python DBM performance?

Googlebot
  • How are you timing your application? – Simone Oct 12 '11 at 09:58
  • with time() timer before the loop and at the end of each cycle. – Googlebot Oct 12 '11 at 10:10
  • So you are also timing the `print` calls and the garbage collection that Python does behind the scenes. You should also switch to clock(), which - at least on Windows - is more accurate. – Simone Oct 12 '11 at 10:18
  • I am testing others with the same system. I believe the execution time for time() is negligible. I have tested the databases with PHP too. All on Linux platform. – Googlebot Oct 12 '11 at 10:45

2 Answers

15

Python doesn't have a single built-in DBM implementation. Its DBM support (the anydbm module, for example) wraps a range of DBM-style third-party libraries, such as Berkeley DB and GNU DBM.
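
For what it's worth, you can check which of those backends anydbm actually picked on a given system; a minimal sketch, assuming the dbm2 file from your test:

import anydbm
import whichdb

# anydbm delegates to whichever DBM library is available
# (dbhash, gdbm, dbm or dumbdbm, in that order of preference)
db = anydbm.open("dbm2", "c")
db["1"] = "something"
db.close()

# report which underlying implementation actually backs the file
print whichdb.whichdb("dbm2")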

Python's dictionary implementation is really fast for key-value storage, but not persistent. If you need high-performance runtime key-value lookups, you may find a dictionary better - you can manage persistence with something like cPickle or shelve. If startup times are important to you (and, if you're modifying the data, termination times) - more important than runtime access speed - then something like DBM would be better.
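
For illustration, a minimal sketch of the dictionary-plus-pickle approach (the file name cache.pkl is just an assumption):

import os
import cPickle as pickle

CACHE_FILE = "cache.pkl"  # hypothetical file name

# load the persisted dictionary at startup, or start empty
if os.path.exists(CACHE_FILE):
    with open(CACHE_FILE, "rb") as f:
        data = pickle.load(f)
else:
    data = {}

# fast in-memory key-value access at runtime
data["1"] = "something"
print data["1"]

# persist the whole dictionary at termination
with open(CACHE_FILE, "wb") as f:
    pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)

shelve does much the same thing but backs the dictionary with a DBM file, so individual keys are written as you go instead of the whole structure at once.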

In your evaluation, the main loop includes both the dbm open calls and the key lookup. It's a pretty unrealistic use case to open a DBM to store one value, then close and re-open it before looking the value up, and you're seeing the typically slow performance one would get when managing a persistent data store in such a manner (it's quite inefficient).

Depending on your requirements, if you need fast lookups and don't care too much about startup times, DBM might be a solution - but to benchmark it, only include writes and reads in the loop! Something like the below might be suitable:

import anydbm
from random import random
import time

# open DBM outside of the timed loops
db = anydbm.open("dbm2", "c")

max_records = 100000

# only time read and write operations
t = time.time()

# create some records
for i in range(max_records):
  db[str(i)] = 'x'

# do some random reads
for i in range(max_records):
  x = db[str(int(random() * max_records))]

time_taken = time.time() - t
print "Took %0.3f seconds, %0.5f microseconds / record" % (time_taken, (time_taken * 1000000) / max_records)

db.close()
Leon Derczynski
  • As a matter of fact, I tried Python Dictionary (with pickle); but its performance is not good in updating very large dataset (e.g. 2GB) in reading/writing the database file. Do you have any subtle suggestion to handle this? – Googlebot Oct 12 '11 at 11:28
  • It'll be slow, no doubt about it - although cPickle is usually substantially faster than pickle. cPickle comes bundled with most Python distributions, and is almost 100% compatible with pickle; you can possibly just replace 'pickle' with 'cPickle' in your code. For a key:value data structure of that size, a dbm implementation or even sqlite will probably have lower startup times. – Leon Derczynski Oct 12 '11 at 12:07
  • Thanks for clarification. Then, for a large database needing regular update, you suggest a database like GDBM? – Googlebot Oct 12 '11 at 12:35
  • Yes, if you have just key:value pairs, it's probably the simplest solution available for your situation. You might like to benchmark the various DBMs to see which works best; AnyDBM is described as the most portable Python DBM, which makes it good for examples and learning, but it might not be the best for you. – Leon Derczynski Oct 12 '11 at 13:03
  • Then what can be the best choice? Sorry for taking this discussion too long :) I've tried SQLite, MongoDB, Tokyo Cabinet/Tyrant. But a simple database of AnyDBM class fits better due to its simplicity. What can be the best one in this class? – Googlebot Oct 12 '11 at 13:25
  • The only way to be sure is to measure. Here is a good range of choices: http://docs.python.org/library/persistence.html :) – Leon Derczynski Oct 12 '11 at 15:24
3

Embedded key-value stores are fast enough in Python 3. Take the native dict as a baseline, say:

# auList is a list of keys and auDict an in-memory dict, both built beforehand
for k in random.choices(auList,k=100000000):
    a=auDict[k]
CPU times: user 1min 6s, sys: 1.07 s, total: 1min 7s

GDBM does not fare badly against this:

%%time
# db is assumed to be the dbm.gnu module (import dbm.gnu as db)
with db.open("AuDictJson.gdbm",'r') as d:
    for k in random.choices(auList,k=100000000):
        a=d[str(k)]
CPU times: user 2min 44s, sys: 1.31 s, total: 2min 45s

Even a specialised precompiled table, such as keyvi for JSON-serialised lists, can do almost the same:

%%time
d = keyvi.Dictionary("AuDictJson.keyvi")
for k in random.choices(auList,k=100000000):   
    a=d[str(k)].GetValue()
CPU times: user 7min 45s, sys: 1.48 s, total: 7min 47s

In general, an embedded database, especially when it is read-only and single-user, should always be expected to beat an external one, because of the overhead of the sockets and semaphores needed to access the resource. On the other hand, if your program is a service that already has some external I/O bottleneck - say, you are writing a web service - the overhead of accessing the resource may be unimportant.

That said, you can see some advantage in using an external database if it provides extra services. For Redis, consider the union of sets:

%%time
# r is assumed to be a redis.Redis() connection; sets named 's<key>' were loaded beforehand
for j in range(1000):
    k=r.sunion(('s'+str(k) for k in random.choices(auList,k=10000)))
CPU times: user 2min 24s, sys: 758 ms, total: 2min 25s

The same task with gdbm is in the same order of magnitude. Although Redis is still about five times slower, it is not so slow as to rule it out:

%%time 
with db.open("AuDictPSV.gdbm",'r') as d:
    for j in range(1000):
        a=set()
        for k in random.choices(auList,k=10000):
            a.update(d[str(k)].split(b'|'))
CPU times: user 33.6 s, sys: 3.5 ms, total: 33.6 s

By using Redis in this case, you get the full functionality of a database, beyond a simple datastore. Of course, with a lot of clients saturating it, or with a lot of single gets, it is going to perform poorly compared to the embedded resource.

As for gdbm's competition, a 2014 benchmark by Charles Leifer shows that it can outperform KyotoCabinet for reads while tying for writes, and that one could consider LevelDB and RocksDB as advanced alternatives.

arivero