
I used to write small Python programs for simple data analysis. Python is easy to use and efficient for this kind of work.

Recently I have started running into situations where the data in my problem is simply too big to fit entirely in memory for Python to process.

I have been researching possible persistence implementations for Python. I found pickle and some other libraries that are quite interesting but not exactly what I am looking for.

Simply put, the way pickle handles persistence is not transparent to the program. The programmer needs to handle it explicitly - loading, saving, etc.
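
For example, with pickle the save and load steps are explicit calls I have to make myself:

import pickle

d = {'k': 'v'}
with open('dict.pkl', 'wb') as f:
    pickle.dump(d, f)   # explicit save
with open('dict.pkl', 'rb') as f:
    d = pickle.load(f)  # explicit load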

I was wondering whether it could be implemented so that it can be used more seamlessly. For example:

d1 = p({'k':'v'}) # where p() is the persistent version of a dictionary
print d1['k'] # which gives 'v', same as if it were an ordinary dictionary
d1.dump('dict.pkl')  # save the dictionary to a file, a database, etc.

That is, to overload the dictionary methods with persistent versions. It looks doable to me, but I need to find out exactly how many methods I would have to work on.
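
From a quick look, collections.MutableMapping seems to require only five methods (__getitem__, __setitem__, __delitem__, __iter__, __len__) and derives the rest, so a rough sketch of what I have in mind - using the standard shelve module purely as a stand-in backend, and PersistentDict as a placeholder name - might be:

import collections
import shelve

class PersistentDict(collections.MutableMapping):
    # Dict-like object whose entries live in a shelve file instead of memory.
    def __init__(self, filename, initial=None):
        self._db = shelve.open(filename)  # keys must be strings
        if initial:
            self._db.update(initial)

    # The five methods below are all MutableMapping requires; it derives
    # __contains__, get, update, setdefault, etc. from them.
    def __getitem__(self, key):
        return self._db[key]

    def __setitem__(self, key, value):
        self._db[key] = value

    def __delitem__(self, key):
        del self._db[key]

    def __iter__(self):
        return iter(self._db.keys())

    def __len__(self):
        return len(self._db)

    def close(self):
        self._db.close()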

Browsing the Python source could help, but I haven't really dug into it at that level. I hope you can offer me some pointers and direction on this.

Thanks!

EDIT

Apologies that I was not very clear in my original question. I wasn't really looking to save the data structure, but rather for some internal "paging" mechanism that can run behind the scenes when my program runs out of memory. For example:

d1 = p({}) # create a persistent dictionary
d1['k1'] = 'v1' # add
# add another, maybe 1 billion more, entries on to the dictionary
print d1.has_key('k9999999999') # entry that is not in memory

Totally behind the scenes, with no save/load/search required from the programmer.

chapter3
  • I think it would be a better idea to create your own classes to handle files and expose methods similar to those of the regular data structures - add, append, delete, contains, etc. This could be extended into a much larger project. In the case of a set, it might be much easier to use a Bloom filter with a large number of bits - it doesn't store the data itself, only hashes of the incoming elements. We also have to identify the limits under which using files might be overkill. – Aditya Jun 11 '15 at 06:00
  • There are a couple of examples here : http://stackoverflow.com/questions/9449674/how-to-implement-a-persistent-python-list – Aditya Jun 11 '15 at 06:00
  • I agree with the comment from @AdityaJoshi. Check out https://github.com/seomoz/pyreBloom. – okoboko Jun 11 '15 at 06:01
  • @okoboko cool! I wasn't aware of this tool. – Aditya Jun 11 '15 at 06:12
  • @AdityaJoshi Thanks for the link! But that is more about saving a list that lives *in memory*. I am looking at whether the data structure can become persistent behind the scenes while the program is manipulating it (sorry, maybe I wasn't very clear in the original question). – chapter3 Jun 11 '15 at 06:21
  • @okoboko Thanks for the resource! That's very interesting – chapter3 Jun 11 '15 at 06:21

3 Answers


Check out ZODB. http://www.zodb.org/en/latest

It is a proven solution with transactional features.
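
A minimal sketch of the sort of usage involved (untested here; the data.fs filename and the choice of an OOBTree - which loads its buckets from disk on demand rather than keeping everything in memory - are just for illustration):

from ZODB import FileStorage, DB
from BTrees.OOBTree import OOBTree
import transaction

storage = FileStorage.FileStorage('data.fs')  # file-backed storage
db = DB(storage)
conn = db.open()
root = conn.root()

if 'd1' not in root:
    root['d1'] = OOBTree()   # a BTree scales to very large mappings
d1 = root['d1']

d1['k1'] = 'v1'
transaction.commit()         # persist the change

print 'k1' in d1             # entries are fetched from disk as needed
conn.close()
db.close()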

sureshvv
  • Thanks for the advice! I did check out ZODB, but I thought it would be overkill for my situation here - it is a full-blown database engine. – chapter3 Jun 11 '15 at 05:55
  • 1
    You don't have to use it all at once. It is remarkably well thought out, so you can use just parts of it. – sureshvv Jun 11 '15 at 07:38

anydbm works almost exactly like your example and should be reasonably fast. One issue is that it only handles string keys and string contents. I'm not sure if opening and closing the db is too much overhead. You could probably wrap this in a context manager to make it a bit nicer. Also, you'd need some magic to use different filenames each time p is called.

import anydbm

def p(initial):
    d = anydbm.open('cache', 'c')  # 'c' creates the file if it does not exist
    for key, value in initial.items():  # dbm objects don't reliably support update()
        d[key] = value
    return d

d1 = p({}) # create a persistent dictionary
d1['k1'] = 'v1' # add

# add another, maybe 1 billion more, entries on to the dictionary
for i in xrange(100000):
    d1['k{}'.format(i)] = 'v{}'.format(i)

print d1.has_key('k9999999999') # entry that is not in memory, prints False

d1.close() # You have to close it yourself
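
If you do want the context-manager version, something like this should work (just a sketch; contextlib.closing simply calls close() on the way out):

import anydbm
import contextlib

with contextlib.closing(anydbm.open('cache', 'c')) as d1:
    for i in xrange(100000):
        d1['k{}'.format(i)] = 'v{}'.format(i)
    print d1.has_key('k9999999999')  # False; nothing is held in memory
# the database has been closed automatically at this point
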
chthonicdaemon
  • Thanks for your advice! How am I going to update the anydbm object each time the dictionary gets updated? Not sure if I'm reading your code correctly. – chapter3 Jun 11 '15 at 07:35
  • `d1` behaves just as you describe in your original request; in other words, when you do `d1['k1'] = 'v1'`, it actually writes to a file transparently. There is no dictionary in memory - it's all going to the file. – chthonicdaemon Jun 11 '15 at 08:17
  • Thanks again! Let me try it out in more detail before I mark yours as the answer. – chapter3 Jun 11 '15 at 10:46

web2py has a really nice database abstraction layer (DAL) for this sort of thing and comes with SQLite built in, though you can swap SQLite out for a different database such as PostgreSQL. For your case, SQLite should be adequate. Your example would translate like this:

# model goes into one file
# there's some preamble stuff I'm not showing here
db.define_table('p', Field('k'))

# controller goes into separate file
d1 = db.p.insert(k='v')  # this saves k='v' into the persistent 'p' database table, returning the record number, which is assigned to d1
print db.p[d1].k  # this would print "v"

Model and controller would go into separate files. You can use web2py just for the DAL, or you can also use its Python templating capabilities to make your app web-enabled.

When reading back more than one record at a time, you can convert the result with as_dict or as_array. Check out the DAL documentation for details.
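
For example, a rough sketch of reading all the records back at once might look like this (assuming the table defined above):

rows = db(db.p.id > 0).select()  # fetch every record from table 'p'
print rows.as_dict()             # a dict of records keyed by record id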