11

I have implemented a suffix tree in Python to do full-text searches, and it's working really well. But there's a problem: the indexed text can be very big, so we won't be able to keep the whole structure in RAM.


IMAGE: Suffix tree for the word BANANAS (in my scenario, imagine a tree 100000 times bigger).

So, researching a little bit about it, I found the pickle module, a great Python module for "loading" and "dumping" objects from/into files, and guess what? It works wonderfully with my data structure.

So, to make a long story short: what would be the best strategy to store and retrieve this structure to/from disk? I mean, one solution could be to store each node in a separate file and load it from disk whenever it's needed, but that isn't the best thing to do (too many disk accesses).


Footnote: Although I have tagged this question as python, the programming language isn't the important part of the question; the disk storing/retrieving strategy is really the main point.

juliomalegria
  • An important question is whether you want to create this structure *once* and use it many times, or whether you want to create it and *allow updates*. – Greg Hewgill Nov 28 '11 at 18:19
  • @GregHewgill: basically, just one big text-processing to create the structure, and then just use it. – juliomalegria Nov 28 '11 at 18:23
  • By the way - use cPickle, it's much faster than pickle. Also, why not use JSON and, instead of working against the disk, work against a NoSQL DB? (I'm not expert enough in this subject to tell you which one, but NoSQL is a known solution to exactly this kind of scenario - much better, I believe, than multiple disk files) – alonisser Nov 28 '11 at 18:35
  • @alonisser: NoSQL is a group of possible solutions, but MySQL with data retrieval based only on primary keys is often as efficient as NoSQL solutions. – Tadeck Nov 28 '11 at 18:44
  • @Tadeck you are probably right, since I'm not an expert in the subject - also, I believe that something like MongoDB or CouchDB, as a document store, has a more native approach to using a JSON interface. – alonisser Nov 28 '11 at 18:53

5 Answers

4

An effective way to manage a structure like this is to use a memory-mapped file. In the file, instead of storing references for the node pointers, you store offsets into the file. You can still use pickle to serialise the node data to a stream suitable for storing on disk, but you will want to avoid storing references since the pickle module will want to embed your entire tree all at once (as you've seen).

Using the mmap module, you can map the file into address space and read it just like a huge sequence of bytes. The OS takes care of actually reading from the file and managing file buffers and all the details.

You might store the first node at the start of the file, and have offsets that point to the next node(s). Reading the next node is just a matter of reading from the correct offset in the file.

The advantage of memory-mapped files is that they aren't loaded into memory all at once, but only read from disk when needed. I've done this (on a 64-bit OS) with a 30 GB file on a machine with only 4 GB of RAM installed, and it worked fine.
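A rough sketch of that layout might look like this (purely illustrative: the length-prefixed record format and the helper names append_node/read_node are assumptions, and it presumes each node's child references have already been rewritten as file offsets before pickling):

import mmap
import pickle
import struct

def append_node(f, node):
    # Pickle one node (whose child references are already file offsets) and
    # append it as a length-prefixed record; return the offset it was written at.
    offset = f.tell()
    payload = pickle.dumps(node, pickle.HIGHEST_PROTOCOL)
    f.write(struct.pack('<I', len(payload)))
    f.write(payload)
    return offset

def read_node(mm, offset):
    # Read a single node back from the memory-mapped file at the given offset.
    (length,) = struct.unpack_from('<I', mm, offset)
    return pickle.loads(mm[offset + 4:offset + 4 + length])

# Reading: map the whole file and follow offsets on demand; the OS pages in
# only the parts of the file that are actually touched.
# with open('tree.idx', 'rb') as f:
#     mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
#     root_node = read_node(mm, 0)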

Greg Hewgill
  • 951,095
  • 183
  • 1,149
  • 1,285
  • Thanks all for answering, after testing all of your options and after a lot of deliberation I've made my decision: I'm gonna use `ZODB`, a very interesting module, check it out if you like this kind of thing. – juliomalegria Dec 01 '11 at 19:07
3

If pickle is already working for you, you may want to take a look at ZODB, which adds some functionality on top of pickle. Looking at the documentation, I saw this paragraph that seems to address the size concerns you're having:

The database moves objects freely between memory and storage. If an object has not been used in a while, it may be released and its contents loaded from storage the next time it is used.
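A minimal sketch of how that could look (assuming your node class can be made to subclass persistent.Persistent so nodes are loaded and evicted independently; the names Node and suffix_tree.fs are just placeholders):

import persistent
import transaction
from ZODB import FileStorage, DB

class Node(persistent.Persistent):
    """A suffix-tree node that ZODB can load lazily and evict when unused."""
    def __init__(self):
        self.children = {}   # for very large fan-out, a BTree would scale better

storage = FileStorage.FileStorage('suffix_tree.fs')
db = DB(storage)
connection = db.open()
root = connection.root()

root['suffix_tree'] = Node()   # build the tree under this root object
transaction.commit()           # persist; nodes are pulled from disk on access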

jcollado
  • Thanks all for answering, after testing all of your options and after a lot of deliberation I've made my decision: I'm gonna use `ZODB`, a very interesting module, check it out if you like this kind of thing. – juliomalegria Dec 01 '11 at 19:07
2

Try a compressed suffix tree instead.

The main idea is that instead of having lots of nodes of one character each, you can compact them into a single node covering multiple characters, thus saving the extra nodes.
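To make that concrete, here is a hypothetical sketch of the representation (the class name and layout are illustrative, not from any particular implementation): each node stores its incoming edge as a (start, end) slice into the indexed text instead of a single character.

text = 'BANANAS'

class CompressedNode(object):
    """A node whose incoming edge is the slice text[start:end]."""
    def __init__(self, start, end):
        self.start, self.end = start, end
        self.children = {}   # keyed by the first character of each child edge

# 'B' occurs only once, so the whole suffix 'BANANAS' hangs off the root as a
# single edge (0, 7) instead of a chain of seven one-character nodes.
root = CompressedNode(0, 0)
root.children['B'] = CompressedNode(0, len(text))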

This link (http://www.cs.sunysb.edu/~algorith/implement/suds/implement.shtml) says you can transform a 160 MB suffix tree into a 33 MB compressed suffix tree. Quite a gain.

These compressed trees are used for genetic substring matching on huge strings. I used to run out of memory with a suffix tree, but after I compressed it, the out-of-memory error disappeared.

I wish I could find a non-paywalled article which explains the implementation better. (http://dl.acm.org/citation.cfm?id=1768593)

Adrian
1

What about storing it in SQLite?

SQLite:

  • has support for up to 2 terabytes of data,
  • supports SQL queries,
  • is a great replacement for storing in-app data,
  • can handle ~100k visits per day (if used for an average web application),

So it may be worth taking a closer look at this solution, as it has proven to be an efficient way to store data within applications.
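As a rough sketch of how that could look (the table layout and the helper names store_node/load_node are hypothetical), each node can be pickled individually and fetched by primary key:

import pickle
import sqlite3

conn = sqlite3.connect('suffix_tree.db')
conn.execute('CREATE TABLE IF NOT EXISTS nodes (id INTEGER PRIMARY KEY, data BLOB)')

def store_node(node_id, node):
    # Pickle one node and upsert it under its integer id.
    blob = sqlite3.Binary(pickle.dumps(node, pickle.HIGHEST_PROTOCOL))
    conn.execute('INSERT OR REPLACE INTO nodes (id, data) VALUES (?, ?)', (node_id, blob))

def load_node(node_id):
    # Fetch and unpickle a single node by its primary key.
    row = conn.execute('SELECT data FROM nodes WHERE id = ?', (node_id,)).fetchone()
    return pickle.loads(bytes(row[0])) if row else None

# Remember to conn.commit() after building the tree (ideally once per batch).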

Tadeck
  • Thanks all for answering, after testing all of your options and after a lot of deliberation I've made my decision: I'm gonna use `ZODB`, a very interesting module, check it out if you like this kind of thing. – juliomalegria Dec 01 '11 at 19:08
1

Maybe you could combine cPickle with a bsddb database, which lets you store your pickled nodes in a dictionary-like object kept on the filesystem; you can then reopen the database later and retrieve just the nodes you need, with very good performance.

In a very simple way:

import bsddb   # Python 2 standard library; removed in Python 3
import cPickle # on Python 3 you would use pickle instead

# Open (or create) a B-tree-backed key/value store on disk.
db = bsddb.btopen('/tmp/nodes.db', 'c')

def store_node(node, key, db):
    """Pickle a single node and store it under the given string key."""
    db[key] = cPickle.dumps(node)

def get_node(key, db):
    """Load and unpickle a single node from disk by its key."""
    return cPickle.loads(db[key])
Cédric Julien
  • Thanks all for answering, after testing all of your options and after a lot of deliberation I've made my decision: I'm gonna use `ZODB`, a very interesting module, check it out if you like this kind of thing. – juliomalegria Dec 01 '11 at 19:09