2

What kind of data structure would you use to implement a big persistent array-like structure (more than 10 million elements)?

The interface of the data structure should support the following operations:

YType get(IdxType idx) // random access  
void insertIdx(IdxType idx, YType yValue) // random insert   
void deleteIdx(IdxType idx) // random delete  

Let the index IdxType be a scalar unsigned value such as unsigned long long and YType a scalar value or a struct.
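
For concreteness, the interface could be sketched as a C++ abstract class; the class name and the concrete YType choice here are illustrative, not part of the question:

#include <cstdint>

using IdxType = unsigned long long;
using YType   = std::int64_t;   // any scalar or struct would do

struct PersistentArray {
    virtual ~PersistentArray() = default;
    virtual YType get(IdxType idx) const = 0;               // random access
    virtual void insertIdx(IdxType idx, YType yValue) = 0;  // random insert
    virtual void deleteIdx(IdxType idx) = 0;                // random delete
};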

The complexity of these operations should never exceed O(log n), and it would be very efficient if the complexity of random access dropped to O(1) after a while, since for many use cases read operations will dominate.

These requirements rule out in-memory data structures such as vectors or lists that are simply written to disk, since insertion into a vector is O(n) and random access into a list is also O(n).

EDIT:
Please note that a hash map or a tree-like data structure keyed by the index does not fulfill the requirements. The value stored at a given index can change: when a value is inserted at some index, all subsequent indexes come to refer to different values, exactly as in an array. For example, inserting x at index 1 of [a, b, c] yields [a, x, b, c], so the element formerly at index 1 is now at index 2.

mgr
  • This should be on Programmers.SE – asheeshr Jan 04 '13 at 16:31
  • By persistent, you mean stored in a non-volatile medium (e.g. hard disk)? [Persistent data structure](http://en.wikipedia.org/wiki/Persistent_data_structure) can also mean something else. –  Jan 04 '13 at 16:34
  • how "sparse" is this supposed to be? I.e. - do you expect this to be 10% full? 50% full? 90% full? – p.marino Jan 04 '13 at 16:38
  • Yes, by "persistent" I mean stored in a non-volatile medium like a hard disk – mgr Jan 04 '13 at 16:38
  • About the sparseness: I would rather focus on time performance and do not care too much about space. – mgr Jan 04 '13 at 18:25
  • so it looks like such a data structure does not exist. – mgr Jan 07 '13 at 14:20

2 Answers

0

You might want to use an order statistic tree, a tree data structure that supports O(log n) insertions, deletions, and lookups. Typically, order statistic trees are used to store sorted data, though it is relatively straightforward to modify them to store an unsorted sequence.

Intuitively, an order statistic tree is a binary tree in which each node stores the number of nodes in its left subtree. That way, you can recursively look up the nth element as follows (a code sketch follows the list):

  1. If the tree is null, there is no element.
  2. If n is less than or equal to the number of elements in the left subtree (call it k), recursively look up the nth element in the left subtree.
  3. If n = k + 1, the root of the tree is the nth element.
  4. Otherwise, look up the (n - k - 1)st element in the right subtree.
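
In code, that lookup might look like this minimal sketch (0-based idx to match get(IdxType idx) from the question; the Node struct and its field names are made up for illustration):

struct Node {
    unsigned long long leftCount; // number of nodes in the left subtree (k above)
    long long value;              // YType; a scalar for simplicity
    Node* left;
    Node* right;
};

// Return the node holding the idx-th element (0-based), or nullptr if out of range.
Node* get(Node* root, unsigned long long idx) {
    while (root != nullptr) {
        if (idx < root->leftCount) {
            root = root->left;               // target is in the left subtree
        } else if (idx == root->leftCount) {
            return root;                     // root itself is the idx-th element
        } else {
            idx -= root->leftCount + 1;      // skip the left subtree and the root
            root = root->right;
        }
    }
    return nullptr;
}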

To insert before the nth node, use this same procedure to find the nth node and insert the new value as that node's inorder predecessor (go left, then keep going right until you walk off the tree, and insert the node there). Then walk back up the tree from the new node and increment the stored left-subtree count of every node on the insertion path whose left subtree gained the new element.
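
A recursive version of the same idea, reusing the hypothetical Node struct above; rebalancing is omitted here, and a real implementation would also have to fix up leftCount inside rotations:

Node* insertIdx(Node* root, unsigned long long idx, long long yValue) {
    if (root == nullptr)
        return new Node{0, yValue, nullptr, nullptr};
    if (idx <= root->leftCount) {
        ++root->leftCount;                   // the new element lands in the left subtree
        root->left = insertIdx(root->left, idx, yValue);
    } else {
        root->right = insertIdx(root->right, idx - root->leftCount - 1, yValue);
    }
    return root;
}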

To prevent the tree from becoming too imbalanced, you can use any standard tree-balancing scheme (AVL, red/black, etc.) to keep the height at O(log n). This gives O(log n) performance for all the operations on the data structure.

Hope this helps!

templatetypedef
  • how would you persist this tree on disk? And if you have few insert and delete operations but many reads, is there a way to improve the complexity for read operations? – mgr Jan 04 '13 at 17:34
  • You could persist the tree on disk by replacing each pointer with an offset into the file at which the node occurs, for example. Overall, you'll need at most O(log n) disk reads/writes per insertion, lookup, or access, which isn't too bad. If you wanted to cut back on disk accesses, you could consider adjusting this approach to work with B-trees rather than binary trees. Since B-trees are specifically designed to minimize disk reads and writes, this could be much, much faster. However, you'd probably have to do some more complex logic to maintain the subtree sizes. – templatetypedef Jan 04 '13 at 17:40
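
For illustration, fixed-size on-disk records addressed by byte offset might look like this (a sketch; the field names are invented, and offset 0 stands in for "null"):

struct DiskNode {
    unsigned long long leftCount;   // order-statistic count, as above
    long long          value;       // YType payload
    unsigned long long leftOffset;  // byte offset of the left child in the file, 0 if none
    unsigned long long rightOffset; // byte offset of the right child, 0 if none
};

// Each lookup or insertion then costs O(log n) disk reads: read the record at
// an offset, compare idx against leftCount, and seek to the child's offset.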
0

Depends on whether you need concurrent access to it...

No

If you just need one client application accessing it at any given time, use an in-memory structure such as a rope (or even just a hash table with the "array" index as the key), which provides a good balance between access and modification complexity. Once you are done with it, serialize it to a simple file. RAM is cheap and plentiful these days, and 10 million elements should fit right into it unless each element is really big.
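
As a concrete example, GCC ships a non-standard rope in <ext/rope>; a minimal usage sketch, assuming that extension is available:

#include <ext/rope>
#include <iostream>

int main() {
    __gnu_cxx::crope r;        // rope of char; __gnu_cxx::rope<T> exists for other types
    r.append("ACE");
    r.insert(1, "B");          // random insert in O(log n): "ABCE"
    r.insert(3, "D");          // "ABCDE"
    r.erase(4, 1);             // random delete: "ABCD"
    std::cout << r[2] << '\n'; // random access: prints 'C'
}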

Yes

If you need concurrent access, transactions, and the like, use a database. Your "array" will physically be a B-Tree whose key is the array index. If your DBMS supports it, use clustering to eliminate the "unnecessary" table heap and keep everything in the B-Tree. Something like this:

CREATE TABLE THE_ARRAY (
    IDX INT PRIMARY KEY, -- "INDEX" is a reserved word, hence IDX
    DATA VARCHAR(50)     -- Whatever...
) ORGANIZATION INDEX

(ORGANIZATION INDEX is Oracle-specific syntax for what is called "clustering" in some other DBMSes; use whatever syntax is appropriate for your DBMS.)

The insertion will be O(N) (since you'll need to update all the indexes after the inserted one), and the lookup will be O(log N). Also, some DBMSes support hash indexes, which should allow an O(1) lookup but could be worse than the B-Tree for insertion.

Branko Dimitrijevic
  • Hash tables and B-Trees like THE_ARRAY do not fulfill the requirements; see the edit. – mgr Jan 04 '13 at 18:19