
I want to keep a large ordered list (millions of elements) in Google App Engine datastore. Fast insertion is required.

The simplest way would be to add an indexed property (or column) "order_num" representing the order. For example, a list [A, B, C] would be stored like this:

content   order_num
--------------------
   A         1
   B         2
   C         3  

However, this doesn't give you fast insertion. For example, if I want to insert X after A, I have to renumber B and C to "make room" for X, i.e., B becomes 3, C becomes 4, and X gets 2. This would be a disaster if I have millions of elements.

I found a feasible solution called the "gap approach", described here. The idea is to keep a gap between adjacent elements, like this:

content   order_num
--------------------
   A         1000
   B         2000
   C         3000

When I want to insert X after A, I can simply add X with its order_num set to (1000 + 2000) / 2 = 1500, no renumbering required.
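In code, the insertion step would look something like this (a minimal sketch; the GAP constant and the helper names are just ones I made up for illustration):

    GAP = 1000  # initial spacing between adjacent order_num values

    def order_for_append(last_order):
        # appends at the end leave a full gap for future inserts
        return last_order + GAP

    def insert_between(before, after):
        """Return an order_num strictly between two neighbours,
        or None when the gap is exhausted and renumbering is needed."""
        mid = (before + after) // 2
        if mid == before:  # no integer left between the two
            return None
        return mid

    insert_between(1000, 2000)  # 1500, as in the example above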

But as insertions accumulate, these gaps get smaller, and renumbering may eventually be required. My question is: is there any known strategy for renumbering? And for deciding the size of the gaps?

Thanks!

UPDATE

Here's more detail. Say I have a list of elements in the database, and every element has an integer property named my_num. The value of my_num is an arbitrary positive integer. Suppose I have a list [A, B, C, D], and their my_num values are

 element    my_num   
---------------------
   A          5        
   B          2
   C         10
   D          7

Now, let's define an accum() operator:

accum(i) = element[0].my_num + element[1].my_num + ... + element[i].my_num

So the accum values for each element are

 element    my_num   accum 
----------------------------
   A          5        5
   B          2        7
   C         10       17
   D          7       24

But the accum values probably should NOT be stored in the database, because the list is constantly updated and storing them would make every insertion expensive; it's better to keep insertion fast.

I want to design a query which input is an integer x:

query(x) = element[i], where accum(i-1) < x <= accum(i) (taking accum(-1) = 0)

For example, query(11) is C and query(3) is A.

Is it possible to design a datastore schema that makes this query fast? Or is the only way to accumulate the values one by one at query time, which is what I'm planning to do?
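The accumulate-at-query-time fallback I have in mind is just a linear scan (a sketch in plain Python, with the list held as (name, my_num) pairs in list order):

    def query(x, elements):
        # elements: [(name, my_num), ...] in list order
        running = 0  # accum so far
        for name, my_num in elements:
            running += my_num
            if x <= running:
                return name
        return None  # x is larger than the total

    elements = [('A', 5), ('B', 2), ('C', 10), ('D', 7)]
    query(11, elements)  # 'C'
    query(3, elements)   # 'A'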

  • Do you need good *read* performance too? Are you going to be processing this list somehow later in the background or do you need to return sections of your list quickly to your users? – Chris Farmiloe Apr 13 '11 at 16:05
  • The list is for background processing. – eliang Apr 13 '11 at 16:14
  • Is there no data that you could use to define the order? If it depends on a time, a price, a heuristic value, you'd be able to insert that? – boisvert Apr 14 '11 at 06:56
  • @boisvert No, that's why it baffles me. – eliang Apr 14 '11 at 15:56
  • Have you implemented @boisvert's solution in your project? Can you give some details about your final implementation? – cryss May 12 '13 at 17:45

3 Answers

11

Alternatively, could you use decimals, or a string?

content     order
-------------------- 
   A         'a' 
   B         'b' 
   C         'c'

Then to insert D between A and B, give it the order value 'aa'.

An algorithm for generating the strings is best shown for a binary string: if you want to insert something between "1011" and "1100", do the following:

  • Avalue = 1+0*(1/2)+1*(1/4)+1*(1/8) = 1.375
  • Bvalue = 1+1*(1/2)+0*(1/4)+0*(1/8) = 1.5

The average of the two is 1+0*(1/2)+1*(1/4)+1*(1/8)+1*(1/16) = 1.4375, so the new string is "10111":

content     order
-------------------- 
   A         '1011' 
   new!      '10111' 
   B         '1100' 
   C         '1101'

Since you always average two values, the average always has a finite binary expansion, and hence a finite string. The scheme effectively defines a binary tree.

As you know, binary trees don't always turn out well balanced; in other words, some strings will end up much longer than others after enough insertions. To keep them short, you could use any even number base - it has to be even so that the expansion of the average of any two values is finite.

But whatever you do, strings will probably become long, and you'll have to do some housekeeping at some point, cleaning up the values so that the string space is used efficiently. What this algorithm gives you is the certainty that between cleanups, the system will keep ticking along.
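A minimal Python sketch of the averaging step, assuming the convention from the example (an implied binary point after the first digit, so '1011' reads as 1 + 0/2 + 1/4 + 1/8):

    def midpoint(a, b):
        """Return the binary order string halfway between a and b."""
        n = max(len(a), len(b))
        # zero-pad to equal length; trailing zeros don't change the value
        total = int(a.ljust(n, '0'), 2) + int(b.ljust(n, '0'), 2)
        # read the sum as an (n+1)-digit string: its value is exactly
        # (value(a) + value(b)) / 2, since an (n+1)-digit string s
        # is worth int(s, 2) / 2**n
        s = format(total, '0%db' % (n + 1))
        return s.rstrip('0') or '0'  # drop redundant trailing zeros

    midpoint('1011', '1100')  # '10111', as in the example above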

– boisvert
  • Floating point numbers have limits, too; you can't split decimals infinitely. Using a string as the index solves that, but as the number of elements grows, the strings become longer. Long strings could result in bad performance and wasted disk space. – eliang Apr 13 '11 at 15:56
  • @eliang, the concern about wasted disk-space is probably not a good argument against this. With the 'gap method', if your data is dense enough that you aren't wasting a lot of space storing zeros, you're probably not getting the full benefit of using gaps in the first place (since you'll likely be running over all your data adding another zero quite often). – Robert Kluin Apr 13 '11 at 17:46
  • @Robert You got the point. If I used string approach, are there any _known_ strategies to generate these order strings? Thanks! – eliang Apr 14 '11 at 05:52
  • @eliang, edited my answer to give such a strategy. Maybe the question should be retagged data-structures or algorithms if you want to pick the brains of folks more experienced in that area. – boisvert Apr 14 '11 at 07:41
  • @boisvert, I got the idea. Thanks! So it still needs some housekeeping work, like the gap approach needs renumbering? – eliang Apr 14 '11 at 15:45
  • @eliang - it may still need housekeeping work, because if you're unlucky, you'll end up with long strings. Say you keep adding to the top of your list. Above item 1 you get items 01, 001, 0001... After a few hundred you'll want to clean up your numbering. Depending on data type, you won't run out of space, but you'll get a wasteful process eventually. – boisvert Apr 14 '11 at 16:09
  • Also... 500 chars is the max length for an indexed char field on App Engine. Beyond that you have to use a text field, but those can't be indexed... so after lots of insertions you may hit a hard limit with this approach. I don't have a better one! – Anentropic May 29 '12 at 17:31
2

You probably want to consider using app-engine-ranklist, which uses a tree-based structure to maintain a rank order in the datastore.

Or, if you can describe your requirements in more detail, maybe we can suggest an alternative that involves less overhead.

– Nick Johnson
  • Spent some time looking into it. Great solution to rank problems. It provides fast query from score to rank and vice versa. But what I want is a different kind of query which I added in my original post. Thanks! – eliang Apr 14 '11 at 15:43
  • @eliang You can adapt ranklist's tree based approach for your needs. Or, if your dataset is small, accumulate results at runtime as you describe. Or, if it's big but infrequently updated, update results at write-time and store them persistently. – Nick Johnson Apr 15 '11 at 01:17
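To illustrate the tree-based accumulation Nick Johnson mentions, here is a minimal in-memory Fenwick (binary indexed) tree - not App Engine code, and my own sketch rather than anything from app-engine-ranklist, which uses its own tree structure:

    class Fenwick(object):
        def __init__(self, n):
            self.tree = [0] * (n + 1)  # 1-based

        def add(self, i, delta):
            # add delta to my_num of element i
            while i < len(self.tree):
                self.tree[i] += delta
                i += i & (-i)

        def accum(self, i):
            # accum(i): my_num[1] + ... + my_num[i]
            s = 0
            while i > 0:
                s += self.tree[i]
                i -= i & (-i)
            return s

        def query(self, x):
            # smallest i with accum(i) >= x (assumes 1 <= x <= total)
            pos, bit = 0, 1 << (len(self.tree).bit_length() - 1)
            while bit:
                nxt = pos + bit
                if nxt < len(self.tree) and self.tree[nxt] < x:
                    x -= self.tree[nxt]
                    pos = nxt
                bit >>= 1
            return pos + 1

    ft = Fenwick(4)
    for i, v in enumerate([5, 2, 10, 7], start=1):  # A, B, C, D
        ft.add(i, v)
    ft.query(11)  # 3, i.e. C
    ft.query(3)   # 1, i.e. A

Both accum and query run in O(log n); the open problem is mapping the tree nodes onto datastore entities, which is essentially what ranklist does for rankings.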
1

You could make a giant linked list, with each entity pointing to its predecessor in the list.

It would be extremely slow to traverse the list later, but that might be acceptable depending on how you are using the data, and inserting into the list would only ever be two datastore writes (one to update the insertion point and one for your new entity).

In the database, your linked list can be done like this:

value (PK)   predecessor
------------------------
  A              null
  B               A
  C               B

Then, to insert D after B, point D at B and re-point C at D:

value (PK)   predecessor
------------------------
  A              null
  B               A
  C               D
  D               B

Inserting is quick, but traversing will be slow indeed!
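A rough sketch of that insert with the Python ndb library - the Node model and insert_after helper are my own illustration, and a real version would want a transaction around the two writes to guard against concurrent inserts:

    from google.appengine.ext import ndb

    class Node(ndb.Model):
        # the entity's key id doubles as the value (the "PK" column above)
        predecessor = ndb.KeyProperty(kind='Node')  # None marks the head

    def insert_after(pred_key, new_value):
        # the current successor is whichever node points at pred_key
        successor = Node.query(Node.predecessor == pred_key).get()
        new_node = Node(id=new_value, predecessor=pred_key)
        new_node.put()                       # write 1: the new entity
        if successor is not None:
            successor.predecessor = new_node.key
            successor.put()                  # write 2: re-point old successor

    # e.g. insert D after B, as in the tables above:
    # insert_after(ndb.Key(Node, 'B'), 'D')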

– Chris Farmiloe
  • I thought of this method, but it needs too many disk accesses when traversing the list. The list will be processed on a per-element basis, so I think it's best to fetch many elements at once to reduce disk access. – eliang Apr 14 '11 at 05:47