5

How can R* Tree be implemented as a persistent (disk based) one? What is the architecture of the file for saving the R* Tree index or for saving leaf values?

Notes: In addition how insert, update and delete operations can be performed in such a persistent R* Tree?

Notes II: I have implemented an in-memory R-Tree with bulk load functionality. But I think that is totally irrelevant when we speak about disk-based ones.

Kaveh Shahbazian
  • 13,088
  • 13
  • 80
  • 139
  • 1
    Did you consider using a database? – Philipp Nov 28 '12 at 12:19
  • 1
    A couple of years ago we were using Oracle geospatial tools. But it had 2 problems: 1) it was slow for our work load and 2) at some point we needed to search in a set of geofences (polygons) to see if a given point is inside them or not. And the other reason is we are planing to move away from Oracle. In the mean time I am handling that search (and finding NN and routes and etc) all by an in-memory application that I wrote myself. My new problem is I wrote it as if my original data were read only so I load balanced trees into memory. But adding new items force me to re-balance the tree. – Kaveh Shahbazian Nov 28 '12 at 12:38
  • MongoDB is pretty good at geospatial indexing. – Philipp Nov 28 '12 at 12:53
  • 1
    Well, **just start** and ask more specifically when you get stuck. Right now, this is not really a question that can be answered. – Has QUIT--Anony-Mousse Nov 28 '12 at 13:03
  • @Anony-Mousse Thanks for comment. I have factored out my main question from other notes. It should seem more meaningful by now. Please comment; Thanks. – Kaveh Shahbazian Nov 28 '12 at 14:09
  • 1
    Not really. The question still is "Where should I start". Wherever you are, just continue from there! – Has QUIT--Anony-Mousse Nov 28 '12 at 15:59

3 Answers3

9

Architecture of the file

Well, it's pages (= blocks). The pages should have a multiple of the page size of the underlying storage, so probably 1kb or 8kb blocks. Each block has a number and can be references this way.

Directory pages store the bounding boxes of the children and their page numbers.

Child pages store the actual data objects.

Managing the tree

Well, in theory: whenever you modify a page in memory, you write the changes to disk. That's it.

In practise, you might want to use a cache for performance, and you might want to have transactions for keeping your tree consistent in case of an application crash.

On both of these things you can find a lot of literature in the field of RDBMS architecture.

A key benefit of the R*-tree is that it is a regular page-oriented tree as you would have them in database systems all over the place. If you have a good on disk B+-tree implementation, you can reuse most of your code for the R*-tree.

How to get started

To get started, you need to get used to disk-based data indexing, as it is done in classic RDBMS. I'd suggest starting with an on disk B-tree or B+-tree. Do allow deletions, because you need to think about managing deleted pages and all this.

Once you figured out the B-Tree on disk (and maybe spend some time on optimizing it!), doing an on-disk R-tree should be fairly obvious.

I havn't looked at the code, but this might be a good starting point: http://www.die-schoens.de/prg/ or some of the others linked in Looking for a disk-based B+ tree implementation in C++ or C

Community
  • 1
  • 1
Has QUIT--Anony-Mousse
  • 76,138
  • 12
  • 138
  • 194
  • Thanks for your answer. But now I am totally lost! Is there any step by step reference for understanding how one would actually do that? I have downloaded and read many R(*) Tree implementations in Java and some in C and C++ but I did not understand a line! I am sure that I am in a very wrong view (and never before had done something like it). – Kaveh Shahbazian Dec 02 '12 at 11:06
  • 1
    There is more than one way to it. And it's **just not that simple**. You need to manage empty pages which you will get after performing deletions, for example. Maybe you should start with implementing a B+-tree *on disk*. Do *not* start with an in-memory implementation. Work on disk from the very beginning. – Has QUIT--Anony-Mousse Dec 02 '12 at 12:49
  • Unfortunately I have started with an in-memory R-Tree which I like much: built by bulk insert very fast (12 sec on my machine for 2`100`000 entries) and blazing fast search speed (under 1 micro sec) and I use this in production. Maybe that rotted my mind completely because I do not understand these codes at all! I have done what you said before and started from a B-Tree but that did not help neither! I can not factor out the idea of "STORAGE" from rest of the code :( – Kaveh Shahbazian Dec 02 '12 at 13:15
  • 2
    Sorry, but storage management is a lot of boring and tedious code. There is no one-liner that will just put it on disk. You have to face this reality and start working with low-level disk accesses. – Has QUIT--Anony-Mousse Dec 02 '12 at 13:38
4

If you need to have an on-disk R-Tree index, I would suggest using Spatialite or Postgis. Spatialite is lightweight and easy to embed in a standalone application. Alternatively, have you looked at the C# Spatial Index project?. I wrote an R-Tree implementation in Java a few years back and wouldn't recommend doing it if something already exists.

Shawn H
  • 1,227
  • 1
  • 12
  • 18
2

If you already have a main-memory implementation you may reuse the same code just add writes to the disk. You have to take into consideration page size and optimize tree nodes to fit in a page (you can read it in one go).

It would be better (performance wise) to have a snapshots of the main memory tree stored in the disk (snapshots could be taken when the tree is not under high pressure) versus writing every change into the disk.

In the question you specify that querying the tree is of higher importance thus you should be better with R*-tree since it minimizes the overlap between the tree nodes. However if your requirements would be focused on update operations (insert/delete) I would suggest to take a look at Supporting frequent updates in R-trees: a bottom-up approach paper.

Robertas
  • 1,164
  • 3
  • 11
  • 26
  • I need good performance with reads. Data will not change much but when it does I need the tree balance itself fast. Currently I am constructing the tree using all of my data (not by inserting nodes). – Kaveh Shahbazian Nov 28 '12 at 14:50