Adding PH-Tree to ELKI

Question

I'm considering adding the PH-tree to ELKI. I couldn't find any tutorials for examples for that, and the internal architecture is not fully obvious to me at the moment.

Do you think it makes sense to add the PH-tree to ELKI?
How much effort would that be?
Could I get some help?
Does it make sense to implement only an in-memory version, as done for the kd-tree (as far as I understand)?

Some context: The PH-tree is a spatial index that was published at SIGMOD'14: paper, Java source code is available here. It is a bit similar to a quadtree, but much more space efficient, doesn't require rebalancing and scales quite well with dimensionality. What makes the PH-tree different from the R*-Tree implementations is that there is no concept of leaf/inner nodes, and nodes don't will not directly map to pages. It also works quite well with random insert/delete (no bulk-loading required).

Erich Schubert · Accepted Answer · 2015-07-29T09:33:20.743

Yes.

Of course it would be nice to have a PH-tree in ELKI, to allow others to experiment with it. We want ELKI to become a comprehensive tool; it has R-trees, M-trees, k-d-trees, cover-trees, LSH, iDistance, inverted lists, space-filling-curves, PINN, ...; there are working-but-not-cleaned-up implementations of X-tree, rank-cover-trees, bond, and some more.

We want to enable researchers to study which index works best for their data easily, and of course it would be nice to have PH-tree, too. We also try to push the limits of these indexes, e.g. when supporting other distance measures than Euclidean distance.

The effort depends on how experienced you are with coding; ELKI uses some well-optimized data structures, but that means we are not using standard Java APIs in a number of places because of performance. Adding the cover tree took me about one day of work, for example (and it performed really nicely). I'd assume a more flexible (but also more memory intensive) k-d-tree would be a similar amount of work. I have not studied the PH-tree in detail, but I'd assume it is slightly more effort than that. My guts also say that it won't be as fast as advertised. It appears to be a prefix-compressed quadtree. In my experiments bit-interleaving approaches such as required for Hilbert curves can be surprisingly expensive. It also probably only works for Minkowski metrics. But you are welcome to prove me wrong. ;-)

You are always welcome to as for help at the mailing list, or here.

I would do an in-memory variant first, to fully understand the index. Then benchmark it to identify optimization potential, and debug it. Until then, you may not have figured out all the corner cases, such as duplicate points handling, degenerated data sets etc.

Always make on-disk optional. If your data fits into memory, a memory-only implementation will be substantially faster than any on-disk version.

When contributing to ELKI, please:

avoid external dependencies. We've had bad experience with the quality of e.g. Apache Commons, and we want to have the package easy to install and maintain, so we want to keep the .jar dependencies to a minimum (also, having tons of jars with redundant functionality comes at a performance cost). I'm inclined to only accept external dependencies for optional extension modules.
do not copy code from other sources. ELKI is AGPL-3 licensed, and any contribution to ELKI itself should be AGPL-3 licensed, too. In some cases it may be possible to include e.g. public domain code, but we need to keep these to a minimum. We could probably use Apache licensed code (in an external library), but shouldn't mix them. So from a quick look, you are not allowed to copy their source code into ELKI.

If you are looking for data mining project ideas, here is a list of articles/algorithms that we would love to see contributed to ELKI (we keep this list up to date for student implementation projects):

http://elki.dbs.ifi.lmu.de/wiki/ProjectIdeas

Thanks. Actually, I'm the original author of the paper and PH-tree, so I hope it holds up to the results that I published :-). My idea was to contribute the original code, which is already quite optimised. I'll be happy to dual-license it with AGPL-3 (dual licensing is often discouraged, but it seems to be the easiest option). I actually started implementing it into ELKI, but I have trouble understanding some of the API. But I can ask the details on the mailing list. — TilmannZ, Jul 30 '15 at 09:20
The original author can of course relicense his code as AGPL-3, and donate it to ELKI. I would prefer to get a relicensed contribution instead of a dual-licensed, so that there is less confusion (and less exceptions to keep track of). MinimalisticKDTree and CoverTree (in git) are probably the easiest indexes to understand. KD-tree shows how to make an index for Euclidean distance only; the CoverTree relies on metric properties only, and doesn't see the actual data. — Erich Schubert, Aug 03 '15 at 07:54
I don't want to give up the APL, because it will be useful/required in another project. I wonder whether I could create a fork and relicense that... — TilmannZ, Aug 04 '15 at 09:19
As author, for all I know, you are free to make a copy of the source and stick a different license to it; keeping one as APL, and another one as AGPL-3. They are a license to do certain things, not exclusive marketing deals. — Erich Schubert, Aug 04 '15 at 13:09

Adding PH-Tree to ELKI

1 Answers1

Yes.