Yes.
Of course it would be nice to have a PH-tree in ELKI, to allow others to experiment with it. We want ELKI to become a comprehensive tool; it has R-trees, M-trees, k-d-trees, cover-trees, LSH, iDistance, inverted lists, space-filling-curves, PINN, ...; there are working-but-not-cleaned-up implementations of X-tree, rank-cover-trees, bond, and some more.
We want to enable researchers to study which index works best for their data easily, and of course it would be nice to have PH-tree, too. We also try to push the limits of these indexes, e.g. when supporting other distance measures than Euclidean distance.
The effort depends on how experienced you are with coding; ELKI uses some well-optimized data structures, but that means we are not using standard Java APIs in a number of places because of performance. Adding the cover tree took me about one day of work, for example (and it performed really nicely). I'd assume a more flexible (but also more memory intensive) k-d-tree would be a similar amount of work. I have not studied the PH-tree in detail, but I'd assume it is slightly more effort than that.
My guts also say that it won't be as fast as advertised. It appears to be a prefix-compressed quadtree. In my experiments bit-interleaving approaches such as required for Hilbert curves can be surprisingly expensive. It also probably only works for Minkowski metrics. But you are welcome to prove me wrong. ;-)
You are always welcome to as for help at the mailing list, or here.
I would do an in-memory variant first, to fully understand the index. Then benchmark it to identify optimization potential, and debug it. Until then, you may not have figured out all the corner cases, such as duplicate points handling, degenerated data sets etc.
Always make on-disk optional. If your data fits into memory, a memory-only implementation will be substantially faster than any on-disk version.
When contributing to ELKI, please:
- avoid external dependencies. We've had bad experience with the quality of e.g. Apache Commons, and we want to have the package easy to install and maintain, so we want to keep the .jar dependencies to a minimum (also, having tons of jars with redundant functionality comes at a performance cost). I'm inclined to only accept external dependencies for optional extension modules.
- do not copy code from other sources. ELKI is AGPL-3 licensed, and any contribution to ELKI itself should be AGPL-3 licensed, too. In some cases it may be possible to include e.g. public domain code, but we need to keep these to a minimum. We could probably use Apache licensed code (in an external library), but shouldn't mix them. So from a quick look, you are not allowed to copy their source code into ELKI.
If you are looking for data mining project ideas, here is a list of articles/algorithms that we would love to see contributed to ELKI (we keep this list up to date for student implementation projects):
http://elki.dbs.ifi.lmu.de/wiki/ProjectIdeas