
I'm working on a project in which we import 50k - 100k datapoints every day, located both temporally (YYYYMMDDHHmm) and spatially (lon, lat), which we then dynamically render onto maps according to the query parameters set by our users. We do use pre-computed clusters below a given zoom level.

Within this context and given the fact that we're in the process of selecting a database engine for our storage layer, I'm currently evaluating Cassandra and BigTable's variants.

Specifically, I'm trying to understand the difference between using composite partition keys in Cassandra vs. interleaved index keys in BigTable, such as the one GeoMesa uses.

As far as I understand, both these approaches can leverage COTS hardware and can be tuned to reduce hotspotting and maximize space-filling.

What are the logical steps I should follow in order to discriminate between the two? Even though I am planning on testing both approaches in the near future, I'd like to hear a more reasoned and educated approach.

Misha Brukman
Jacoscaz

1 Answer


GeoMesa actually supports both Cassandra and BigTable clones such as Accumulo. The Cassandra support is, at the time of writing, in an early phase. The README has a description of the indexing scheme.

Both implementations use Z2 or Z3 interleaved indexes, depending on whether the index is purely spatial or spatio-temporal. The BigTable clone indexing puts the full-resolution Z3 value into the primary key, so queries are just range scans over the sorted keys.

Cassandra requires that partition keys be explicitly enumerated (unless you're doing full table scans). Because of that fact, GeoMesa's Cassandra indexing uses composite keys to spread the information across both the partition key and the range key (the clustering key, in Cassandra's terms). The partition key is a coarse spatio-temporal key that buckets the world into NxN cells, and the range key is the full-resolution Z3 interleaved index. Queries are decomposed into an enumeration of the overlapping buckets (partition key) and the Z3 ranges within each bucket (range key). Having to enumerate the partition keys can cause a lot of network chattiness when satisfying a query, so choosing the bucket resolution well is key to reducing this chattiness.
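To make the two key layouts more concrete, here is a minimal Python sketch of the idea. It is not GeoMesa's code: the function names (`interleave`, `quantize`, `z3`, `bucket`, `buckets_for_bbox`), the 21 bits per dimension, the weekly time bin, and the grid size `n` are all assumptions chosen for illustration.

```python
def interleave(*coords: int, bits: int) -> int:
    """Interleave the low `bits` bits of each coordinate into one Z-order value."""
    z = 0
    n = len(coords)
    for bit in range(bits):
        for dim, c in enumerate(coords):
            z |= ((c >> bit) & 1) << (bit * n + dim)
    return z


def quantize(value: float, lo: float, hi: float, bits: int) -> int:
    """Map a value in [lo, hi] to an integer cell index in [0, 2**bits - 1]."""
    cells = 1 << bits
    idx = int((value - lo) / (hi - lo) * cells)
    return min(max(idx, 0), cells - 1)


def z3(lon: float, lat: float, t_frac: float, bits: int = 21) -> int:
    """Full-resolution key; t_frac is the position inside the time bin, in [0, 1]."""
    return interleave(
        quantize(lon, -180.0, 180.0, bits),
        quantize(lat, -90.0, 90.0, bits),
        quantize(t_frac, 0.0, 1.0, bits),
        bits=bits,
    )


def bucket(lon: float, lat: float, week: int, n: int = 16) -> tuple[int, int]:
    """Coarse partition key: (time bin, cell id) on an n x n world grid."""
    col = quantize(lon, -180.0, 180.0, 0) if n == 1 else min(int((lon + 180.0) / 360.0 * n), n - 1)
    row = 0 if n == 1 else min(int((lat + 90.0) / 180.0 * n), n - 1)
    return (week, row * n + col)


def buckets_for_bbox(min_lon, min_lat, max_lon, max_lat, week, n=16):
    """Enumerate every partition key whose cell overlaps the query box."""
    _, lo_cell = bucket(min_lon, min_lat, week, n)
    _, hi_cell = bucket(max_lon, max_lat, week, n)
    rows = range(lo_cell // n, hi_cell // n + 1)
    cols = range(lo_cell % n, hi_cell % n + 1)
    return [(week, r * n + c) for r in rows for c in cols]


# BigTable-style layout: the row key is just z3(...); a query is a set of key ranges.
# Cassandra-style layout: partition key = bucket(...), clustering key = z3(...);
# a query must list every bucket it touches, then scan Z3 ranges inside each one.
partitions = buckets_for_bbox(-10, -10, 10, 10, week=5, n=4)
```

With `n=4`, the small box above already touches four partitions. A finer grid spreads load more evenly but lengthens that list for every query, which is the chattiness trade-off described above.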

antf