I am thinking of building an application that uses Cassandra as its data store but has low-latency requirements. I am aware of EmbeddedCassandraService from this blog post.

Is the following implementation possible, and what are the known pitfalls (defects, functional limitations)?

1) Run Cassandra as an embedded service, persisting data to disk (durable).

2) Java application interacts with the local embedded service via one of the following. What are the pros and cons of each?

  • TMemoryBuffer (or something more appropriate?)
  • StorageProxy (what are the pitfalls of using this API?)
  • Apache Avro? (see question #5 below)

3) Java application interacts with remote Cassandra service ("backup" nodes) via Thrift (or Avro?).

4) A write counts as successful only if it succeeds on the local embedded Cassandra service and on at least one of the remote (non-embedded) Cassandra nodes. Is this possible? Is it possible to define a custom / complex consistency level?

5) Side question: Cassandra: The Definitive Guide mentions in several places that Thrift will ultimately be replaced with Avro, but it seems that's not the case just yet?

As you might guess, I am new to Cassandra, so any pointers to specific documentation pages (not the wiki homepage) or sample projects would be appreciated.

noahlz
  • I think the Thrift/Avro information is out-of-date. Most of the recent work is on the CQL interface, which initially ran through Thrift, but I think is moving over to a binary protocol - see https://issues.apache.org/jira/browse/CASSANDRA-2478 – DNA Oct 10 '12 at 19:58
  • As in: Thrift is here to stay? – noahlz Oct 10 '12 at 20:00
  • 1
    More like: Avro is not coming. I imagine Thrift will stay around as a lot of existing deployments rely on it, but newer work will probably favour CQL Binary. Also, could you clarify what you mean by "low latency" - how low? Finally, bear in mind that "write to the local ... service" doesn't entirely make sense, because in a Cassandra cluster, data is assigned to node(s) according to the [Partitioner](http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/) - it isn't necessarily stored on the node you actually connect to, whether embedded or not. – DNA Oct 10 '12 at 20:06
  • Low-latency = sub-millisecond (< 1ms) – noahlz Oct 10 '12 at 20:11
  • What kind of durability guarantees (if any) do you need? In other words, could you use a primarily in-memory system like Redis? See also http://stackoverflow.com/questions/1316852/alternative-to-memcached-that-can-persist-to-disk – DNA Oct 10 '12 at 20:19
  • Yes, needs to persist to disk. Updated question. – noahlz Oct 10 '12 at 20:28

1 Answer

Unless your entire database is sitting on the local machine (i.e. a single node), you gain nothing from this configuration. Cassandra shards your data across the cluster, so (as mentioned in one of the comments) your writes will frequently go to another node that owns the data. Presuming you write with a consistency level of at least ONE, your call will block until that other node acknowledges the write. This negates any benefit of talking to the embedded instance, since you pay some network latency anyway.
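To make the ownership point concrete, here is a minimal sketch of a token ring. It is a toy model, not Cassandra's actual partitioner code: the class `ToyRing`, the node names, and the choice to derive each node's token by hashing its name are all assumptions made for illustration (a real cluster assigns tokens through configuration). It hashes keys with MD5, as RandomPartitioner does, and assigns each key to the first node whose token is at or after the key's token, wrapping around the ring.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

/** Toy token ring for illustration only; not Cassandra's real classes. */
public class ToyRing {
    // Maps each node's token to the node's name, sorted by token.
    private final TreeMap<BigInteger, String> ring = new TreeMap<>();

    /** Hashes a string to a non-negative token using MD5. */
    public static BigInteger token(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            return new BigInteger(1, md.digest(s.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Assumption: the node's token is the hash of its name. */
    public void addNode(String name) {
        ring.put(token(name), name);
    }

    /** Owner = first node with token >= key token, wrapping to the start. */
    public String ownerOf(String key) {
        Map.Entry<BigInteger, String> e = ring.ceilingEntry(token(key));
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        ToyRing r = new ToyRing();
        r.addNode("embedded-local");
        r.addNode("remote-1");
        r.addNode("remote-2");
        int local = 0;
        int total = 10000;
        for (int i = 0; i < total; i++) {
            if (r.ownerOf("key-" + i).equals("embedded-local")) {
                local++;
            }
        }
        System.out.println(local + " of " + total + " keys are owned by the embedded node");
    }
}
```

Running `main` counts how many of 10,000 keys land on the embedded node. Every key owned by a remote node is a write that has to cross the network, no matter which node the client connects to.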

rs_atl
  • So I can't "force" Cassandra to write to the embedded instance as its primary, and then have a quorum for "backup"? And I presume there is no API for devising your own custom consistency level? Basically, Cassandra doesn't fit my needs and perhaps I should consider an alternative store... – noahlz Oct 11 '12 at 14:11
  • 1
    You can't force it because it may not own the key you're writing. If you can shard your transactions such that you're only writing keys that it owns, then perhaps this could work. But that seems tenuous. Perhaps you should do your blocking write to some in-memory store to satisfy your latency requirement, then asynchronously write to Cassandra. – rs_atl Oct 11 '12 at 14:37
  • And there is no API for "custom" Consistency levels. Thanks! – noahlz Oct 11 '12 at 15:50
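
The suggestion in the comments (blocking write to an in-memory store, then an asynchronous write to Cassandra) could be sketched roughly as follows. This is only an illustration under stated assumptions: `WriteBehindStore` and `durableSink` are hypothetical names, and the sink is a plain callback standing in for whatever Cassandra client call the application would actually use. The caller's write returns as soon as the value is in memory, so anything not yet flushed is lost if the process dies before the background write completes.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.BiConsumer;

/** Write-behind sketch: fast in-memory write, durable write in the background. */
public class WriteBehindStore {
    private final ConcurrentHashMap<String, String> memory = new ConcurrentHashMap<>();
    // A single thread keeps background writes in submission order.
    private final ExecutorService flusher = Executors.newSingleThreadExecutor();
    // Hypothetical stand-in for the real durable write (e.g. a Cassandra client call).
    private final BiConsumer<String, String> durableSink;

    public WriteBehindStore(BiConsumer<String, String> durableSink) {
        this.durableSink = durableSink;
    }

    /**
     * Stores the value in memory immediately and schedules the durable write.
     * The returned future completes once the sink has accepted the value.
     */
    public CompletableFuture<Void> put(String key, String value) {
        memory.put(key, value);
        return CompletableFuture.runAsync(() -> durableSink.accept(key, value), flusher);
    }

    /** Reads are served from memory. */
    public String get(String key) {
        return memory.get(key);
    }

    /** Stops accepting new background writes; already queued ones still run. */
    public void shutdown() {
        flusher.shutdown();
    }
}
```

A caller that needs low latency can ignore the returned future, while a caller that needs confirmation of the durable write can wait on it with `join()`, at the cost of the network round trip the pattern was meant to avoid.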