0

I'm trying Cassandra to replace mysql at a large dataset I have (2.5Tb/5 billion rows) that I can't scale more in a single server.

I insert/update a few million rows every hour. Currently, I'm inserting and querying one by one in cassandra because I don't know which partition has the data, and grouping them seem to be slower. But one by one, I can't match the speed of a single mysql server even with 3 cassandra nodes.

In mysql, I can batch because I know it stores all in the same server. Is it possible, using the value of the primary key, to determine the partition on client side, so I can group the queries more effectively with BATCH or SELECT..IN?

I mean, given a group of PKs like 1, 2, 3, 4, 5, 6 ... and N servers, i'd like to know that say, rows 1 3, 5 are in the same partition, so I can group then in my queries. Is this possible with cassandra?

Natan
  • 4,686
  • 5
  • 30
  • 48
  • Check if [this](https://stackoverflow.com/questions/46928762/determine-node-of-a-partition-in-cassandra/) helps – Horia Jun 08 '18 at 06:38
  • @Horia I don't think so, because this depends on the nodetool or an instance, and that would kill the performance. What I need is more the algorithm or some way for the driver itself do it. – Natan Jun 08 '18 at 11:27

1 Answers1

1

If you're performing queries with WHERE on partition key, then most of time drivers take care of most effective routing of data to replicas that have this data (only if you didn't change load balancing policy - by default all drivers use so-called TokenAware policy) by calculating token for given partition key, and find replica(s) for it.

If you need to fetch multiple entries, then running N queries in parallel via async API & merging results on client side will be more effective than performing query with IN.

P.S. In Cassandra BATCH has slightly different semantic than in relational databases. Please check this documentation for recommended patterns.

Alex Ott
  • 80,552
  • 8
  • 87
  • 132
  • I didn't mark this before, but thanks. That is right. My problem was that I was connecting to the cluster through NAT. After enabling debug on the driver, I could see it was trying to connect to the other instances through their private addresses and couldn't, so it was routing everything through the same instance. After adding a translation to the driver, it connected to all instances and routed properly directly to each instance. I ended up using a simple select with prepared statement and it was much faster than the IN clause. – Natan Jun 19 '18 at 16:59