We are making the move from MySQL to a globally distributed NoSQL solution because we have hit MySQL's performance ceiling. One option we are considering is Cassandra.

Our rows are small (6 fields, ~100 bytes per row), but we need to store 250 million of them. At most, our searches will return 1000 rows at a time, filtered on 2 fields.
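
Roughly, in CQL terms (the names below are placeholders, not our real schema):

```
-- Placeholder schema: 6 small fields, ~100 bytes per row, 250M rows.
CREATE TABLE records (
    id    bigint PRIMARY KEY,
    key_a text,
    key_b text,
    val_1 int,
    val_2 int,
    val_3 text
);

-- Typical search: filter on 2 fields, at most 1000 rows back.
-- In Cassandra this filter would need key_a/key_b in the primary key
-- (or secondary indexes), which is the data-modelling question below.
SELECT * FROM records WHERE key_a = ? AND key_b = ? LIMIT 1000;
```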

I have been reading a lot about wide rows, but I am not sure our data model will work.

Is Cassandra suitable for storing this type of data?

Andrew
  • Remember that Cassandra was used to store tweets, short messages stored within a few columns in a single table. This sounds like a pretty good scenario for using Cassandra. (not posting as an answer because I want others' views) – Lyuben Todorov Aug 15 '13 at 13:10
  • @LyubenTodorov I don't think the tweets were stored the way you seem to think. You can check the sample `twissandra` project (https://github.com/eevans/twissandra): they are actually stored with the user as the row key and the tweet ids as columns. – lpiepiora Aug 15 '13 at 15:01
  • 1
    @lpiepiora Well, they have a few columns (cql talk, if we switch to thrift i mean collection of columns) Also it is in a single table (or column family) `CREATE TABLE tweets (tweetid uuid PRIMARY KEY, username text, body text);` I see 3 columns in 1 table, but thanks for your comment. – Lyuben Todorov Aug 15 '13 at 17:10
  • @LyubenTodorov Correct, but I don't think they would try to access those tweets in bulk. They construct a `userline` for that, where they use `PRIMARY KEY (username, tweetid)`, which Cassandra stores with `username` as the row key and `tweetid` as the column name; a column slice then gives quick access to recent tweets (in fact I didn't look through the sources, so these are all my assumptions). – lpiepiora Aug 15 '13 at 17:26
  • 1
    To avoid hotspots, you may have to use RP and since the rows are stored randomly across your data centers, cassandra may have to contact more than one data center to retrieve, your rows. Getting 1000 rows using this approach is a bad idea. Design your data model in a way that each query will get data from a single row but with care not to go too wide as the whole row must be stored on the same single disk. – qualebs Aug 15 '13 at 19:31
  • @qualebs Thanks for the response. Our data is minimal (~100 bytes per entry), so 1000-5000 columns in a wide row would be fine on a single disk. Restructuring to remove duplicated data would bring this down by 50%, so disk space is not a problem. – Andrew Aug 16 '13 at 05:21

1 Answer

I think it would be better if you design your data so that it is stored as wide rows.

With the new CQL3 capabilities, the data can still look to you like small rows, while Cassandra organizes it as wide rows internally. I don't think iterating over individual rows is the most efficient approach. I find this article pretty explanatory on the subject: http://www.datastax.com/dev/blog/thrift-to-cql3.
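
A minimal sketch of what that looks like (table and column names are made up here): in CQL3, a clustering column turns many logical rows into one wide storage row per partition key.

```
-- Hypothetical example: names below are illustrative, not from the question.
CREATE TABLE measurements (
    sensor_id   text,      -- partition key: one wide storage row per sensor
    recorded_at timeuuid,  -- clustering column: part of the on-disk column name
    value       int,
    PRIMARY KEY (sensor_id, recorded_at)
);

-- All entries for one sensor_id live in the same wide row, so this slice
-- is a single contiguous read on one replica:
SELECT * FROM measurements
WHERE sensor_id = 's-42'
ORDER BY recorded_at DESC
LIMIT 1000;
```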

Maybe you could shed some light on what your data model looks like, more or less? When dealing with Cassandra you have to think first about how you would like to query your data, and very often denormalize.
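
Purely as an illustration (the schema is invented, since the question doesn't show one), query-first modelling typically means writing the same data into one table per access path:

```
-- Hypothetical denormalization: the same events stored twice,
-- keyed once per query the application needs.
CREATE TABLE events_by_user (
    username text,
    event_id timeuuid,
    payload  text,
    PRIMARY KEY (username, event_id)
);

CREATE TABLE events_by_day (
    day      text,
    event_id timeuuid,
    payload  text,
    PRIMARY KEY (day, event_id)
);

-- The application inserts into both tables on every write; each SELECT
-- then reads exactly one partition instead of needing a join or an index.
```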

lpiepiora
  • Both this response and the comments on my question provide the answer. The data will always be accessed using 3 fields of a 5-field key, with a maximum of 5000 possible results. Storing the 5000 rows in a wide-row format with super columns will work perfectly. The missing element for me was that, unlike MongoDB, you can update a single column in a row rather than the entire row itself. I was concerned about transactions and locking, but Cassandra apparently does not need them. – Andrew Aug 16 '13 at 05:15
  • @Andrew Please consider using composite columns instead of super columns. You can check http://stackoverflow.com/questions/11915255/why-are-super-columns-in-cassandra-no-longer-favoured for the reasons why. – lpiepiora Aug 16 '13 at 13:28
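
In CQL3 terms, the composite columns suggested above fall out of a compound primary key. A rough sketch for the access pattern Andrew describes, querying 3 fields of a 5-field key (all names here are invented):

```
-- Hypothetical mapping: the 3 fields always known at query time form the
-- partition key; the remaining 2 key fields become clustering (composite) columns.
CREATE TABLE records (
    key_a   text,
    key_b   text,
    key_c   text,
    key_d   text,
    key_e   text,
    payload text,
    PRIMARY KEY ((key_a, key_b, key_c), key_d, key_e)
);

-- Returns the up-to-5000 matching entries from a single wide storage row:
SELECT * FROM records WHERE key_a = ? AND key_b = ? AND key_c = ?;
```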