
What's the best/most reliable method of estimating the space required in Cassandra? My cluster consists of 2 nodes (RHEL 6.5) running Cassandra 3.11.2. I want to estimate the average size each row in every table will take in my database so that I can plan capacity accordingly. I know about some methods, such as the nodetool status command, du -sh on the data directory, nodetool cfstats, etc. However, each of these gives a different value, so I'm not sure which one I should use in my calculations.
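For context, here is roughly how I'm comparing them, as a minimal Python sketch; the keyspace, table and data-directory names are just placeholders, and the exact labels in the cfstats output may differ between Cassandra versions:

    import subprocess

    # Placeholder names -- adjust to your own cluster.
    KEYSPACE = "my_keyspace"
    TABLE = "my_table"
    DATA_DIR = "/var/lib/cassandra/data/" + KEYSPACE

    def run(cmd):
        """Run a command and return its stdout as text."""
        return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

    # nodetool cfstats reports sizes as Cassandra sees them (live/total SSTable space).
    cfstats = run(["nodetool", "cfstats", KEYSPACE + "." + TABLE])
    for line in cfstats.splitlines():
        if "Space used" in line or "partitions" in line.lower():
            print(line.strip())

    # du -sh reports what is physically on disk, which also includes snapshots and
    # not-yet-compacted SSTables -- one reason the numbers don't match.
    print(run(["du", "-sh", DATA_DIR]).strip())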

I also found out that, apart from the actual data, Cassandra stores various metadata in system-specific tables such as size_estimates, sstable_activity, etc. Does this metadata also keep growing with the data? What is the ratio of the space occupied by such metadata to the space occupied by the actual data in the database? Also, which particular configurations in cassandra.yaml (if any) should I keep in mind that might affect the size of the data?

A similar question was asked before, but I wasn't satisfied with the answer.

Vishal Sharma
  • Look at this: https://stackoverflow.com/questions/42736040/calculating-the-size-of-a-table-in-cassandra and look at the slides from the DS220 course at DataStax Academy – they have the corresponding formulas there... – Alex Ott Apr 03 '18 at 18:00
  • From DS220, it seems to me that the metadata will keep growing roughly linearly with the number of rows in the table. Therefore, the best method of estimating the per-row size of a table seems to be to first insert some rows of sample data into the table, then use du -sh on the data directory to find the change in size, and then divide that by the number of rows. – Vishal Sharma Apr 04 '18 at 14:20
  • What kind of metadata do you refer to? Partition keys, etc. are taken into account in the formulas. You only need to know the average size of the text or other variable-length fields. – Alex Ott Apr 04 '18 at 15:31
  • There are various tables, like size_estimates and sstable_activity (both in the system keyspace), that also store some data. I don't know whether the data these tables store increases with the number of rows in our database, but the formula given in DS220 (for calculating partition size on disk) also has a component, apart from the size of the primary keys and other columns: 8*Nv, where Nv is the number of values (explained a few slides earlier). I was assuming this component represents all the metadata that's added whenever we add a row (see the formula sketch after these comments). – Vishal Sharma Apr 05 '18 at 05:29
  • In the long run the system tables shouldn't have much influence, maybe except traces. The 8*Nv belongs to the calculation of the partition size of the table holding the data. – Alex Ott Apr 05 '18 at 06:42
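For reference, here is a rough Python sketch of the DS220-style partition size estimate discussed in the comments, as I understand the formula; the column sizes and row counts below are made-up examples, and the 8*Nv term is the per-value overhead (e.g. timestamps) referred to above:

    # DS220-style estimate of partition size on disk, as I understand it:
    #   Nv = Nr * (Nc - Npk - Ns) + Ns            (number of values)
    #   St = sum(partition key columns) + sum(static columns)
    #        + Nr * (sum(regular columns) + sum(clustering columns))
    #        + 8 * Nv                             (per-value overhead)

    def partition_size_bytes(pk_bytes, static_bytes, regular_bytes,
                             clustering_bytes, rows):
        """Estimate the on-disk size of a single partition, in bytes."""
        n_values = rows * len(regular_bytes) + len(static_bytes)
        return (sum(pk_bytes) + sum(static_bytes)
                + rows * (sum(regular_bytes) + sum(clustering_bytes))
                + 8 * n_values)

    # Made-up example: bigint partition key, no static columns, three regular
    # columns (~74 bytes per row), one 8-byte clustering column, 10,000 rows.
    size = partition_size_bytes(pk_bytes=[8], static_bytes=[],
                                regular_bytes=[8, 50, 16],
                                clustering_bytes=[8], rows=10_000)
    print("~%.1f MB per partition" % (size / 1024 / 1024))

An estimate like this can then be cross-checked against the du -sh approach from the comments: insert a known number of rows, measure the change in the data directory, and divide by the row count.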

1 Answer


If you are expecting 20 GB of data per day, here is the calculation.

1 day = 20 GB, 1 month = 600 GB, 1 year = 7.2 TB, so your raw data size for one year is 7.2 TB. With a replication factor of 3, that comes to around 21.6 TB of data for one year.

Taking compaction into consideration, and given that your use case is write-heavy, if you go with size-tiered compaction you would need roughly twice that space.

So you would need around 43 TB to 45 TB of disk space.
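To make the arithmetic explicit, here is the same back-of-the-envelope estimate as a short Python sketch; the 20 GB/day ingest, replication factor of 3, and ~2x size-tiered-compaction headroom are the assumptions stated above:

    # Capacity estimate using the assumptions above:
    # 20 GB/day ingest, replication factor 3, ~2x headroom for size-tiered compaction.
    DAILY_GB = 20
    DAYS_PER_YEAR = 365            # the figures above round to 12 * 30 = 360 days
    REPLICATION_FACTOR = 3
    STCS_HEADROOM = 2              # size-tiered compaction can temporarily need ~2x

    raw_tb = DAILY_GB * DAYS_PER_YEAR / 1000            # ~7.3 TB raw per year
    replicated_tb = raw_tb * REPLICATION_FACTOR         # ~21.9 TB with RF 3
    total_tb = replicated_tb * STCS_HEADROOM            # ~44 TB incl. headroom

    print("raw: %.1f TB, replicated: %.1f TB, with compaction headroom: %.1f TB"
          % (raw_tb, replicated_tb, total_tb))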

Srikant Patra