
I am using Cassandra 2.0.6 and have this table:

CREATE TABLE t (
    id text,
    idx bigint,
    data bigint,
    PRIMARY KEY (id, idx)
)

So say I got these rows:

id / idx / data
x    1     data1
x    2     data2
x    3     data3

.... and so on for, say, 1000 rows for x

If I query:

select * from t where id='x' order by idx limit 1

Will Cassandra fetch all 1000 rows, or only a small part of them?

Reading articles like http://www.ebaytechblog.com/2012/08/14/cassandra-data-modeling-best-practices-part-2/#.UzrvLKZx2PI , it seems it should fetch only a small part. But running some stress tests, the more data I have in the table, the more MB/sec of disk I/O I get.

For 8GB of data I was getting 3MB/sec of read I/O.
For 12GB of data I was getting 15MB/sec of read I/O.
For 20GB of data I am currently getting 35MB/sec of read I/O.

I don't see anything weird in cfhistograms:

SSTables per Read
1 sstables: 421010
2 sstables: 552
3 sstables: 9
4 sstables: 0
5 sstables: 254
6 sstables: 3221
7 sstables: 3063
8 sstables: 1029
10 sstables: 143

Read Latency (microseconds)
12 us: 6
14 us: 36
17 us: 471
20 us: 2795
24 us: 10799
29 us: 18594
35 us: 24693
42 us: 43078
50 us: 67438
60 us: 68872
72 us: 70718
86 us: 47300
103 us: 23471
124 us: 11752
149 us: 4509
179 us: 1437
215 us: 832
258 us: 3444
310 us: 7883
372 us: 2374
446 us: 736
535 us: 624
642 us: 581
770 us: 1875
924 us: 1715
1109 us: 2889
1331 us: 3705
1597 us: 2197
1916 us: 1320
2299 us: 826
2759 us: 639
3311 us: 431
3973 us: 312
4768 us: 213
5722 us: 106
6866 us: 72
8239 us: 44
9887 us: 36
11864 us: 25
14237 us: 16
17084 us: 23
20501 us: 20
24601 us: 15
29521 us: 28
35425 us: 21
42510 us: 20
51012 us: 49
61214 us: 49
73457 us: 29
88148 us: 23
105778 us: 35
126934 us: 23
152321 us: 17
182785 us: 13
219342 us: 10
263210 us: 8
315852 us: 3
379022 us: 8
454826 us: 10
kostas.kougios

3 Answers


You get more I/O because you are ordering and limiting on the fly. If you are sure about the order in which you want to fetch the data, use a clustering order on the column family at creation time:

CREATE TABLE tablename (.......) WITH CLUSTERING ORDER BY (idx DESC);

This way, your rows are stored ordered by idx in descending order by default. Hence, when you apply a LIMIT, you reduce the disk I/O.
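
For example, applied to the t table from the question (a sketch, mirroring the schema above):

CREATE TABLE t (
    id text,
    idx bigint,
    data bigint,
    PRIMARY KEY (id, idx)
) WITH CLUSTERING ORDER BY (idx DESC);

With this, select * from t where id='x' limit 1 returns the row with the highest idx without needing an ORDER BY clause.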

Ananth
  • don't I already have idx ordered (for the same id), since idx is the secondary primary key? – kostas.kougios Apr 01 '14 at 17:28
  • It will be, but always in ascending order. I have given the option to store it in descending order if that's your requirement. If ascending is fine, then you can query without ORDER BY, as you asked. – Ananth Apr 01 '14 at 17:34
  • aha, ok I see, let me give it a try – kostas.kougios Apr 01 '14 at 17:35
  • still I am getting a lot more I/O when my data is about 10GB than when it was, say, 1GB: ~15MB/sec compared to 3MB/sec. Though your comment is indeed useful, as my table should have been ordered by idx desc. – kostas.kougios Apr 01 '14 at 19:57
  • is there a way to find out what is causing the I/O? (I take care to take measurements when no compaction is occurring) – kostas.kougios Apr 01 '14 at 20:14
  • There is a difference between memtable flushing and compaction to bigger sstables. If you look at your cfstats, you can clearly see memtable flushes have happened many times and hence data is split across sstables. – Ananth Apr 02 '14 at 02:40
  • ok, I'll need to recheck the number of sstable files, but last time I did there were only ~5. I do notice a lot of compactions happening when my stress test runs, which should reduce the number of files. – kostas.kougios Apr 02 '14 at 11:51
  • I found out that I was accidentally exhausting the resultset iterator; I fixed that and now I/O is normal. – kostas.kougios Apr 09 '14 at 19:49

Once you have set the clustering order, the ordering time is saved. If you are still facing problems with large amounts of data, it will be due to the compaction strategy used. I suspect you are using size-tiered compaction on a read-heavy column family. Try the same scenario with the leveled compaction strategy.

When you use size-tiered compaction, you spread your data across multiple sstables and are bound to read from several of them each time. So a read-heavy column family doesn't fare well with it.
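
For example (a sketch; t is the table from the question, and the exact options available depend on your Cassandra version), you can switch an existing table to leveled compaction with:

ALTER TABLE t WITH compaction = { 'class' : 'LeveledCompactionStrategy' };

Size-tiered compaction is the default, so leveled compaction has to be requested explicitly, either at creation time or afterwards with ALTER TABLE.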

Ananth
  • Thanks, I'll give it a try (takes time to run a full test). How about the "SSTables per Read" figures above? They seem OK. But I also wonder what happens when I run a query that returns no results (so no key found). Will it try to find the values in all sstables? Does that have an I/O impact? – kostas.kougios Apr 02 '14 at 09:12
  • It won't have an I/O impact if you run a search that returns no results. The bloom filter will automatically reject such cases. As for your latest I/O results, it is bound to grow if you are trying to add everything to a single row. Try splitting the data. Wide rows don't mean good performance; they just mean availability. – Ananth Apr 02 '14 at 13:20
  • Thanks, but that only creates more questions: does that mean that if I run "select * from t where id='x' order by idx desc limit 1" on the above "t" table that has 1,000,000 rows for id='x', it will read a lot of data from the disk? What's the point of idx then being part of the primary key? And what's the availability benefit in this case? The data would be available anyway with Cassandra, no? – kostas.kougios Apr 02 '14 at 15:31
  • "The data for each partition is clustered by the remaining column or columns of the primary key definition. On a physical node, when rows for a partition key are stored in order based on the clustering columns, retrieval of rows is very efficient." http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_compound_keys_c.html. – kostas.kougios Apr 02 '14 at 15:40
  • in my "t" table, I have millions of rows with different id's but also thousands of rows with the same id and different idx. – kostas.kougios Apr 02 '14 at 15:50
  • The t table is now split into 8 sstable files. Reads now do 80MB/sec, an estimated 11KB per query. – kostas.kougios Apr 02 '14 at 19:29
  • hmm, even with leveled compaction, my table is still split into 15 files (after a heavy, write-only recreation of a 16GB database). I also tried nodetool compact but nothing happened. – kostas.kougios Apr 02 '14 at 22:55
  • If you are going across rows, your data doesn't come from a single node. Each row will be distributed across the nodes, so the Cassandra client that handles this request has to combine the results from all nodes to get you the result. I really don't understand your use case. I doubt your query pattern is optimised even for an RDBMS model. – Ananth Apr 03 '14 at 01:32
  • Leveled compaction creates sstables in advance. If your sstables under leveled compaction are all full, I suspect you are inserting across rows, i.e. growing wide rows and adding new rows at the same time. I don't know what real-world case would require that; kindly state your use case. Your NoSQL usage seems a bit awkward, in my assumption. – Ananth Apr 03 '14 at 01:36
  • well, the use case is that a user (with id of t.id) posts messages at time idx, with data holding the message. I would like to be able to get the latest messages but also to find all past messages (if required). I would like retrieval of the latest messages (idx desc) to be cheap for a given user id. – kostas.kougios Apr 03 '14 at 08:51
  • I think I found something: by default the DataStax driver prefetches 5000 rows. I just need 1-2; I'll try to modify its behavior and see what I get. – kostas.kougios Apr 03 '14 at 21:40
  • Can you share the portion of the docs that states that? – Ananth Apr 04 '14 at 05:39
  • Do you mean the default fetch size? I found it out while debugging the result set. Please see http://www.datastax.com/drivers/java/2.0/com/datastax/driver/core/ResultSet.html#getAvailableWithoutFetching%28%29 ; you can set it when you e.g. bind() a prepared statement. Anyway, in my case it wasn't what was creating the issue (though it will make my code much simpler), as I already had a limit on my queries. The limit meant I was getting 4 rows even though the prefetch default was 5000. – kostas.kougios Apr 04 '14 at 10:55
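
For reference, a minimal sketch of capping the fetch size with the DataStax Java driver 2.0 (the contact point and keyspace names are placeholders):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class FetchSizeExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")        // placeholder contact point
                .build();
        Session session = cluster.connect("mykeyspace"); // placeholder keyspace

        // Without an explicit fetch size, the driver pages in up to
        // 5000 rows per round trip by default.
        Statement stmt = new SimpleStatement(
                "select * from t where id='x' order by idx desc limit 1");
        stmt.setFetchSize(1); // only page in what is actually needed

        ResultSet rs = session.execute(stmt);
        Row row = rs.one();
        System.out.println(row);

        cluster.close();
    }
}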

I found out that I was accidentally exhausting the resultset iterator; I fixed that and now I/O is normal.
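
For anyone hitting the same thing, a sketch of the mistake and the fix (hypothetical code, not the original; assumes an open Session as in the sketch above):

import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

class LatestMessage {
    // Bug: iterating the whole ResultSet pages every row of the partition
    // in from the server, even when only the first row is needed:
    //
    //     for (Row r : session.execute("select * from t where id='x'")) { ... }
    //
    // Fix: limit the query and read just the first row.
    static Row latest(Session session) {
        ResultSet rs = session.execute(
                "select * from t where id='x' order by idx desc limit 1");
        return rs.one();
    }
}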

kostas.kougios