Cassandra query on secondary index is very slow

Question

We have a table with about 40k rows, and querying it on a secondary index is slow (30 seconds in production). We are running Cassandra 1.2.8. The table schema is as follows:

CREATE TABLE usertask (
  tid uuid PRIMARY KEY,
  content text,
  ts int
) WITH
  bloom_filter_fp_chance=0.010000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=864000 AND
  read_repair_chance=0.100000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'SnappyCompressor'};

CREATE INDEX usertask_ts_idx ON usertask (ts);

When I turn on tracing, I see a lot of lines like the following:

Executing single-partition query on usertask.usertask_ts_idx

With only 40k rows, it looks like there are thousands of queries on usertask_ts_idx. What could be the problem? Thanks.

More investigation

I tried the same query on our test server, and it is much faster (30 seconds on production, 1-2 seconds on the test server). Comparing the trace logs, the difference is the time spent seeking to the partition indexed section in the data file. In production each seek takes 1000-3000 microseconds; on the test server it takes about 100 microseconds. My guess is that our production server does not have enough memory to cache the data file, so seeking in it is slow.

How slow is slow? Just tried this on a single node cluster with 2M rows and it [took my cluster 11862 micro sec](https://gist.github.com/lyubent/7564180). — Lyuben Todorov, Nov 20 '13 at 14:47

score 7 · Answer 1 · answered Nov 25 '13 at 16:45

I am presuming ts is a timestamp, in which case this is not a good candidate for a secondary index. The reason is that it's a high cardinality value (i.e. all values are essentially unique). This means you'll end up with almost one row in the index for each row in usertask--effectively resulting in a join operation. Joins are terribly slow on a distributed database. Since you haven't shown your query I'm not sure exactly what you're doing, but you'll need to rethink your model if you want to query based on time.
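
As a rough sketch of what rethinking the model could look like: assuming ts holds seconds since the epoch and the queries ask for a time range, one common approach is a second table partitioned by a time bucket, with ts as a clustering column. The table name usertask_by_day and the day column are hypothetical, and the one-day bucket size is only an assumption; pick whatever matches the real query pattern.

```cql
-- Hypothetical table: one partition per day, rows ordered by ts inside it.
CREATE TABLE usertask_by_day (
  day int,        -- assumed bucket, computed by the client as ts / 86400
  ts int,
  tid uuid,       -- keeps rows with the same ts distinct
  content text,
  PRIMARY KEY (day, ts, tid)
);

-- Range read within a single day's partition; no secondary index is involved.
-- Example values: day 16029 covers ts 1384905600 up to (not including) 1384992000.
SELECT tid, ts, content
FROM usertask_by_day
WHERE day = 16029
  AND ts >= 1384905600
  AND ts < 1384992000;
```

With a layout like this, the application has to write each task to this table itself, and a range that spans several days needs one query per day bucket.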
