4

I have a column family with the secondary index. The secondary index is basically a binary field, but I'm using a string for it. The field called is_exported and can be 'true' or 'false'. After request all loaded rows are updated with is_exported = 'false'.

I'm polling this column table each ten minutes and exporting new rows as they appear.

But here the problem: I'm seeing that time for this query grows pretty linear with amount of data in column table, and currently it takes from 12 to 20 seconds (!!!) to find 5000 rows. From my understanding, indexed request should not depend on number of rows in CF but from number of rows per one index value (cardinality), as it's just another hidden CF like:

    "true" : rowKey1 rowKey2 rowKey3 ...
    "false": rowKey1 rowKey2 rowKey3 ...

I'm using Pycassa to query the data, here the code I'm using:

    column_family = pycassa.ColumnFamily(cassandra_pool, column_family_name, read_consistency_level=2)
    is_exported_expr = create_index_expression('is_exported', 'false')
    clause = create_index_clause([is_exported_expr], count = 5000)
    column_family.get_indexed_slices(clause)

Am I doing something wrong, but I expect this operation to work MUCH faster.

Any ideas or suggestions?

Some config info:

  • Cassandra 1.1.0
  • RandomPartitioner
  • I have 2 nodes and replication_factor = 2 (each server has a full data copy)
  • Using AWS EC2, large instances
  • Software raid0 on ephemeral drives

Thanks in advance!

bigdatarefiner
  • 141
  • 1
  • 7

1 Answers1

1

I don't know the internals of indexing in Cassandra but I'm under the assumption it behaves in a similar fashion to PostgreSQL / MySQL, were indexing boolean, true/false columns is redundant in many scenarios. If cardinality is low (true & false = 2 unique values) and data is distributed quite evenly, e.g. ~50% true and ~50% false, then the database engine will likely perform a full table scan (which doesn't utilize the indexes).

The linear relationship between query execution and data set size would further support that Cassandra is performing a full table (keyspace) scan.

Martin Gallagher
  • 4,444
  • 2
  • 28
  • 26
  • Thanks for the answer, but Cassandra is NoSQL storage and indexes are build in a very different way than Binary Trees in RDBMS. Cassandra's indexes are built on Bloom Filters as all other Column Families. I also have a very biased cardinality, so it always 98-100% of records have "false", and only 2% of records can "true" value, which I change to "false" after each export iteration. – bigdatarefiner Aug 28 '12 at 20:13
  • I'm not sure bloom filters + hash buckets are going to be more performant in this situation when compared to B-Trees. But you're right, a check for "true" where "true" covers 2% of the data set should benefit from a index scan - but again, due to the relationship between data set size and query time I think Cassandra is doing a full scan (its "optimizer" is likely more primitive than an established RDBMS). Also, have you tried to change string "true"|"false" to a boolean primitive? – Martin Gallagher Aug 28 '12 at 20:49