0

I am struggling with data order of Cassandra data. I have a table like this

tbl_data
 - yymmddhh (text)
 - data (text)

parting key is 'yymmddhh'

I am adding data like this

'16-11-17-01', 'a'
'16-11-17-01', 'b'
'16-11-17-02', 'c'
'16-11-17-03', 'xyz'
'16-11-17-03', 'e'
'16-11-17-03', 'f'

select * from tbl_data limit 10;

I am expecting data in the order in which I added data. But it is giving data like this

'16-11-17-03', 'f'
'16-11-17-03', 'e'
'16-11-17-01', 'a'

i.e. latest record first or some random order. I need data in the same order in which I added. I am not able to figure out the default order of the data in my case. Also I don't want to pass partition key in where condition because its overhead to remember that value for me. Kindly suggest me the solution.

Mohd Shahid
  • 1,538
  • 2
  • 33
  • 66
  • In Cassandra, You can't expecting order data by partition key or the order in which you added data. Cassandra order data by clustering column – Ashraful Islam Nov 17 '16 at 09:17
  • Cassandra will not work for your use case. You MUST filter by a partition key to enforce any kind of meaningful sort order on your result set. Full explanation can be found here: http://www.datastax.com/dev/blog/we-shall-have-order – Aaron Nov 17 '16 at 14:31

1 Answers1

4

I'm afraid you will struggle forever on this.

As per comments, you can't decide the order "outside" a partition, unless you really understand what you're doing by changing the partitioner.

Please have a read at the suggested link, and at this and this SO answers to understand why you are getting your records in this specific order (yes, they ARE ordered...).

A possible solution, however, is to add a timestamp clustering key, and change the partition key to a simpler "yymmdd":

tbl_data
 - yymmdd (timestamp)
 - hhmmssMMM (timestamp)
 - data (text)

Now you'd store data on day by day basis (that is you need to know the day you are querying data for), and the order of your data inside each partition (that is each day) is sorted by the timestamp column, so for your requirements you'd store there the insertion time of the record.

Now, if you don't insert data every day, you really need to keep track the insertion dates into another (very simple) table:

CREATE TABLE inserted_days (
    yymmdd timestamp PRIMARY KEY
);

Issuing a

SELECT * FROM inserted_days

would scan all this partition, returning records in random order (from you app point of view, so you need to sort it), but here we are talking of 365 records in year, something you don't need to worry about. It's easy to do and you'd not incur into unmanageable overheads.

HTH.

Community
  • 1
  • 1
xmas79
  • 5,060
  • 2
  • 14
  • 35