
Let's say we have a keyspace named sensors and a table named sensor_per_row. This table has the following structure:

sensor_id | ts | value

In this case sensor_id is the partition key and ts (the timestamp at which the record was created) is the clustering key.

 SELECT sensor_id, value, TODATE(ts) AS day, ts FROM sensors.sensor_per_row;

The outcome of this select is:

  sensor_id | value | day        | ts
 -----------+-------+------------+---------------
   Sensor 2 |  52.7 | 2019-01-04 | 1546640464138
   Sensor 2 |  52.8 | 2019-01-04 | 1546640564376
   Sensor 2 |  52.9 | 2019-01-04 | 1546640664617

How can I group the data by ts, or more specifically by date, and return the daily average value for each sensor using cqlsh? For instance:

 sensor_id | system.avg(value) | day
-----------+-------------------+------------
  Sensor 2 |          52.52059 | 2018-12-11
  Sensor 2 |          42.52059 | 2018-12-10
  Sensor 3 |          32.52059 | 2018-12-11

One way, I guess, is to use a UDF (user-defined function), but such a function runs only on one row. Is it possible to select data inside a UDF? Another way is to use Java or a similar language, with multiple queries for each day, or to process the data at some other contact point such as a REST web service, but I don't know how efficient that would be... any suggestions?

2 Answers


NoSQL Limitations

While working with NoSQL databases, we generally have to give up:

  1. Some ACID guarantees.
  2. Strong consistency (the C in CAP).
  3. Shuffle operations such as JOIN and GROUP BY.

You may perform the above operations by reading the rows from the table and aggregating (summing, averaging) on the client side.

You can also refer to the answer MAX(), DISTINCT and group by in Cassandra

Raj Parekh
Yes, I have already read this topic. In the latest versions of Cassandra (3.x+), GROUP BY is included and indeed groups the data. But the question here is how to get the average from the data grouped by date. I guess the only solution is using a back-end language and processing the data there. Anyway, thanks for your answer, I appreciate it :) – AbstactVersion Jan 05 '19 at 16:20
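For reference, the GROUP BY support mentioned in the comment (available since Cassandra 3.10) only accepts primary-key columns, in key order. So with the original table, whose partition key is sensor_id, something like the following sketch can average per sensor, but not per calendar day, since day is not part of the key:

```sql
-- Averages over whole partitions, i.e. one row per sensor -
-- this cannot be narrowed to "per day" with this table design.
SELECT sensor_id, avg(value)
FROM sensors.sensor_per_row
GROUP BY sensor_id;
```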

So I found the solution; I will post it in case somebody else has the same question. From what I read, data modeling seems to be the answer. Which means:

In Cassandra we have partition keys and clustering keys. Cassandra can handle multiple inserts simultaneously, which gives us the possibility of inserting the same data into more than one table at the same time. This pretty much means we can create several tables for the same data-collection application and use them much like materialized views (as in MySQL).

For instance, let's say we have the log schema {sensor_id, region, value}. The first thing that comes to mind is to create a table called sensor_per_row like:

     sensor_id | value | region     | ts
    -----------+-------+------------+---------------
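The raw-history table described above could be sketched like this (column types are assumptions, since the question only shows the column names):

```sql
-- Hypothetical DDL for the raw-history table: one partition per sensor,
-- rows ordered within the partition by insertion timestamp.
CREATE TABLE sensors.sensor_per_row (
    sensor_id text,
    ts timestamp,
    region text,
    value double,
    PRIMARY KEY (sensor_id, ts)
);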

This is a very efficient way of storing the data for a long time, but given Cassandra's query limitations it is not that simple to visualize the data or compute analytics on it. Because of that, we can create additional tables with a TTL (TTL stands for time to live), which simply determines how long the data will be stored.

For instance, if we want the daily measurements of a specific sensor, we can create a table with day and sensor_id as the partition key and the timestamp as the clustering key in descending order.

If we also add a TTL value of 24*60*60 = 86400 seconds, which stands for one day, each row expires after a day, so the table only ever holds the current day's data.

So creating, let's say, a table sensor_per_day with the above format and TTL will actually give us the daily measurements. At the end of the day the table's rows expire and are replaced by newer measurements, while the full history remains stored in the previous table, sensor_per_row.
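The daily table described above could be sketched as follows (the table name sensor_per_day comes from the answer; column types and the use of the table-level default_time_to_live option are assumptions):

```sql
-- Hypothetical daily table: (day, sensor_id) is a composite partition key,
-- ts is the clustering key in descending order, and every row expires
-- after 86400 s (one day) via the table-level default TTL.
CREATE TABLE sensors.sensor_per_day (
    day date,
    sensor_id text,
    ts timestamp,
    value double,
    PRIMARY KEY ((day, sensor_id), ts)
) WITH CLUSTERING ORDER BY (ts DESC)
  AND default_time_to_live = 86400;

-- With day in the partition key, the daily average from the question
-- becomes a plain GROUP BY over the partition-key columns:
SELECT sensor_id, avg(value), day
FROM sensors.sensor_per_day
GROUP BY day, sensor_id;
```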

I hope this gives you the idea.