Count distinct values

Question

consider the following table :

CREATE TABLE users (
  date timestamp,
  user_id text,
  PRIMARY KEY (date, user_id)
);

with the following data for example:

date       user_id

25Aug2013    1
25Aug2013    2
25Aug2013    1
25Aug2013    3

26Aug2013    1
26Aug2013    2

27Aug2013    2
27Aug2013    3
27Aug2013    4

28Aug2013    1
28Aug2013    2
28Aug2013    1
28Aug2013    3

How can I count the number of unique user_id ?

This answer might be very useful: http://stackoverflow.com/a/33223749/7413631 — Marko Švaljek, Apr 18 '17 at 14:16
This type of query not support in cassandra. You can use spark to do this type of query. — Ashraful Islam, Apr 18 '17 at 14:19

score 1 · Answer 1 · answered Apr 18 '17 at 17:30

An idea could be to use a set collection:

CREATE TABLE stats_unique (
  stat_group text,
  user_ids set<text>,
  PRIMARY KEY (stat_group)
);

Inserts will automagically remove the duplicates from the collection, and the select will retrieve all the ids at once, so you count at application level.

If you are only interested in the number of unique user_ids without actually retrieving them from the disks I'm afraid you'll have to change a little more than application code.

And remember to have a deep look into the collections limitations.

score 0 · Answer 2 · edited May 23 '17 at 12:25

In comments I mentioned more or less stuff related to question but I would like to make a remark.

Personally when I was in similar situation with cassandra I abused the properties it has and this is sort of a hack but I figured it might be "useful" in this context.

Basically I created a single side table where I was putting all the unique stuff. i.e.

CREATE TABLE stats_unique (
  stat_group text,
  user_id text,
  PRIMARY KEY (stat_group, user_id)
);

Writes are usually cheap and I had no trouble with additional simple write, after all cassandra was built for this. So every time I inserted to base table I also inserted into the stats_unique table. For your example it would be something like:

INSERT INTO stats_unique (stat_group, user_id) VALUES ('users', '1');
INSERT INTO stats_unique (stat_group, user_id) VALUES ('users', '2');
INSERT INTO stats_unique (stat_group, user_id) VALUES ('users', '1');
INSERT INTO stats_unique (stat_group, user_id) VALUES ('users', '3');

INSERT INTO stats_unique (stat_group, user_id) VALUES ('users', '1');
INSERT INTO stats_unique (stat_group, user_id) VALUES ('users', '2');

INSERT INTO stats_unique (stat_group, user_id) VALUES ('users', '2');
INSERT INTO stats_unique (stat_group, user_id) VALUES ('users', '3');
INSERT INTO stats_unique (stat_group, user_id) VALUES ('users', '4');

INSERT INTO stats_unique (stat_group, user_id) VALUES ('users', '1');
INSERT INTO stats_unique (stat_group, user_id) VALUES ('users', '2');
INSERT INTO stats_unique (stat_group, user_id) VALUES ('users', '1');
INSERT INTO stats_unique (stat_group, user_id) VALUES ('users', '3');

And then when I needed the uniques I just issued a simple req like:

SELECT COUNT(1) FROM stats_unique WHERE stat_group = 'users';

 count
-------
     4

(1 rows)

This is not a standard solution by no means, but it was something that worked in my particular case. Take into account that I couldn't hold more than couple of millions of stuff in this single partition but the system simply didn't have to support that much entity instances so for my use case it was good enough. Also with this hack you might run into problems like timeouts for counting etc.

It would be best to have something on the side to do this count, either separate process, script or even as Ashraful Islam menioned it in his comment a spark process that would do the count for you and put it to some other table in cassandra or other storage technology.

What I used might be cassandra anti pattern (hot row etc.) but it worked for me.

But this means changing the Application code for this query , which I do not want. a valid option can be creating temp method or table but without changing my app , just query the db — igx, Apr 18 '17 at 16:16
I perfectly understand this. Again this was something that was handy for me. But I guess you could create a similar structure and then create a trigger that would insert the data above into `temp table` as you call it. Here is an interesting answer on triggers http://stackoverflow.com/a/35512606/7413631 — Marko Švaljek, Apr 18 '17 at 16:42

Count distinct values

2 Answers2