
Trying to define the right schema/table for our scenario: we have a few hundred eCommerce sites, each with a unique siteId.

Each site has its own end-users, up to 10M unique users per month. Each user has a unique userId.

Each end-user interacts with the site: viewing products, adding products to the cart and purchasing products (we call these user events). I want to store the activity of the last 30 days (or 180 days if possible).

Things to consider:

  • Site sizes are different! We have some "heavy" sites with 10M end-users, but we also have "light" sites with only a few hundred or thousand users.
  • Events don't have unique ids.
  • Users can have more than one event at a time; for example, they can view a page with more than one product (but we could live without that restriction to simplify).
  • Rough estimation: 100 Customers x 10M EndUsers x 100 Interactions = 100,000,000,000 rows (per month)
  • Writes happen in real time (when the event arrives at the server). Reads happen far less often (about 1% of the events).
  • Events carry some more metadata, and different event types (view/purchase/...) have different metadata.
  • Should we use a keyspace to separate the sites and manage a table per site, or keep all customers in one table?
  • How should the key be defined here?

    +--------+---------+------------+-----------+-----------+-----------+
    | siteId | userId  | timestamp  | eventType | productId | other ... |
    +--------+---------+------------+-----------+-----------+-----------+
    |      1 | u-1001  | 1501234567 | view      | abc       |           |
    |      1 | u-1001  | 1501234568 | purchase  | abc       |           |
    +--------+---------+------------+-----------+-----------+-----------+
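
(For reference only: rendered as CQL, the sketch above might look like the draft below. The column types are my assumptions, and the PRIMARY KEY shown is just a placeholder, since choosing the right key is exactly the open question.)

    -- Hypothetical CQL draft of the table sketched above; types assumed.
    -- The PRIMARY KEY is only a placeholder: picking the right
    -- partition/clustering split is what this question is about.
    CREATE TABLE user_events_draft (
        site_id    bigint,
        user_id    text,
        event_time timestamp,   -- the epoch-seconds "timestamp" above
        event_type text,
        product_id text,
        PRIMARY KEY ((site_id, user_id), event_time)
    );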
    

My query is: get all events (and their metadata) of a specific user. As assumed above, around 100 events.

Edit 2: I guess it wasn't clear, but the uniqueness of users is per site; two different users might have the same id if they are on different sites.

ItayB
  • Highly depends on your needs. What do your read queries look like? Try to design your tables and keys around those queries. Don't be afraid of data duplication. – TobiSH Mar 02 '18 at 14:35
  • To echo what @TobiSH said, it'll be next to impossible to help you without seeing the queries that you are required to serve. – Aaron Mar 02 '18 at 15:59
  • Thanks, I just updated my question – ItayB Mar 02 '18 at 17:56

2 Answers


If you want to query by userid, then userid should be the first part of your compound primary key (it becomes the partition key). Use a compound primary key to get clustering columns that you can query to return sorted results. I would suggest the following schema:

CREATE TABLE user_events (
       userid bigint,
       timestamp timestamp,
       event_type text,
       site_id bigint,
       product_id bigint,
PRIMARY KEY (userid, site_id, timestamp, product_id));

That should make queries like

SELECT * FROM user_events WHERE userid = 123 AND site_id = 456;

quite performant. By adding the timestamp to the PK you can also easily LIMIT your queries to get only the top (latest) 1000 events (or however many you need), without running into performance issues caused by highly active users (or bots) with a very long history.
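
For instance (the values here are illustrative, not from the original answer):

    -- Fetch at most 1000 events for this user/site.
    -- Note: LIMIT returns rows in clustering order, which is ascending
    -- by default, i.e. the *oldest* events first; define the table
    -- WITH CLUSTERING ORDER BY so that timestamp sorts DESC to get the
    -- latest events instead.
    SELECT * FROM user_events
    WHERE userid = 123 AND site_id = 456
    LIMIT 1000;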

One thing to keep in mind: I would recommend having the userid, or a composite of userid and site_id, as the partition key (the first part of the primary key). That will prevent your rows from becoming too big.

So an alternative design would look like this:

CREATE TABLE user_events (
       userid bigint,
       timestamp timestamp,
       event_type text,
       site_id bigint,
       product_id bigint,
PRIMARY KEY ( (userid, site_id), timestamp, product_id));

The "downside" of this approach is that you always have to provide user and site-id. But I guess that is something that you have to do anyways, right?

One more thing to point out: the partition key (also called the row key) identifies a row, and a row stays on a specific node. For this reason it is a good idea to keep rows more or less the same size. A row with a couple of thousand or even tens of thousands of columns is not really a problem. You will get problems if some rows have millions of columns while others have only 10-20. That will make the cluster imbalanced, and it also makes the row caches less effective. In your example I would suggest avoiding site_id alone as the partition key (row key).

Does that make sense to you? Maybe the excellent answer to this post gives you some more insight: difference between partition-key, composite-key and clustering-key. A closer look at this part of the DataStax documentation offers some more details.

Hope that helps.

TobiSH
  • I think it was unclear in the question, but users of different sites might have the same id – ItayB Mar 03 '18 at 07:34
  • Then I would also add the site_id to the primary key. – TobiSH Mar 03 '18 at 07:48
  • just to clarify, what's the difference between `( (userid, site_id), timestamp, product_id)` and `( (site_id, userid), timestamp, product_id)`? will it result in better partitioning (since I have some "heavy" sites and some "light" ones)? – ItayB Mar 04 '18 at 14:19
  • @ItayB The tuple (userid, site_id) identifies a row. I *guess* there should be no difference from the tuple (site_id, userid). But there is for sure a difference if you have only the userid or only the site_id as the first part of your PK – TobiSH Mar 05 '18 at 10:31
  • I'm sure about the latter, but I wonder how the partitioning/balancing works... never mind, that's for a different question – ItayB Mar 05 '18 at 12:53
  • I guess there is no general answer to that. What actually happens is that Cassandra takes the partition key and calculates a hash value from it, which in turn determines the data locality. So it depends on the hash function how it deals with these keys. Cassandra uses Murmur3 (as the default) for that. So you could just try how it behaves with the different keys; see the sketch below. – TobiSH Mar 05 '18 at 21:44
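
(As a small illustration of the hashing described in the comment above: CQL's built-in token() function exposes the token Cassandra computes from the partition key. The query targets the alternative user_events table from this answer, and the values are only examples.)

    -- token() returns the Murmur3 token of the partition key
    -- (userid, site_id), which determines which node(s) own the row.
    SELECT userid, site_id, token(userid, site_id)
    FROM user_events
    WHERE userid = 123 AND site_id = 456;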

My query is: get all events (and their metadata) of a specific user. As assumed above, around 100 events.

So, you want all the events of a given user. Since each user has a unique id per site, you can build the table with site_id and userid as the (composite) partition key and timestamp as the clustering key. Here is the table structure:

    CREATE TABLE user_events_by_time (
         userid bigint,
         timestamp timestamp,
         event_type text,
         product_id bigint,
         site_id bigint,
         PRIMARY KEY ((site_id,userid), timestamp)
    ) WITH CLUSTERING ORDER BY (timestamp DESC) ;

Now you can query all of a user's events in a given time range using the following query:

SELECT * FROM user_events_by_time WHERE site_id = <site_id> AND userid = <user_id> AND timestamp > <from_time> AND timestamp < <to_time>;
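
As an aside that is not part of this answer: the 30-day retention mentioned in the question could be handled with a table-level default TTL, for example:

    -- Sketch: expire events automatically after 30 days (2592000 s).
    -- This addresses the retention requirement from the question; it is
    -- an assumption, not something proposed in the original answer.
    ALTER TABLE user_events_by_time
    WITH default_time_to_live = 2592000;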

Hope this solves your problem.

MD Ruhul Amin
  • I think it was unclear in the question, but users of different sites might have the same id – ItayB Mar 03 '18 at 07:34
  • @ItayB updated the answer per your comment. Remember you always need to provide site_id and userid in each query, as the two are inseparable ;). – MD Ruhul Amin Mar 03 '18 at 07:39