Trying to define the right schema / table for our scenario:
We have a few hundred eCommerce sites, each with a unique siteId.
Each site has its own end users, up to 10M unique users per month. Each user has a unique userId.
Each end user interacts with the site: viewing products, adding products to the cart, and purchasing products (we call these user events). I want to store the activity of the last 30 days (or 180 days if possible).
Things to consider:
- Site sizes differ a lot! We have some "heavy" sites with 10M end users, but we also have "light" sites with a few hundred or a few thousand users.
- Events don't have unique ids.
- Users can have more than one event at the same time, for example they can view a page with more than one product (but we could live without supporting that, to simplify).
- Rough estimation: 100 Customers x 10M EndUsers x 100 Interactions = 100,000,000,000 rows (per month)
- Writes are done in real time (when the event arrives at the server). Reads are much less frequent (1% of the events).
- Events carry some more metadata, and different event types (view/purchase/...) have different metadata.
- Open design choice: use a keyspace to separate the sites and manage a table per site, vs. keeping all customers in one table.
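To make the rough estimate above easier to check, here is the same arithmetic as a small Python sketch. The constant names are mine, and it assumes the worst case where every site is a "heavy" one and a month is about 30 days, so treat the results as an upper bound rather than a measurement.

```python
# Back-of-the-envelope row counts, using the rough figures from the list above.
SITES = 100                  # customers (sites)
USERS_PER_SITE = 10_000_000  # worst case: every site is a "heavy" site
EVENTS_PER_USER = 100        # interactions per user per month

rows_per_month = SITES * USERS_PER_SITE * EVENTS_PER_USER
rows_180_days = rows_per_month * 6  # 180 days treated as six 30-day months

print(f"{rows_per_month:,} rows for 30 days")
print(f"{rows_180_days:,} rows for 180 days")
```

In practice most sites are "light", so the real volume should be well below this.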
How to define the key here?
+--------+---------+------------+-----------+-----------+-----------+
| siteId | userId  | timestamp  | eventType | productId | other ... |
+--------+---------+------------+-----------+-----------+-----------+
| 1      | Value 2 | 1501234567 | view      | abc       |           |
| 1      | cols    | 1501234568 | purchase  | abc       |           |
+--------+---------+------------+-----------+-----------+-----------+
My query is: get all events (and their metadata) of a specific user. As I estimated above, that is around 100 events per user.
Edit2: I guess it wasn't clear, but user uniqueness is per site: two different users might have the same id if they are on different sites.
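To illustrate the access pattern I have in mind, here is a minimal in-memory sketch in Python. It is only a model of the lookup, not a database schema and not a claim about the right key: it assumes events are grouped by the pair (siteId, userId) and read back ordered by timestamp. The function names and the user id "u1" are made up for the example.

```python
from collections import defaultdict

# (siteId, userId) -> list of events. A plain in-memory stand-in, not a real database.
events_by_user = defaultdict(list)

def record_event(site_id, user_id, timestamp, event_type, product_id, **metadata):
    """Append an event; events have no unique id, so duplicates are allowed."""
    event = {"timestamp": timestamp, "eventType": event_type,
             "productId": product_id, **metadata}
    events_by_user[(site_id, user_id)].append(event)

def events_for_user(site_id, user_id):
    """Return all events of one user on one site, oldest first."""
    found = events_by_user.get((site_id, user_id), [])
    return sorted(found, key=lambda e: e["timestamp"])

# The same userId on two different sites must not be mixed up.
record_event(1, "u1", 1501234568, "purchase", "abc")
record_event(1, "u1", 1501234567, "view", "abc")
record_event(2, "u1", 1501234569, "view", "xyz")

for event in events_for_user(1, "u1"):
    print(event)
```

Because the lookup key includes siteId, the user "u1" on site 2 stays separate from "u1" on site 1, and two events with the same timestamp can coexist since they are stored in a list.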