My data model is really simple. It models visits to web pages.
This is what my Visit model looks like (the syntax is express-cassandra schema syntax):
fields: {
id: {
type: 'uuid',
rule: {
required: true,
message: 'id is required'
}
},
userId: {
type: 'int',
rule: {
required: true,
message: 'userId is required'
}
},
dateOfVisit: {
type: 'timestamp',
rule: {
required: true,
message: 'dateOfVisit is required'
}
},
urlPort: 'int',
urlHost: {
type: 'text',
rule: {
required: true,
message: 'urlHost is required'
}
},
urlPath: 'text',
urlQuery: 'text',
urlProtocol: {
type: 'text',
rule: {
required: true,
message: 'urlProtocol is required'
}
},
urlHash: 'text',
pageTitle: 'text'
},
key: [['id'], 'dateOfVisit'],
clustering_order: {'dateOfVisit': 'desc'}
I have a few questions about this model:
Question #1:
I'm pretty sure I want to store the parts of the URL instead of storing the whole URL as a single string because it allows me to more easily run queries for visits to a specific domain, visits to a specific path within a domain, visits to secure pages vs. visits to insecure pages, links from secure pages to insecure pages (or the inverse), etc.
But, would it be better to store the parts of the URL as A) individual columns or B) as a single Map column.
Also, will I have to create additional tables with different primary keys just to support all the various queries from querying on different parts of the url?
Question #2
I'm going to have a number of different ways that I'm going to want to query the data.
- Get all visits across all users
- Get all visits for a single user
- Get all visits on a given day or bucketed by hour within a given day
- Get all visits to a given domain
- Count all visits to a given domain grouped by path
So, given the various different types of queries, how should I store this model?
I currently have essentially the exact same fields stored in multiple tables just with different primary keys (one table with just (id) to support "get all visits across all users", another table with (id, userId) to support "get all visits for a specific user", etc.
This just feels like it creates multiple copies of the data just to support essentially the same query but with one additional condition to the where clause.
Is there a better way to model this?