I have a site with a bunch of users, and a bunch of "nodes" (content). Each node can be downloaded, and besides the particular node id in question, each download has a "license" associated with it (so a user can download node 5 for 'commercial use' or for 'personal use', etc.), as well as a price for each license.
My goal is to keep track of downloads in such a way that allows me to:
- Get the number of downloads for a given node id and license id over a given time period (how many times has node 5 been downloaded in the last month for 'commercial use'?).
- Get the total number of downloads for a given node id and license id.
- Get the number of downloads for a given node_id regardless of license (all downloads for 'commercial use' and 'personal use' combined).
- Get the node ids (and corresponding license ids) downloaded by a given user that meet a given price criterion (e.g. price = 0, or price > 0).
This is trivial data to store if optimization doesn't matter, but my issue is one of normalization/optimization for tables that may easily grow to millions of rows. Specifically, assume that:
- Number of downloads is in the tens of millions.
- Number of nodes is in the hundreds of thousands.
- Number of users is in the tens of thousands.
I'm fairly new to any "real" MySQL work, so I appreciate your help, and pointing out where I'm being stupid. Here's what I've got so far:
all_downloads table
+-------------+---------+------------+---------+-----------+-------+
| download_id | node_id | license_id | user_id | timestamp | price |
+-------------+---------+------------+---------+-----------+-------+
download_id is a unique key for this table. This table is a problem, because it could potentially have tens of millions of rows.
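To make that concrete, here is a rough sketch of the table as I picture it, plus two of the counting queries from my list. The column types (including DECIMAL for price) and the license id 2 are assumptions made up for illustration:

```sql
-- Sketch only: column types are assumptions, not a final schema.
CREATE TABLE all_downloads (
  download_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  node_id     INT UNSIGNED NOT NULL,
  license_id  INT UNSIGNED NOT NULL,
  user_id     INT UNSIGNED NOT NULL,
  `timestamp` INT NOT NULL,                 -- Unix epoch seconds
  price       DECIMAL(10,2) NOT NULL DEFAULT 0.00,
  PRIMARY KEY (download_id)
) ENGINE=InnoDB;

-- Total downloads of node 5 under one license.
SELECT COUNT(*) FROM all_downloads WHERE node_id = 5 AND license_id = 2;

-- Total downloads of node 5 regardless of license.
SELECT COUNT(*) FROM all_downloads WHERE node_id = 5;
```

With download_id as the only key, neither query has an index to use on node_id, which is part of what worries me.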
downloads_counted table
Instead of adding up the total number of downloads for a given node and license by querying the all_downloads table, the downloads are counted during a cron run, and those numbers are stored separately in a downloads_counted table:
+---------+------------+-----------------+-----------------+----------------+
| node_id | license_id | downloads_total | downloads_month | downloads_week |
+---------+------------+-----------------+-----------------+----------------+
The license id situation is new (formerly there was only one license, so licenses were not tracked in the database), so that's something I'm just trying to figure out how to work with now. In the past, node_id was a unique key for this table. I'm assuming that what I should do now is make the combination of node_id and license_id into a composite primary key. Or is it just as well to leave node_id as the only (now non-unique) key for this table, grab all rows for a given node_id, and then parse the results in PHP (separating or combining downloads for each particular license)? Is it within best practice to have a table with no unique key?
In any case, I think this table is mostly okay, as it shouldn't grow to more than 1 or 2 million rows.
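Roughly, I picture the cron job doing something like the following. This is only a sketch: it assumes (node_id, license_id) is a unique or primary key on downloads_counted, so that REPLACE overwrites the existing row, and that timestamp holds Unix epoch seconds:

```sql
-- Rebuild the per-node, per-license counters from the raw download rows.
REPLACE INTO downloads_counted
  (node_id, license_id, downloads_total, downloads_month, downloads_week)
SELECT
  node_id,
  license_id,
  COUNT(*),
  -- In MySQL a comparison evaluates to 1 or 0, so SUM() counts matching rows.
  SUM(`timestamp` >= UNIX_TIMESTAMP(NOW() - INTERVAL 1 MONTH)),
  SUM(`timestamp` >= UNIX_TIMESTAMP(NOW() - INTERVAL 1 WEEK))
FROM all_downloads
GROUP BY node_id, license_id;
```

Without a unique key on the pair, REPLACE would just append new rows on every run, which is one reason I lean towards the composite key.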
The question of returning downloads for a given user
This is the main area where I need help. I have considered just making the user_id a key in the all_downloads table, and simply querying for all rows that contain a given user_id. But I am concerned about querying this table in the long run, as it will be very large from the start, and could easily grow to tens of millions of rows.
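In SQL terms, the approach I'm considering would look roughly like this. The index name and the user id 42 are made up for illustration:

```sql
-- Make user_id a (non-unique) key on the big table.
ALTER TABLE all_downloads ADD INDEX idx_user (user_id);

-- Free downloads for one user.
SELECT DISTINCT node_id, license_id
FROM all_downloads
WHERE user_id = 42 AND price = 0;

-- Paid downloads for the same user.
SELECT DISTINCT node_id, license_id
FROM all_downloads
WHERE user_id = 42 AND price > 0;
```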
I have considered creating a user_downloads table that would look something like this:
+---------+-----------+
| user_id | downloads |
+---------+-----------+
Where downloads would be a serialized array of node_ids and associated license ids and prices like so (5 is the node_id and would be the index within the top-level array of node_ids):
$downloads = array('5' => array('license' => array('personal', 'commercial'), 'price' => 25));
I realize storing arrays of data in a single cell is considered bad practice, and I'm not sure that it would improve performance, since the array of downloads could easily grow into the thousands for a given user. However, I'm not sure how to create another table structure that would be more efficient than my all_downloads table at getting the downloads for a given user.
Any and all help is much appreciated!
====================================
Followup questions to Bill Karwin's answer:
timestamp is unfortunately going to be a unix timestamp stored in an int(11), rather than a datetime (to conform to Drupal standards). I assume that doesn't really change anything from an optimization standpoint?
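For example, with an integer column I would expect to write the "last month" count like this, comparing the raw column against a value computed once instead of converting every row with FROM_UNIXTIME() (license id 2 is a made-up value):

```sql
-- Downloads of node 5 under one license during the last month.
SELECT COUNT(*)
FROM all_downloads
WHERE node_id = 5
  AND license_id = 2
  AND `timestamp` >= UNIX_TIMESTAMP(NOW() - INTERVAL 1 MONTH);
```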
node_id/license_id/user_id (your idea for a clustered primary key) is not guaranteed to be unique, because users are allowed to download the same node under the same license as many times as they want. This was my primary reason for having a unique download_id for each row... is there a special reason that having a download_id would hurt performance? Or would it be acceptable to make the primary key a cluster of download_id/node_id/license_id/user_id? Or will having the download_id as the first part of the compound key throw off its usefulness?
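To make the options concrete, these are the two orderings I can imagine, written as alternatives (run one or the other, not both). The second ordering is my guess at a workaround, not something taken from your answer; as far as I understand, MySQL needs an AUTO_INCREMENT column to be the first column of some index, hence the extra KEY there:

```sql
-- Option A: download_id first, as described above.
ALTER TABLE all_downloads
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (download_id, node_id, license_id, user_id);

-- Option B: download_id last, only there to make the key unique.
ALTER TABLE all_downloads
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (node_id, license_id, user_id, download_id),
  ADD KEY (download_id);
```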
Do you think it still makes sense to have a downloads_counted table, or would that be considered redundant? My thinking is that it would still help performance, since download counts (downloads total, this week, this month, etc.) are going to be showing up very frequently on the site, and the downloads_counted table would have one or two orders of magnitude fewer rows than the all_downloads table.
My idea for the downloads_counted table:
CREATE TABLE downloads_counted (
  node_id         INT UNSIGNED NOT NULL,
  license_id      INT UNSIGNED NOT NULL,
  downloads_total INT UNSIGNED NOT NULL,
  downloads_month INT UNSIGNED NOT NULL,
  downloads_week  INT UNSIGNED NOT NULL,
  downloads_day   INT UNSIGNED NOT NULL,
  PRIMARY KEY (node_id, license_id),
  KEY (node_id)
) ENGINE=InnoDB;
The secondary key on node_id is for getting all downloads for all licenses for a given node_id... is this key redundant, though, if node_id is already the first part of the compound primary key?
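The query that key is meant to serve is below (node id 5 is just an example). My understanding is that MySQL can use the leftmost column of a composite index on its own, so I would expect the `key` column of the EXPLAIN output to tell me whether PRIMARY alone already handles this:

```sql
-- Which index does MySQL pick for a lookup on node_id alone?
EXPLAIN
SELECT license_id, downloads_total, downloads_month, downloads_week
FROM downloads_counted
WHERE node_id = 5;
```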