
I have a site with a bunch of users, and a bunch of "nodes" (content). Each node can be downloaded, and besides the particular node id in question, each download has a "license" associated with it (so a user can download node 5 for 'commercial use' or for 'personal use', etc.), as well as a price for each license.

My goal is to keep track of downloads in such a way that allows me to:

  • Get the number of downloads for a given node id and license id over a given time period (how many times has node 5 been downloaded in the last month for 'commercial use'?).
  • Get the total number of downloads for a given node id and license id.
  • Get the number of downloads for a given node_id regardless of license (all downloads for 'commercial use' and 'personal use' combined).
  • Get the node ids (and corresponding license ids) that have been downloaded by a given user and that meet a given price criterion (e.g. price = 0, or price > 0).

Trivial data to store if optimization doesn't matter, but my issue is one of normalization/optimization for tables that may easily grow to millions of rows. Specifically, assume that:

  • Number of downloads is in the tens of millions.
  • Number of nodes is in the hundreds of thousands.
  • Number of users is in the tens of thousands.

I'm fairly new to any "real" MySQL work, so I appreciate your help, and your pointing out where I'm being stupid. Here's what I've got so far:

all_downloads table

   +-------------+---------+------------+---------+-----------+-------+
   | download_id | node_id | license_id | user_id | timestamp | price |
   +-------------+---------+------------+---------+-----------+-------+

download_id is a unique key for this table. This table is a problem, because it could potentially have tens of millions of rows.

downloads_counted table

Instead of adding up the total number of downloads for a given node and license by querying the all_downloads table, the downloads are counted during cron run, and those numbers are stored separately in a downloads_counted table:

   +---------+------------+-----------------+-----------------+----------------+
   | node_id | license_id | downloads_total | downloads_month | downloads_week |
   +---------+------------+-----------------+-----------------+----------------+

The license id situation is new (formerly there was only one license, so licenses were not tracked in the database), so that's something I'm just trying to figure out how to work with now. In the past, node_id was a unique key for this table. I'm assuming that what I should do now is make the combination of node_id and license_id into a unique primary key. Or is it just as well to leave node_id as the only key for this table, and grab all rows for a given node_id, then parse the results in php (separating or combining downloads for each particular license)? Is it within best practice to have a table with no unique key?

In any case, I think this table is mostly okay, as it shouldn't grow to more than 1 or 2 million rows.

The question of returning downloads for a given user

This is the main area where I need help. I have considered just making the user_id a key in the all_downloads table, and simply querying for all rows that contain a given user_id. But I am concerned about querying this table in the long run, as it will be very large from the start, and could easily grow to tens of millions of rows.

I have considered creating a user_downloads table that would look something like this:

   +---------+-----------+
   | user_id | downloads |
   +---------+-----------+

Where downloads would be a serialized array of node_ids and associated license ids and prices like so (5 is the node_id and would be the index within the top-level array of node_ids):

$downloads = array(5 => array('license' => array('personal', 'commercial'), 'price' => 25));

I realize storing arrays of data in a single cell is considered bad practice, and I'm not sure that it would improve performance, since the array of downloads could easily grow into the thousands for a given user. However, I'm not sure how to create another table structure that would be more efficient than my all_downloads table at getting the downloads for a given user.

Any and all help is much appreciated!

====================================

Followup questions to Bill Karwin's answer:

  • timestamp is unfortunately going to be a unix timestamp stored in an int(11), rather than a datetime (to conform to Drupal standards). I assume that doesn't really change anything from an optimization standpoint?

  • node_id/license_id/user_id (your idea for a clustered primary key) is not guaranteed to be unique, because users are allowed to download the same node under the same license as many times as they want. This was my primary reason for having a unique download_id for each row... is there a special reason that having a download_id would hurt performance? Or would it be acceptable to make the primary key a cluster of download_id/node_id/license_id/user_id? Or will having the download_id as the first part of the compound key throw off its usefulness?

  • Do you think it still makes sense to have a downloads_counted table, or would that be considered redundant? My thinking is that it would still help performance, since download counts (downloads total, this week, this month, etc.) are going to be showing up very frequently on the site, and the downloads_counted table would have one or two orders of magnitude fewer rows than the all_downloads table.

My idea for the downloads_counted table:

CREATE TABLE downloads_counted (   
 node_id          INT UNSIGNED NOT NULL,   
 license_id       INT UNSIGNED NOT NULL, 
 downloads_total  INT UNSIGNED NOT NULL,  
 downloads_month  INT UNSIGNED NOT NULL,   
 downloads_week   INT UNSIGNED NOT NULL,     
 downloads_day    INT UNSIGNED NOT NULL,  
 PRIMARY KEY (node_id, license_id), 
 KEY (node_id)
) ENGINE=InnoDB;

The secondary key on node_id is for getting all downloads for all licenses for a given node_id... is this key redundant, though, if node_id is already the first part of the compound primary key?

Jordan Magnuson

1 Answer

Here's how I would design the table:

CREATE TABLE all_downloads (
  node_id    INT UNSIGNED NOT NULL,
  license_id INT UNSIGNED NOT NULL,
  user_id    INT UNSIGNED NOT NULL,
  timestamp  DATETIME NOT NULL,
  price      NUMERIC (9,2),
  PRIMARY KEY (node_id,license_id,user_id),
  KEY (price)
) ENGINE=InnoDB;

Notice I omitted the download_id.

Now you can run the queries you need to:

  • Get the number of downloads for a given node id and license id over a given time period (how many times has node 5 been downloaded in the last month for 'commercial use'?).

    SELECT COUNT(*) FROM all_downloads WHERE (node_id,license_id) = (123,456) 
    AND timestamp > NOW() - INTERVAL 30 DAY;
    

    This should make good use of the clustered primary index, reducing the set of rows examined until the timestamp comparison only applies to a small subset.

  • Get the total number of downloads for a given node id and license id.

    SELECT COUNT(*) FROM all_downloads WHERE (node_id,license_id) = (123,456);
    

    Like the above, this makes use of the clustered primary index. Counting is accomplished by an index scan.

  • Get the number of downloads for a given node_id regardless of license (all downloads for 'commercial use' and 'personal use' combined).

    SELECT COUNT(*) FROM all_downloads WHERE (node_id) = (123);
    

    Ditto.

  • Get the node ids (and corresponding license ids) that have been downloaded by a given user and that meet a given price criterion (e.g. price = 0, or price > 0).

    SELECT node_id, license_id FROM all_downloads WHERE price = 0 AND user_id = 789;
    

    This reduces the rows examined by using the secondary index on price. Then you take advantage of the fact that secondary indexes in InnoDB implicitly contain the columns of the primary key, so you don't even need to read the base data. This is called a covering index or an index-only query.
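One way to check these claims against real data is to prefix a query with EXPLAIN and look at the key and Extra columns of the output. The following is only a sketch using the same placeholder ids as above; the plan MySQL actually picks depends on the server version and on how the data is distributed.

```sql
-- If the clustered primary key is used, the "key" column should read PRIMARY.
EXPLAIN SELECT COUNT(*) FROM all_downloads
WHERE (node_id, license_id) = (123, 456)
  AND timestamp > NOW() - INTERVAL 30 DAY;

-- "Using index" in the Extra column indicates the query was answered
-- from the index alone, without reading the base rows.
EXPLAIN SELECT node_id, license_id FROM all_downloads
WHERE price = 0 AND user_id = 789;
```

If the Extra column does not show "Using index" for the second query, the index is not covering it and the base rows are being read.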

As for your other questions:


timestamp ... doesn't really change anything from an optimization standpoint?

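Storing a UNIX timestamp in an INT doesn't change much for optimization: an integer column can be indexed and range-compared just as well. I prefer DATETIME mainly because it is readable in query results and works directly with MySQL's date functions and INTERVAL arithmetic. Note that DATETIME does not store time zone information, and neither does an INT. You can always convert a DATETIME to a UNIX timestamp integer in a query result, using the UNIX_TIMESTAMP() function.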
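If the column does end up as an INT holding a UNIX timestamp, the time-window query can compare against an integer bound instead. A minimal sketch, assuming the column keeps the name timestamp and stores seconds since the epoch:

```sql
-- Assumes timestamp is an INT holding seconds since the UNIX epoch.
-- UNIX_TIMESTAMP() turns the DATETIME bound into an integer once,
-- so the comparison stays a plain range condition on the column.
SELECT COUNT(*) FROM all_downloads
WHERE (node_id, license_id) = (123, 456)
  AND timestamp > UNIX_TIMESTAMP(NOW() - INTERVAL 30 DAY);
```

Going the other way, FROM_UNIXTIME(timestamp) converts the stored integer back into a DATETIME for display.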

would it be acceptable to make the primary key a cluster of download_id/node_id/license_id/user_id? Or will having the download_id as the first part of the compound key throw off its usefulness?

The benefit of a clustered key is that the rows are stored in order of the index. So if you query based on node_id frequently, there's a performance advantage to putting that first in the compound clustered index. I.e. if you are interested in the set of rows for a given node_id, it's a benefit that they're stored together because you defined the clustered index that way.
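If every download really does need its own row, one possible compromise is to keep node_id first in the clustered key and append an auto-increment download_id as the last column, purely to make the key unique. This is a sketch under that assumption, not a tested recommendation for this workload. InnoDB requires an AUTO_INCREMENT column to be the first column of some index, hence the extra KEY (download_id).

```sql
CREATE TABLE all_downloads (
  download_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  node_id     INT UNSIGNED NOT NULL,
  license_id  INT UNSIGNED NOT NULL,
  user_id     INT UNSIGNED NOT NULL,
  timestamp   DATETIME NOT NULL,
  price       NUMERIC(9,2),
  -- node_id stays first so rows for one node are stored together;
  -- download_id comes last and only serves to make the key unique.
  PRIMARY KEY (node_id, license_id, user_id, download_id),
  -- Satisfies InnoDB's rule that an AUTO_INCREMENT column leads an index.
  KEY (download_id),
  KEY (price)
) ENGINE=InnoDB;
```

Putting download_id first instead would cluster the rows by insertion order, which gives up the locality benefit for node_id lookups.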

Do you think it still makes sense to have a downloads_counted table, or would that be considered redundant?

Sure, storing aggregate results in a table is a common way to reduce the work of counting up frequently-needed totals so often. But do so judiciously, because it takes some work to keep these totals in sync with the real data. The benefit is greater if you need to read the pre-calculated totals frequently, and multiple times for each time they are updated. Make sure you treat the aggregated totals as less authoritative than the real download data, and have a plan for re-generating the totals when they get out of sync.

Some people also put these aggregates into memcached keys instead of in a table, for even faster lookups. If the volatile data in memcached is lost for some reason, you can re-populate it from the download data.
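A plan for re-generating the totals could be as simple as one statement that rebuilds every row of downloads_counted from all_downloads. The sketch below assumes the downloads_counted definition from the question and a DATETIME timestamp column; the 30/7/1 day windows are illustrative choices, and on tens of millions of rows this is a full scan best run off-peak.

```sql
-- Rebuild every counter from the authoritative all_downloads table.
-- In MySQL a comparison evaluates to 1 or 0, so SUM() over it counts
-- the rows that fall inside each time window.
REPLACE INTO downloads_counted
  (node_id, license_id, downloads_total,
   downloads_month, downloads_week, downloads_day)
SELECT node_id, license_id,
       COUNT(*),
       SUM(timestamp > NOW() - INTERVAL 30 DAY),
       SUM(timestamp > NOW() - INTERVAL 7 DAY),
       SUM(timestamp > NOW() - INTERVAL 1 DAY)
FROM all_downloads
GROUP BY node_id, license_id;
```

REPLACE overwrites any existing row with the same (node_id, license_id) primary key, so the statement can be re-run whenever the totals drift.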

 PRIMARY KEY (node_id, license_id), 
 KEY (node_id)
) ENGINE=InnoDB;

is this key redundant, though, if node_id is already the first part of the compound primary key?

Yes. MySQL allows you to create redundant indexes, and this is an example of a redundant index. Any query that could use the secondary key on node_id could just as easily use the primary key. In fact, in this case the optimizer will never use the secondary key, because it will prefer the clustered index of the primary key.

You can use pt-duplicate-key-checker to analyze a database for redundant indexes.
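Dropping the redundant index by hand is a one-line change. This sketch assumes the index was created unnamed, in which case MySQL normally names it after its first column; the SHOW INDEX output confirms the actual name.

```sql
-- List the indexes first to confirm the name of the secondary key.
SHOW INDEX FROM downloads_counted;

-- An unnamed KEY (node_id) is normally given the name node_id.
ALTER TABLE downloads_counted DROP INDEX node_id;
```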

Bill Karwin
  • Also not a good practice to assume there's a performance problem without measuring performance. – Mike Sherrill 'Cat Recall' Nov 09 '11 at 02:11
  • Thank you for your incredibly helpful and enlightening response Bill! I feel like I've learned more about sql from your reply than I have from all my recent googling. Anyway, I've added a couple of quick followup questions to my original post, if you have the time and would be so kind to address them. – Jordan Magnuson Nov 09 '11 at 11:50
  • @Catcall I agree. My goal with tackling this downloads table is not to assume that performance is a problem, but rather simply to create the best design I can from the start in a situation that seems likely to be a performance concern... just by looking around I've found numerous cases where people are having performance trouble querying tables with tens of millions of rows. Anyway, my sql knowledge is poor, and I just want to lay good groundwork. My idea of jumping to bad practices before measuring performance was of course a bad one, and I appreciate your calling me out there. – Jordan Magnuson Nov 09 '11 at 11:59
  • @JordanMagnuson: The number one performance problem is bad table design. This table is simple enough that it would be hard to go wrong, and you can rely on Bill Karwin's design--it's the best you'll see at this stage. Later, partitioning might be necessary. But later isn't now. – Mike Sherrill 'Cat Recall' Nov 09 '11 at 13:41
  • Hey Bill, I totally understand if you don't have the interest or desire to follow up any further on this, but I would really appreciate it if you could address my followup questions, even if very briefly. – Jordan Magnuson Nov 10 '11 at 13:34
  • Especially, my 2nd question: node_id/license_id/user_id (your idea for a clustered primary key) is not guaranteed to be unique, because users are allowed to download the same node under the same license as many times as they want. This was my primary reason for having a unique download_id for each row... is there a special reason that having a download_id would hurt performance? Or would it be acceptable to make the primary key a cluster of download_id/node_id/license_id/user_id? Or will having the download_id as the first part of the compound key throw off its usefulness? – Jordan Magnuson Nov 10 '11 at 13:35
  • @JordanMagnuson: In general, without a natural key you can't tell whether two identical rows refer to two downloads or to one download mistakenly entered twice. That doesn't change if you just hang a download id number on the table. – Mike Sherrill 'Cat Recall' Nov 12 '11 at 12:21
  • BillKarwin: Thank you again for your very helpful answers! @Catcall: Thanks for your input! After thinking about this issue for a while, I think I'm going to get rid of download_id, and not track redundant downloads... since for the most part, all I care about with this table is keeping track of who has downloaded what... rather than how many times. So (node_id/license_id/user_id) will always be unique. The timestamp will be the time of the last download, which can then be used to prevent abuse. I could always keep track of multiple download counts in another table if needed. – Jordan Magnuson Nov 12 '11 at 13:20
  • The disadvantage of this method is that I will need to check if a given (node_id/license_id/user_id) exists before logging each download, but since that's the primary key, I'm thinking that will be an acceptable performance hit. Thanks again! – Jordan Magnuson Nov 12 '11 at 13:22
  • @JordanMagnuson: Best practice is to just do the insert and trap the error--one round trip to the database. Checking and then inserting requires two round trips, and you still have to write code for the case where the primary key already exists. – Mike Sherrill 'Cat Recall' Nov 12 '11 at 13:48
  • You can also use `INSERT... ON DUPLICATE KEY UPDATE download_count=download_count+1` – Bill Karwin Nov 12 '11 at 16:39
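Spelled out, the single-statement approach from the last comment might look like the sketch below. The download_count column is hypothetical: it appears in the comment but not in the table design above, so it is added here first. The literal values are placeholders.

```sql
-- Hypothetical counter column; not part of the original table design.
ALTER TABLE all_downloads
  ADD COLUMN download_count INT UNSIGNED NOT NULL DEFAULT 1;

-- First download inserts a new row. A repeat download hits the
-- (node_id, license_id, user_id) primary key and updates it instead.
INSERT INTO all_downloads
  (node_id, license_id, user_id, timestamp, price, download_count)
VALUES (5, 2, 789, NOW(), 25.00, 1)
ON DUPLICATE KEY UPDATE
  download_count = download_count + 1,
  timestamp = NOW();
```

This keeps one row per user, node and license, with timestamp holding the time of the most recent download, in a single round trip to the database.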