
I'd like to create an "activity log" for my PHP site, using one InnoDB MySQL table. Every visit to the website will insert a new record into the table, which will record...

  • Their IP address
  • The ID of the account they are signed into (or null, if not signed in)
  • The request they made to the server
  • The date and time the request was received
  • The user-agent sent, but only if a PHP script determines it to be a bot (otherwise null)

At the same time, I can use the table to...

  • Determine the number of hits each page receives per year/month/day/etc.
  • Determine the number of unique visitors per year/month/day/etc.
  • If practical, retrieve the above information on-the-fly with a PHP script, filtering out bot requests if I'd like to

I have some questions (and some reasoning) for the following table I have in mind:

CREATE TABLE `activity` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `ip` varchar(50) COLLATE utf8_unicode_ci NOT NULL,
  `account` int(11) unsigned DEFAULT NULL,
  `request` varchar(100) COLLATE utf8_unicode_ci NOT NULL,
  `time` time NOT NULL,
  `year` year(4) NOT NULL,
  `month` tinyint(2) unsigned NOT NULL,
  `day` tinyint(2) unsigned NOT NULL,
  `bot` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

  1. Is this style of logging practical? By "practical" I mean that inserts and selects can be performed in milliseconds. I realize plenty of records can be generated this way, but I'm not entirely certain whether there's a better way to accomplish everything I want to do.

  2. Moreover, would it be practical to SELECT the number of hits "today" and place it at the bottom of each page served? (A rough sketch of the query I have in mind follows these questions.) I'm relatively new to working with larger databases, and I'm still learning which queries would be quick and which would be painful.

  3. Should I keep the primary key? I could use it to go back and edit any row I want on a whim (which I can't really see myself doing for any important reason), but will it significantly slow down my INSERTs? Are there any advantages? On the same note, I shouldn't add any more indexes for the same reason -- right? Not even a foreign key for the account column?

  4. Is my alternative way of capturing the current date and time practical? I first started this table with a single DATETIME column, but read somewhere that tables like this might benefit from splitting the date information across columns. If I want to count the number of hits "today," for example, I could limit my results to...

    WHERE year="2012" AND month="02" AND day="16"
    

    ...in contrast to...

    WHERE date > "2012-02-15 23:59:59"
    
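For reference, the "hits today" / "unique visitors today" query I have in mind would be roughly the following (just a sketch against the proposed columns; bot IS NULL is how I'd filter out bot requests):

SELECT COUNT(*) AS hits,
       COUNT(DISTINCT `ip`) AS unique_visitors
  FROM `activity`
 WHERE `year` = 2012 AND `month` = 2 AND `day` = 16
   AND `bot` IS NULL;
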

Thanks in advance!

  • You might be interested in http://stackoverflow.com/questions/6427786/ip-address-storing-in-mysql-database – Luc M Jun 01 '12 at 01:35
  • Your question seems too open-ended/discussion-based and may be more appropriate for another forum http://stackoverflow.com/faq#dontask – Andy Jones Jun 01 '12 at 01:36
  • The questions could allow discussion, but I feel I've clearly asked a concrete, answerable question: whether my proposed table is practical (which I've defined as being able to make queries in milliseconds). – Litty Jun 01 '12 at 01:44
  • Thanks Luc! I knew there was a better way to store IPs; I just couldn't find anything on it. – Litty Jun 01 '12 at 01:44

2 Answers


Your log table seems reasonable so far. But instead of the columns time, year, month, and day, I would definitely use a single TIMESTAMP column, which is just four bytes. Then add an index on that column.

If your queries use range searches, you should get good response times.

WHERE created >= "2011-01-01 00:00:00"
    AND created < "2012-01-01 00:00:00"
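
To make that concrete, here is a sketch of what the table might look like with a single indexed TIMESTAMP column (the created column matches the range example above; the index name and the DEFAULT CURRENT_TIMESTAMP are just illustrative choices):

CREATE TABLE `activity` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `ip` varchar(50) COLLATE utf8_unicode_ci NOT NULL,
  `account` int(11) unsigned DEFAULT NULL,
  `request` varchar(100) COLLATE utf8_unicode_ci NOT NULL,
  `created` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `bot` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_created` (`created`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;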
mazatwork
  • When the table is scanned, wouldn't it take some work to verify if _each row_ falls under that range? Would this be less work than zapping out rows in bulk by year, month, and day? – Litty Jun 01 '12 at 01:47
  • If you need to query something like MONTH(created) = 1, then you are right. But in practice I think you will often search a specific range, which is first limited by year, then maybe by month and day. If there is an index on the column, the database loves ranges! – mazatwork Jun 01 '12 at 01:51
  • That makes sense. How about the index, though? The table will be receiving a lot of INSERTs. Will the index slow them down and pose a serious problem? – Litty Jun 01 '12 at 01:53
  • Since TIMESTAMP is four bytes and your primary key is an additional four bytes, you should calculate with a total of eight extra bytes per record. I think this is a good ratio compared to the size of your proposed table structure. – mazatwork Jun 01 '12 at 01:57
  • I agree. I'll go ahead and switch to TIMESTAMP. Thanks! – Litty Jun 01 '12 at 02:02
  • For MySQL, never apply a function to an indexed column. Instead of YEAR(created) = 2011, use the example in the answer. Otherwise MySQL will not use the index (at the time of writing). – mazatwork Jun 01 '12 at 02:06
  • That's helpful to know! I've had only a vague conceptual knowledge of bigger databases, so it's good to get help wherever I can. – Litty Jun 01 '12 at 02:14

This style of logging is practical, but you could benefit from using partitioning (and subpartitioning): http://dev.mysql.com/doc/refman/5.1/en/partitioning.html

Since you're storing an activity log, you'll likely have a very large data set over time. Partitioning could be especially useful since you want to look at specific months and years.

For example, if the data type for the date column is DATE or DATETIME, you could do something like:

PARTITION BY RANGE (MONTH(the_date))
(PARTITION p1 VALUES LESS THAN (2),
 PARTITION p2 VALUES LESS THAN (3),
 PARTITION p3 VALUES LESS THAN (4),
...[and so on up to PARTITION p12 VALUES LESS THAN (13), since MONTH() returns 1-12]);

This will give you partitions for each month of data. Once you test with different partitions, try running your queries with "explain partitions select * from..." and you'll be able to see how the query is executed and which partitions are scanned. The best approach to partitioning likely needs to be tailored around your most common queries. For example, are you mainly looking at the last 30 days of data? Monthly snapshots? Custom date ranges? Those could all have an impact on how the partitions are structured.
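
For instance, assuming the date column is named the_date as in the sketch above, a check along these lines shows how the query is executed and which partitions the optimizer touches:

EXPLAIN PARTITIONS
SELECT COUNT(*)
  FROM activity
 WHERE the_date >= '2012-02-01'
   AND the_date < '2012-03-01';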

Taking it a step further, you could have subpartitions in each partition. For this, a hash partition could be used: http://dev.mysql.com/doc/refman/5.1/en/partitioning-hash.html
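
As a rough sketch only (switching the top-level ranges to years here; the_date, the year boundaries, and the subpartition count are placeholders, and note the primary-key caveat below), range partitions with hash subpartitions might look like:

ALTER TABLE activity
PARTITION BY RANGE (YEAR(the_date))
SUBPARTITION BY HASH (TO_DAYS(the_date))
SUBPARTITIONS 4
(PARTITION p2011 VALUES LESS THAN (2012),
 PARTITION p2012 VALUES LESS THAN (2013),
 PARTITION pmax VALUES LESS THAN MAXVALUE);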

Your primary key question will be impacted by partitioning as well: the primary key (and any other unique key) on the table must include every column used in the partitioning expression.
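
For example, if the_date appears in the partitioning expression, the primary key on the proposed table would have to be widened to include it, along these lines:

ALTER TABLE activity
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (id, the_date);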

cszy
  • Partitioning looks extremely promising; why have I never heard of this before? Thanks for the input! – Litty Jun 01 '12 at 04:55