I need to optimize how I analyze a rather large data set, and I'm unsure of what the next steps are. I have done a fair bit of MySQL configuration tuning.
I have this InnoDB table:
+----------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+----------------+--------------+------+-----+---------+----------------+
| id | int(250) | NO | PRI | NULL | auto_increment |
| memory | int(15) | YES | MUL | NULL | |
| q | varchar(250) | YES | MUL | NULL | |
| created | datetime | YES | | NULL | |
| modified | datetime | YES | | NULL | |
| dt | datetime | YES | MUL | NULL | |
| site_id | int(250) | NO | MUL | NULL | |
| execution_time | int(11) | YES | MUL | NULL | |
+----------------+--------------+------+-----+---------+----------------+
Here is a sample of 10 rows:
+-----------+----------+-----------------+---------------------+---------------------+---------------------+---------+----------------+
| id | memory | q | created | modified | dt | site_id | execution_time |
+-----------+----------+-----------------+---------------------+---------------------+---------------------+---------+----------------+
| 266864867 | 38011080 | node/16432/edit | 2011-12-05 23:22:23 | 2011-12-05 23:22:23 | 2011-12-06 00:04:44 | 890 | 1534 |
| 266864868 | 46090184 | node/16432 | 2011-12-05 23:22:23 | 2011-12-05 23:22:23 | 2011-12-06 00:04:46 | 890 | 840 |
| 266864869 | 50329248 | node/16432/edit | 2011-12-05 23:22:23 | 2011-12-05 23:22:23 | 2011-12-06 00:05:16 | 890 | 2500 |
| 266864870 | 38011272 | node/16432/edit | 2011-12-05 23:22:23 | 2011-12-05 23:22:23 | 2011-12-06 00:07:01 | 890 | 1494 |
| 266864871 | 46087732 | node/16432 | 2011-12-05 23:22:23 | 2011-12-05 23:22:23 | 2011-12-06 00:07:03 | 890 | 850 |
| 266864872 | 30304428 | node/303 | 2011-12-05 23:22:23 | 2011-12-05 23:22:23 | 2011-12-06 00:07:12 | 890 | 113 |
| 266864873 | 50329412 | node/16432/edit | 2011-12-05 23:22:23 | 2011-12-05 23:22:23 | 2011-12-06 00:07:25 | 890 | 2465 |
| 266864874 | 28253112 | front_page | 2011-12-05 23:22:23 | 2011-12-05 23:22:23 | 2011-12-06 00:07:25 | 890 | 86 |
| 266864875 | 28256044 | front_page | 2011-12-05 23:22:23 | 2011-12-05 23:22:23 | 2011-12-06 00:08:32 | 890 | 81 |
| 266864876 | 38021072 | node/16432/edit | 2011-12-05 23:22:23 | 2011-12-05 23:22:23 | 2011-12-06 00:08:55 | 890 | 1458 |
+-----------+----------+-----------------+---------------------+---------------------+---------------------+---------+----------------+
Here are the table indexes:
+----------+------------+----------------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+----------+------------+----------------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+
| memories | 0 | PRIMARY | 1 | id | A | 8473766 | NULL | NULL | | BTREE | |
| memories | 1 | index_dt | 1 | dt | A | 1210538 | NULL | NULL | YES | BTREE | |
| memories | 1 | index_execution_time | 1 | execution_time | A | 2344 | NULL | NULL | YES | BTREE | |
| memories | 1 | index_memory | 1 | memory | A | 8473766 | NULL | NULL | YES | BTREE | |
| memories | 1 | index_site_id | 1 | site_id | A | 16 | NULL | NULL | | BTREE | |
| memories | 1 | index_q | 1 | q | A | 338950 | NULL | NULL | YES | BTREE | |
+----------+------------+----------------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+
It stores over a million records across many different sites (site_id); a given site might account for 20,000 rows. Each row is a performance metric for a single page request. The non-obvious fields: memory is how much memory the script used, q is the request path, and site_id is a foreign key to a Sites table.
I have two slow queries that I run on this data. The first gets the 25 most memory-hungry pages:
SELECT
    Memory.q, COUNT(*) AS count,
    AVG(Memory.memory) AS average_memory,
    MAX(Memory.memory) AS peak_memory,
    AVG(Memory.execution_time) AS average_execution_time,
    MAX(Memory.execution_time) AS peak_execution_time
FROM Memory
WHERE site_id = $some_site_id
GROUP BY Memory.q
ORDER BY average_memory DESC
LIMIT 25
The second query gets the 25 pages with the slowest average execution time for a given site:
SELECT
    Memory.q, COUNT(*) AS count,
    AVG(Memory.memory) AS average_memory,
    MAX(Memory.memory) AS peak_memory,
    AVG(Memory.execution_time) AS average_execution_time,
    MAX(Memory.execution_time) AS peak_execution_time
FROM Memory
WHERE site_id = $some_site_id
GROUP BY Memory.q
ORDER BY average_execution_time DESC
LIMIT 25
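One option I'm considering is a composite covering index, so both queries can filter on site_id, group on q, and read the two metric columns straight from the index without touching the clustered rows. A rough sketch (the index name and column order are just my guess at what would help):

```sql
-- Hypothetical covering index for the two reporting queries:
-- equality filter on site_id first, then the GROUP BY column,
-- then the aggregated columns so the index alone covers the query.
ALTER TABLE memories
  ADD INDEX idx_site_q_metrics (site_id, q, memory, execution_time);
```

If this works, the existing single-column index_site_id would likely become redundant and could be dropped.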
I recently converted the table from MyISAM to InnoDB so that these reads would not lock the table, which was causing operations that update the table to queue up and lag.
Beyond throwing more RAM at the problem (to increase the InnoDB buffer pool size), I want to see if there are other options. I've never worked with a NoSQL database, but from what I understand it won't be of much help here, because my analysis relies on aggregate functions.
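Another alternative I've thought about is pre-aggregating into a summary table on a schedule (e.g. from cron), so the reporting queries scan a handful of pre-computed rows per site instead of 20,000 raw ones. A rough sketch, with table and column names made up for illustration:

```sql
-- Hypothetical roll-up table, refreshed periodically.
-- AVG() returns a DECIMAL; storing it in an integer column
-- truncates the fraction, which is fine for this report.
CREATE TABLE memories_summary (
  site_id                INT NOT NULL,
  q                      VARCHAR(250) NOT NULL,
  hits                   INT NOT NULL,
  average_memory         BIGINT,
  peak_memory            INT,
  average_execution_time BIGINT,
  peak_execution_time    INT,
  PRIMARY KEY (site_id, q)
);

REPLACE INTO memories_summary
SELECT site_id, q, COUNT(*),
       AVG(memory), MAX(memory),
       AVG(execution_time), MAX(execution_time)
FROM memories
WHERE q IS NOT NULL
GROUP BY site_id, q;
```

The trade-off is that the numbers are only as fresh as the last refresh, but for a "top 25 pages" report that may be acceptable.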
The app is written in PHP, if it matters.
Any ideas for a better way to approach the storage and analysis of this data?
Update:
Profiling the query shows that nearly all of the time is spent in the "Copying to tmp table" step. I will research how to make that step faster.
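The first thing I plan to check is whether that implicit temporary table is spilling to disk. A sketch of the checks (the 64M figure is just an example value, not a recommendation):

```sql
-- Ratio of Created_tmp_disk_tables to Created_tmp_tables shows
-- how often implicit temp tables spill to disk.
SHOW GLOBAL STATUS LIKE 'Created_tmp%tables';

-- An in-memory temp table is capped by the SMALLER of these two.
SHOW VARIABLES LIKE 'tmp_table_size';
SHOW VARIABLES LIKE 'max_heap_table_size';

-- Example: raise both to 64M (applies to new connections).
SET GLOBAL tmp_table_size = 64 * 1024 * 1024;
SET GLOBAL max_heap_table_size = 64 * 1024 * 1024;
```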
+--------------------------------+----------+
| Status | Duration |
+--------------------------------+----------+
| starting | 0.000030 |
| checking query cache for query | 0.000065 |
| Opening tables | 0.000013 |
| System lock | 0.000004 |
| Table lock | 0.000014 |
| init | 0.000032 |
| optimizing | 0.000010 |
| statistics | 0.008119 |
| preparing | 0.000042 |
| Creating tmp table | 0.000317 |
| executing | 0.000005 |
| Copying to tmp table | 5.349280 |
| Sorting result | 0.006511 |
| Sending data | 0.000092 |
| end | 0.000005 |
| removing tmp table | 0.001510 |
| end | 0.000007 |
| query end | 0.000004 |
| freeing items | 0.001163 |
| logging slow query | 0.000006 |
| cleaning up | 0.000006 |
+--------------------------------+----------+
21 rows in set (0.01 sec)
mysql> show profile cpu for query 4;
+--------------------------------+----------+----------+------------+
| Status | Duration | CPU_user | CPU_system |
+--------------------------------+----------+----------+------------+
| starting | 0.000030 | 0.000000 | 0.000000 |
| checking query cache for query | 0.000065 | 0.000000 | 0.000000 |
| Opening tables | 0.000013 | 0.000000 | 0.000000 |
| System lock | 0.000004 | 0.000000 | 0.000000 |
| Table lock | 0.000014 | 0.000000 | 0.000000 |
| init | 0.000032 | 0.000000 | 0.000000 |
| optimizing | 0.000010 | 0.000000 | 0.000000 |
| statistics | 0.008119 | 0.001000 | 0.000000 |
| preparing | 0.000042 | 0.000000 | 0.000000 |
| Creating tmp table | 0.000317 | 0.000000 | 0.000000 |
| executing | 0.000005 | 0.000000 | 0.000000 |
| Copying to tmp table | 5.349280 | 0.687896 | 0.412937 |
| Sorting result | 0.006511 | 0.004999 | 0.001999 |
| Sending data | 0.000092 | 0.000000 | 0.000000 |
| end | 0.000005 | 0.000000 | 0.000000 |
| removing tmp table | 0.001510 | 0.000000 | 0.001000 |
| end | 0.000007 | 0.000000 | 0.000000 |
| query end | 0.000004 | 0.000000 | 0.000000 |
| freeing items | 0.001163 | 0.000000 | 0.001000 |
| logging slow query | 0.000006 | 0.000000 | 0.000000 |
| cleaning up | 0.000006 | 0.000000 | 0.000000 |
+--------------------------------+----------+----------+------------+