A new project we are working on requires a lot of data analysis, but we are finding it to be very slow. We are looking for ways to change our approach, through software and/or hardware.
We are currently running on an Amazon EC2 instance (Linux):
High-CPU Extra Large Instance
7 GB of memory
20 EC2 Compute Units (8 virtual cores with 2.5 EC2 Compute Units each)
1690 GB of instance storage
64-bit platform
I/O Performance: High
API name: c1.xlarge
processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU E5506 @ 2.13GHz
stepping : 5
cpu MHz : 2133.408
cache size : 4096 KB
MemTotal: 7347752 kB
MemFree: 728860 kB
Buffers: 40196 kB
Cached: 2833572 kB
SwapCached: 0 kB
Active: 5693656 kB
Inactive: 456904 kB
SwapTotal: 0 kB
SwapFree: 0 kB
One part of the DB consists of articles, entities, and a link table between them, for example:
mysql> DESCRIBE articles_entities;
+------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------+--------------+------+-----+---------+-------+
| id | char(36) | NO | PRI | NULL | |
| article_id | char(36) | NO | MUL | NULL | |
| entity_id | char(36) | NO | MUL | NULL | |
| created | datetime | YES | | NULL | |
| modified | datetime | YES | | NULL | |
| relevance | decimal(5,4) | YES | MUL | NULL | |
| analysers | text | YES | | NULL | |
| anchor | varchar(255) | NO | | NULL | |
+------------+--------------+------+-----+---------+-------+
8 rows in set (0.00 sec)
As you can see from the count below, we have a lot of associations, growing at a rate of 100,000+ a day:
mysql> SELECT count(*) FROM articles_entities;
+----------+
| count(*) |
+----------+
| 2829138 |
+----------+
1 row in set (0.00 sec)
A simple query like the one below takes too long (about 12 seconds):
mysql> SELECT count(*) FROM articles_entities WHERE relevance <= .4 AND relevance > 0;
+----------+
| count(*) |
+----------+
| 357190 |
+----------+
1 row in set (11.95 sec)
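In case it helps with diagnosis, the statements below are a sketch of what could be run to show how MySQL executes this query and how the table is set up. No output is included here. The storage engine is not stated above, so checking both buffer variables (InnoDB and MyISAM) is an assumption rather than a known detail of this setup.

```sql
-- Show the execution plan: which index (if any) is chosen for the
-- range condition on relevance, and the estimated number of rows examined.
EXPLAIN
SELECT COUNT(*)
FROM articles_entities
WHERE relevance <= 0.4 AND relevance > 0;

-- List the indexes behind the PRI/MUL keys in the DESCRIBE output,
-- including their cardinality estimates.
SHOW INDEX FROM articles_entities;

-- Report the storage engine, data size and index size of the table.
SHOW TABLE STATUS LIKE 'articles_entities';

-- Check how much memory MySQL may use for caching data and indexes.
-- Only one of these matters, depending on the engine (assumption: InnoDB or MyISAM).
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SHOW VARIABLES LIKE 'key_buffer_size';
```

Comparing the data and index sizes from SHOW TABLE STATUS with the buffer settings would indicate whether the working set fits in the 7 GB of memory on this instance.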
What should we be considering to improve our lookup times? Different DB storage? Different hardware?