3

A new project we are working required a lot of data analysis but we are finding this to be VERY slow, we are looking for ways to change our approach with software and or hardware.

We are currently running on a amazon ec2 instance (linux):

High-CPU Extra Large Instance

7 GB of memory
20 EC2 Compute Units (8 virtual cores with 2.5 EC2 Compute Units each)
1690 GB of instance storage
64-bit platform
I/O Performance: High
API name: c1.xlarge


processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 26
model name      : Intel(R) Xeon(R) CPU           E5506  @ 2.13GHz
stepping        : 5
cpu MHz         : 2133.408
cache size      : 4096 KB

MemTotal:      7347752 kB
MemFree:        728860 kB
Buffers:         40196 kB
Cached:        2833572 kB
SwapCached:          0 kB
Active:        5693656 kB
Inactive:       456904 kB
SwapTotal:           0 kB
SwapFree:            0 kB

One part of the db is articles and entities and a link table for example:

mysql> DESCRIBE articles_entities;
+------------+--------------+------+-----+---------+-------+
| Field      | Type         | Null | Key | Default | Extra |
+------------+--------------+------+-----+---------+-------+
| id         | char(36)     | NO   | PRI | NULL    |       | 
| article_id | char(36)     | NO   | MUL | NULL    |       | 
| entity_id  | char(36)     | NO   | MUL | NULL    |       | 
| created    | datetime     | YES  |     | NULL    |       | 
| modified   | datetime     | YES  |     | NULL    |       | 
| relevance  | decimal(5,4) | YES  | MUL | NULL    |       | 
| analysers  | text         | YES  |     | NULL    |       | 
| anchor     | varchar(255) | NO   |     | NULL    |       | 
+------------+--------------+------+-----+---------+-------+
8 rows in set (0.00 sec)

As you can see from the table below we have a lot of assoications growing at a rate of 100,000+ a day

mysql> SELECT count(*) FROM articles_entities;
+----------+
| count(*) |
+----------+
|  2829138 | 
+----------+
1 row in set (0.00 sec)

A simple query like the one below is taking too much time (12 secs)

mysql> SELECT count(*) FROM articles_entities WHERE relevance <= .4 AND relevance > 0;
+----------+
| count(*) |
+----------+
|   357190 | 
+----------+
1 row in set (11.95 sec)

What should we be considering to improve our lookup times? Different DB storage? Different hardware.

Lizard
  • 43,732
  • 39
  • 106
  • 167
  • is your table properly indexed? – Anush Prem Jan 20 '11 at 12:06
  • Isn't that obvious from the table dump provided? – Lizard Jan 20 '11 at 12:12
  • MyISAM or InnoDB table, MyIsam is much faster.. – devasia2112 Jan 20 '11 at 12:17
  • 3
    Have you considered using something else than char for the ids? On a PostgreSQL setup that I did with inner joins between tables with +20 million records with +500 million records, using numeric ids made a significant difference. It was way faster. – John P Jan 20 '11 at 12:19
  • can you post a few other typical queries along with their explain plans and also show indexes from so we can see cardinalities. – Jon Black Jan 20 '11 at 12:26
  • @Fernando - perhaps myisam could make this faster ? but as it doesnt support clustered indexes i doubt it ! http://stackoverflow.com/questions/4419499/mysql-nosql-help-me-to-choose-the-right-one-on-a/4421601#4421601 – Jon Black Jan 20 '11 at 12:31
  • It is MyIsam, I will look into switching to int ids – Lizard Jan 20 '11 at 12:39
  • perhaps you need to look for a faster type table or move to oracle, since oracle has purchased the mysql.. correct me if I'm wrong. Anyway good luck on your decision. – devasia2112 Jan 20 '11 at 12:59
  • 2
    Could you please post an EXPLAIN for the slow SELECT you provided? – 0xCAFEBABE Jan 20 '11 at 13:10
  • 3
    Perhaps your first step should be to hire a database professional with experience in optimizing large systems. Clearly from your design (A char(36) PK?) and the question you have no such person. High performance large database design and tuning should not be done by amateurs. This is not something that can be fixed from questions on a board such as this. – HLGEM Jan 20 '11 at 14:22

3 Answers3

3

As mrorigo asked, please provide the SHOW CREATE TABLE articles_entities so we can see the actual indexes of your table.

As a note from MySQL documentation http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html

If the table has a multiple-column index, any leftmost prefix of the index can be used by the optimizer to find rows. 
For example, if you have a three-column index on (col1, col2, col3), you have indexed search capabilities on (col1), (col1, col2), and (col1, col2, col3).

MySQL cannot use an index if the columns do not form a leftmost prefix of the index

So if relevance is part of a multi-column index, but isn't the leftmost column of that index, then the index is not used for your query.

This is a common issue that is often overlooked.

YoGiN
  • 127
  • 4
2

Using char(36) for keys is not the fastest you can do with MySQL. Use INT-types for keys if possible. If you index CHAR columns, the indexes will be VERY large compared to an (BIG)INT index (if not 'properly' created)

However, if your column values are not numeric, you are stuck with CHAR columns (which ARE still faster than VARCHAR, but can create large indexes).

Please provide a SHOW CREATE TABLE of tables to see key/index parameters, and also as the previous answer said, an EXPLAIN for the queries in question could help provide a better answer.

PS. Use SHOW TABLE STATUS LIKE '{table_name}' to see index (and data) sizes of the table.

origo
  • 523
  • 3
  • 6
1

There are three things that matter when it comes to query performance:

Indexes. Memory. Everything else.

The first thing to do is check your indexes. Do an EXPLAIN on your queries to find out how MySQL is processing them.

If that looks sensible, the next thing would be to check memory. How big is your total database? Memory is cheap these days, and queries that run from memory will be much, much faster than queries that have to read from disk.

After you've explored those, if performance is still slow, then it might be time to consider other options.

Pixy Misa
  • 106
  • 3
  • Yup all of the above done, hence the question, can you offer any pointers? – Lizard Jan 20 '11 at 12:11
  • Even before discussing indexes, we need to know about disk i/o. For the query that took 12 sec, how many disk i/os did it take? What was the query strategy used by the DBMS? Was it a full table scan? From there we can go to index strategy. – Walter Mitty Jan 20 '11 at 14:35