
TL;DR: I have a query on 2 huge tables. There are no indexes. It is slow. Therefore, I build indexes. It is slower. Why does this make sense? What is the correct way to optimize it?

The background:

I have 2 tables

  • person, a table containing information about people (id, birthdate)
  • works_in, a 0-N relation between person and a department; works_in contains id, person_id, department_id.

They are InnoDB tables, and it is sadly not an option to switch to MyISAM as data integrity is a requirement.

Those 2 tables are huge and don't have any indexes except a PRIMARY KEY on their respective id columns.

I'm trying to get the birth year of the youngest person in each department, and here is the query I've come up with:

SELECT MAX(YEAR(person.birthdate)) as max_year, works_in.department_id as department
    FROM person
    INNER JOIN works_in
        ON works_in.person_id = person.id
    WHERE person.birthdate IS NOT NULL
    GROUP BY works_in.department_id

The query works, but I'm dissatisfied with its performance: it takes ~17s to run. This is expected, as the tables are huge, there are no indexes, and the data needs to be read from disk.

EXPLAIN for this query gives

| id | select_type | table   | type   | possible_keys | key     | key_len | ref                      | rows     | Extra                           | 
|----|-------------|---------|--------|---------------|---------|---------|--------------------------|----------|---------------------------------| 
| 1  | SIMPLE      | works_in| ALL    | NULL          | NULL    | NULL    | NULL                     | 22496409 | Using temporary; Using filesort | 
| 1  | SIMPLE      | person  | eq_ref | PRIMARY       | PRIMARY | 4       | dbtest.works_in.person_id| 1        | Using where                     | 

I built a bunch of indexes on the 2 tables:

/* For works_in */
CREATE INDEX person_id ON works_in(person_id);
CREATE INDEX department_id ON works_in(department_id);
CREATE INDEX department_id_person ON works_in(department_id, person_id);
CREATE INDEX person_department_id ON works_in(person_id, department_id);
/* For person */
CREATE INDEX birthdate ON person(birthdate);

EXPLAIN shows an improvement, or at least that's how I understand it, since the query now uses an index and scans fewer rows.

| id | select_type | table   | type  | possible_keys                                    | key                  | key_len | ref              | rows   | Extra                                                 | 
|----|-------------|---------|-------|--------------------------------------------------|----------------------|---------|------------------|--------|-------------------------------------------------------| 
| 1  | SIMPLE      | person  | range | PRIMARY,birthdate                                | birthdate            | 4       | NULL             | 267818 | Using where; Using index; Using temporary; Using filesort | 
| 1  | SIMPLE      | works_in| ref   | person,department_id_person,person_department_id | person_department_id | 4       | dbtest.person.id | 3      | Using index                                           | 

However, the execution time of the query has doubled (from ~17s to ~35s).

Why does this make sense, and what is the correct way to optimize this?

EDIT

Using Gordon Linoff's answer (the first one), the execution time is ~9s (half of the initial time). Choosing good indexes does seem to help, but the execution time is still pretty high. Any other ideas on how to improve on this?

More information concerning the dataset:

  • There are about 5'000'000 records in the person table.
  • Of which only 130'000 have a valid (not NULL) birthdate
  • I indeed have a department table, which contains about 3'000'000 records (they are actually projects, not departments)
Winks

2 Answers


For this query:

SELECT MAX(YEAR(p.birthdate)) as max_year, wi.department_id as department
FROM person p INNER JOIN
     works_in wi
     ON wi.person_id = p.id
WHERE p.birthdate IS NOT NULL
GROUP BY wi.department_id;

The best indexes are: person(birthdate, id) and works_in(person_id, department_id). These are covering indexes for the query, so they save the extra cost of reading the underlying data pages.
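As a concrete sketch (the index names here are arbitrary), these could be created with:

```
/* Covers the WHERE on birthdate and the join column id */
CREATE INDEX birthdate_id ON person(birthdate, id);
/* Covers the join on person_id and the GROUP BY on department_id */
CREATE INDEX person_department ON works_in(person_id, department_id);
```

Note that in InnoDB every secondary index implicitly contains the primary key, so `person(birthdate)` alone already covers `id`; listing it explicitly just makes the intent visible.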

By the way, unless a lot of persons have NULL birthdates (i.e. there are departments where everyone has a NULL birthdate), the query is basically equivalent to:

SELECT MAX(YEAR(p.birthdate)) as max_year, wi.department_id as department
FROM person p INNER JOIN
     works_in wi
     ON wi.person_id = p.id
GROUP BY wi.department_id;

For this version, the best indexes are person(id, birthdate) and works_in(person_id, department_id).

EDIT:

I cannot think of an easy way to solve the problem. One solution is more powerful hardware.

If you really need this information quickly, then additional work is needed.

One approach is to add a maximum birth date column to the department table and maintain it with triggers. For works_in, you need triggers for INSERT, UPDATE, and DELETE. For person, only UPDATE (presumably inserts and deletes would be handled through works_in). This removes the final GROUP BY at query time, which should be a big saving.
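A minimal sketch of one such trigger, assuming a hypothetical max_birthdate column added to department (only the INSERT case on works_in is shown; UPDATE and DELETE need similar, and more careful, handling):

```
ALTER TABLE department ADD COLUMN max_birthdate DATE NULL;

DELIMITER //
CREATE TRIGGER works_in_after_insert AFTER INSERT ON works_in
FOR EACH ROW
BEGIN
  /* Raise the cached maximum if the newly assigned person is younger */
  UPDATE department d
  JOIN person p ON p.id = NEW.person_id
  SET d.max_birthdate = GREATEST(COALESCE(d.max_birthdate, p.birthdate),
                                 p.birthdate)
  WHERE d.id = NEW.department_id
    AND p.birthdate IS NOT NULL;
END//
DELIMITER ;
```

The DELETE case is harder: removing the youngest member of a department means recomputing that department's maximum, so whether this bookkeeping pays off depends on the write/read ratio.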

A simpler approach is to add a maximum birth date just to works_in. However, you will still need a final aggregation, and that might be expensive.

Gordon Linoff
  • Using the first indexes you proposed (`person(birthdate, id)` and `works_in(person_id, department_id)`), the query runs in 9s, which is roughly 50% of the initial running time. At least it's not an increase in running time anymore! As there are a lot of NULLs, the second set of query/indexes runs in ~30s. Thanks for your input. Any idea on how to improve on this result? – Winks May 30 '15 at 12:18
  • Do you have any idea how many departments there are and how many records have valid birthdates? Also, do you have a `departments` table? – Gordon Linoff May 30 '15 at 14:31
  • Yes, I have this information. I updated the question with it at the end (~5·10^6 records in `person`, of which ~10^5 are valid, and a `department` table containing ~3·10^6 records) – Winks May 30 '15 at 14:40

Indexing improves performance for MyISAM tables. It degrades performance on InnoDB tables.

Add indexes on the columns you expect to query the most. The more complex the data relationships grow, especially relationships of a table with itself (such as self-joins), the worse each query's performance gets.

With an index, the engine first uses the index to find matching values, which is fast. Then it uses those matches to look up the actual rows in the table. If the index doesn't narrow down the number of rows enough, it can be faster to just scan all the rows in the table.
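You can test this trade-off on the query from the question by asking the optimizer to plan it both with and without the `birthdate` index and comparing the two EXPLAIN outputs (a diagnostic sketch; `FORCE INDEX` and `IGNORE INDEX` are standard MySQL index hints):

```
/* Force use of the secondary index on birthdate... */
EXPLAIN SELECT MAX(YEAR(p.birthdate)) AS max_year, wi.department_id
FROM person p FORCE INDEX (birthdate)
INNER JOIN works_in wi ON wi.person_id = p.id
WHERE p.birthdate IS NOT NULL
GROUP BY wi.department_id;

/* ...and forbid it, to compare estimated rows and join order */
EXPLAIN SELECT MAX(YEAR(p.birthdate)) AS max_year, wi.department_id
FROM person p IGNORE INDEX (birthdate)
INNER JOIN works_in wi ON wi.person_id = p.id
WHERE p.birthdate IS NOT NULL
GROUP BY wi.department_id;
```

If the `rows` estimate of the forced plan is not much smaller than the full-scan plan, the index is not selective enough to pay for the extra lookups.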

When to add an index on a SQL table field (MySQL)?

When to use MyISAM and InnoDB?

https://dba.stackexchange.com/questions/1/what-are-the-main-differences-between-innodb-and-myisam

OpenSorceress
  • Thanks for pointing the InnoDB/MyISAM issue, but sadly InnoDB is a requirement. I updated my question to make it clearer. – Winks May 30 '15 at 11:08
  • That's often the case. My point is that *indexing itself isn't a panacea.* – OpenSorceress May 30 '15 at 11:13
  • The InnoDB/MyISAM issue is not universally true. And it is backwards in some cases. It is not relevant to Gordon's suggestions. – Rick James Jun 05 '15 at 04:36
  • InnoDB's `PRIMARY KEY` does _not_ need to reach elsewhere to get the data. This makes it faster. A secondary index in InnoDB is likely to be slower because it has to drill down 2 BTrees (versus 1 for MyISAM). A secondary index in InnoDB is _sometimes_ faster because it implicitly includes the `PRIMARY KEY`, thereby allowing for "covering" in extra cases. – Rick James Jun 05 '15 at 04:40
  • I found this question and read "it degrades performance in innodb" as I was looking to optimize a lookup on a huge table with 100,000 rows. Using MySQL 5.7 and innodb as the engine, adding an index to a lookup column increased my lookup speed per query by as much as 1500%. Wanted to leave this anecdote here for anyone who might be persuaded to not do it after reading this. – Allen Gingrich Aug 25 '21 at 21:02