
Suppose I have the following table in MySQL:

CREATE TABLE `events` (
  `pv_name` varchar(60) COLLATE utf8mb4_unicode_ci NOT NULL,
  `time_stamp` bigint(20) unsigned NOT NULL,
  `event_type` varchar(40) COLLATE utf8mb4_unicode_ci NOT NULL,
  `value` text CHARACTER SET utf8mb4 COLLATE utf8mb4_bin,
  `value_type` varchar(40) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `value_count` bigint(20) DEFAULT NULL,
  `alarm_status` varchar(40) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `alarm_severity` varchar(40) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  PRIMARY KEY (`pv_name`,`time_stamp`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=COMPRESSED;

Is there any way to improve the following query with indexes or otherwise?

SELECT DISTINCT events.pv_name
FROM events
WHERE events.time_stamp > t0_in AND events.time_stamp < t1_in
AND (events.value IS NULL OR events.alarm_severity = 'INVALID');

t0_in and t1_in are parameters passed to the stored procedure the query is defined in.
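For reference, the procedure wrapping the query looks roughly like this (the procedure name here is illustrative, not the actual one):

DELIMITER //
CREATE PROCEDURE find_invalid_pvs(IN t0_in BIGINT UNSIGNED, IN t1_in BIGINT UNSIGNED)
BEGIN
  SELECT DISTINCT events.pv_name
  FROM events
  WHERE events.time_stamp > t0_in AND events.time_stamp < t1_in
    AND (events.value IS NULL OR events.alarm_severity = 'INVALID');
END //
DELIMITER ;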

Running the query with EXPLAIN gives:

+----+-------------+--------+-------+---------------+---------+---------+------+----------+-------------+
| id | select_type | table  | type  | possible_keys | key     | key_len | ref  | rows     | Extra       |
+----+-------------+--------+-------+---------------+---------+---------+------+----------+-------------+
|  1 | SIMPLE      | events | index | PRIMARY       | PRIMARY | 250     | NULL | 12724016 | Using where |
+----+-------------+--------+-------+---------------+---------+---------+------+----------+-------------+

Running the query on the database returns 102620 rows in 1 min 50.93 sec.

UPDATE

Suppose for simplicity that the table is the following:

CREATE TABLE `events` (
  `pv_name` varchar(60) COLLATE utf8mb4_unicode_ci NOT NULL,
  `time_stamp` bigint(20) unsigned NOT NULL,
  `value_valid` tinyint(1) NOT NULL,
  PRIMARY KEY (`pv_name`,`time_stamp`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=COMPRESSED;

Is it possible to add appropriate indexes so that the following or an equivalent query uses loose index scan optimization?

SELECT DISTINCT events.pv_name
FROM events
WHERE events.time_stamp > t0_in AND events.time_stamp < t1_in
AND events.value_valid = 0;
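For example, I would hope an index and query of the following form could qualify (the index name is mine; I am not sure whether the range on time_stamp, which is not a leading column of such an index, still permits a loose scan):

ALTER TABLE events ADD INDEX valid_pv_ts (value_valid, pv_name, time_stamp);

SELECT pv_name
FROM events
WHERE value_valid = 0
  AND time_stamp > t0_in AND time_stamp < t1_in
GROUP BY pv_name;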

UPDATE

If I add an index on time_stamp I get:
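(The index was added with something like the following; the key name matches what EXPLAIN reports.)

ALTER TABLE events ADD INDEX time_stamp (time_stamp);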

mysql> EXPLAIN SELECT DISTINCT events.pv_name FROM events WHERE events.time_stamp > 0 AND events.time_stamp < 11426224880000000000 AND (events.value IS NULL OR events.alarm_severity = 'INVALID');
+----+-------------+--------+-------+--------------------+---------+---------+------+----------+-------------+
| id | select_type | table  | type  | possible_keys      | key     | key_len | ref  | rows     | Extra       |
+----+-------------+--------+-------+--------------------+---------+---------+------+----------+-------------+
|  1 | SIMPLE      | events | index | PRIMARY,time_stamp | PRIMARY | 250     | NULL | 13261211 | Using where |
+----+-------------+--------+-------+--------------------+---------+---------+------+----------+-------------+

Running this query on the database returns 11511 rows in 30.44 sec.

mysql> EXPLAIN SELECT DISTINCT events.pv_name FROM events FORCE INDEX (time_stamp) WHERE events.time_stamp > 0 AND events.time_stamp < 11426224880000000000 AND (events.value IS NULL OR events.alarm_severity = 'INVALID');
+----+-------------+--------+-------+--------------------+------------+---------+------+---------+-----------------------------------------------------+
| id | select_type | table  | type  | possible_keys      | key        | key_len | ref  | rows    | Extra                                               |
+----+-------------+--------+-------+--------------------+------------+---------+------+---------+-----------------------------------------------------+
|  1 | SIMPLE      | events | range | PRIMARY,time_stamp | time_stamp | 8       | NULL | 6630605 | Using index condition; Using where; Using temporary |
+----+-------------+--------+-------+--------------------+------------+---------+------+---------+-----------------------------------------------------+

Running this query on the database returns 11511 rows in 2 min 20.41 sec.

UPDATE

From the suggestions I have changed the table to:

CREATE TABLE `events` (
  `pv_name` varchar(60) COLLATE utf8mb4_unicode_ci NOT NULL,
  `time_stamp` bigint(20) unsigned NOT NULL,
  `event_type` enum('add','init','update','disconnect','remove') COLLATE utf8mb4_unicode_ci NOT NULL,
  `value` text CHARACTER SET utf8mb4 COLLATE utf8mb4_bin,
  `value_type` varchar(40) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `value_count` bigint(20) DEFAULT NULL,
  `alarm_status` enum('NO_ALARM','READ','WRITE','HIHI','HIGH','LOLO','LOW','STATE','COS','COMM','TIMEOUT','HWLIMIT','CALC','SCAN','LINK','SOFT','BAD_SUB','UDF','DISABLE','SIMM','READ_ACCESS','WRITE_ACCESS') COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `alarm_severity` enum('NO_ALARM','MINOR','MAJOR','INVALID') COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  PRIMARY KEY (`pv_name`,`time_stamp`),
  KEY `event_type` (`event_type`,`time_stamp`),
  KEY `alarm_severity` (`alarm_severity`,`time_stamp`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=COMPRESSED;

and the query to:

SELECT DISTINCT events.pv_name
FROM events
WHERE events.time_stamp > 0 AND events.time_stamp < 1426224880000000000
AND alarm_severity = 'INVALID'
UNION
SELECT DISTINCT events.pv_name
FROM events
WHERE events.time_stamp > 0 AND events.time_stamp < 1426224880000000000
AND event_type = 'add'
UNION
SELECT DISTINCT events.pv_name
FROM events
WHERE events.time_stamp > 0 AND events.time_stamp < 1426224880000000000
AND event_type = 'disconnect'
UNION
SELECT DISTINCT events.pv_name
FROM events
WHERE events.time_stamp > 0 AND events.time_stamp < 1426224880000000000
AND event_type = 'remove';

Running EXPLAIN on the query gives:

+----+--------------+----------------+-------+-----------------------------------+----------------+---------+------+--------+-------------------------------------------+
| id | select_type  | table          | type  | possible_keys                     | key            | key_len | ref  | rows   | Extra                                     |
+----+--------------+----------------+-------+-----------------------------------+----------------+---------+------+--------+-------------------------------------------+
|  1 | PRIMARY      | events         | range | PRIMARY,event_type,alarm_severity | alarm_severity | 10      | NULL | 101670 | Using where; Using index; Using temporary |
|  2 | UNION        | events         | range | PRIMARY,event_type,alarm_severity | event_type     | 9       | NULL | 994652 | Using where; Using index; Using temporary |
|  3 | UNION        | events         | range | PRIMARY,event_type,alarm_severity | event_type     | 9       | NULL |  73660 | Using where; Using index; Using temporary |
|  4 | UNION        | events         | range | PRIMARY,event_type,alarm_severity | event_type     | 9       | NULL | 136348 | Using where; Using index; Using temporary |
| NULL | UNION RESULT | <union1,2,3,4> | ALL   | NULL                              | NULL           | NULL    | NULL |   NULL | Using temporary                           |
+----+--------------+----------------+-------+-----------------------------------+----------------+---------+------+--------+-------------------------------------------+

Running the query on the database returns 112620 rows in 1 min 2.45 sec.

Patrick

6 Answers


Without knowing much about your data, this won't be very specific, but I hope you will still find it useful.

Indexes and RAM

To keep performance at its best, you should always make sure that your indexes fit in RAM. That may often be the case, but once tables reach the order of millions of rows, it's worth a look. You can find quite a lot of info about how to check this on this SO question. Why is it important? If the indexes don't fit, parts of them will be read from and swapped back to the hard drive as the query runs, which is going to be very slow. If you can avoid that simply (by increasing the RAM the engine can use), do so.

Partitioning

You already use a primary key, which is a good thing, but you could also use partitioning. The idea is pretty simple: instead of storing everything in a single table, MySQL will automatically create the equivalent of sub-tables that each contain only a certain range of values (it's a bit more complicated than that, but let's say ranges of values for now). It is all transparent when using SELECT, UPDATE or DELETE, so there's no refactoring involved for your queries. I recommend having a look at this very neat presentation about partitions. The documentation is also really good on that aspect. You will see, for example, that you can use partitions of different sizes: if you partition on the timestamp and you know that the most recent data is accessed far more often than old data, you can create 7 partitions for the last 7 days, then 4 partitions for the previous 4 weeks, then 12 partitions for the last 12 months, and so on. But that requires some analysis on your end.
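For illustration, range partitioning on this table might look like the following sketch (the boundary values are hypothetical nanosecond timestamps; MySQL requires the partitioning column to appear in every unique key, which the (pv_name, time_stamp) primary key satisfies):

ALTER TABLE events
PARTITION BY RANGE (time_stamp) (
    PARTITION p0   VALUES LESS THAN (1300000000000000000),  -- hypothetical boundary
    PARTITION p1   VALUES LESS THAN (1400000000000000000),  -- hypothetical boundary
    PARTITION pmax VALUES LESS THAN MAXVALUE
);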

Better keys

For the previous point, and also because it's way cleaner, I highly recommend changing the bigint type of your timestamp to a native MySQL date/time type, as @Huy Nguyen suggested. As an end note, his remark about alarm_status and alarm_severity is a good one: if these can only take a fixed set of values, you should switch to an integer (or enumerated) type, which will let you use them more efficiently in keys and partitions.

Update

Concerning your update, I'm not really well-versed in loose index scan optimization, but adding a key on (value_valid, time_stamp) seems to reduce the number of rows examined (according to EXPLAIN) and is systematically the key chosen (rather than the primary key already defined). I have quite a reduced set of data, so it would be worth trying on yours. To talk numbers: with just the primary key you have defined, a sample query gives key_len: 250, rows: 242; with my additional key, key_len: 9, rows: 106.
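Concretely, the key I added was of this form (the index name is mine):

ALTER TABLE events ADD INDEX valid_ts (value_valid, time_stamp);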

Loufylouf
  • Thank you. The `time_stamp` encodes GPS time in nanoseconds, so I think it has to be a bigint. I can probably change the `event_type`, `value_type`, `alarm_status` and `alarm_severity` to enumerated types. I do actually plan on adding partitions in the manner you describe. I was kind of hoping there would be some way to formulate the query to use loose index scan optimization, but maybe that is not possible? – Patrick Mar 21 '16 at 22:44
  • I was changing other things, so I'm not certain, but it appears that increasing the 'innodb buffer pool size' to 8 GB has helped bring the query time down to around 18 seconds. – Patrick Mar 22 '16 at 00:24
  • That's great, it's almost one order of magnitude better. Concerning your timestamp, do you really need the nanosecond precision? MySQL seems to be able to store date/time values down to the [microsecond](http://dev.mysql.com/doc/refman/5.7/en/fractional-seconds.html), and you could always store the nanosecond part in a separate column if you don't use it that often. – Loufylouf Mar 22 '16 at 06:58
  • Unfortunately I do need the nanosecond precision. I am recording events that may have the same time otherwise. – Patrick Mar 22 '16 at 16:21

You should add an index on

events.time_stamp 

and an index on

events.alarm_severity

could also be useful.
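As DDL, that suggestion amounts to something like (the index names are mine):

ALTER TABLE events
    ADD INDEX idx_time_stamp (time_stamp),
    ADD INDEX idx_alarm_severity (alarm_severity);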
ScaisEdge

Also add

pv_name

to the index so it doesn't do a table scan. (Note: INCLUDE is SQL Server syntax; MySQL has no INCLUDE clause. In MySQL you append the column to the index to get the same covering effect.)
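In MySQL, the equivalent would be a sketch like this (the index name is mine):

ALTER TABLE events ADD INDEX ts_pv (time_stamp, pv_name);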


Some possible tips, in order of theoretical improvement:

  1. Try locking the table with LOCK TABLES before the SELECT, then UNLOCK TABLES afterwards. I believe locking the table does speed things up, since the engine does not have to worry about updates to the table while it selects, and can therefore grab data more efficiently.

    Wrapping it in a BEGIN/COMMIT transaction sequence may also help in some cases, but usually with INSERTs/UPDATEs rather than SELECTs.

  2. It may help to make these indexes as well: time_stamp, value, alarm_severity.

  3. If possible, change alarm_severity from varchar(40) to char(40). CHAR is faster to search than VARCHAR, though it takes up more space. Or change alarm_severity into an integer instead of a string, so it can be indexed even faster. Or add an additional integer counterpart such as alarm_severity_code, which would be faster to index and search.

  4. The index you create for alarm_severity can be limited to just 10 characters or so. I believe that will make searching faster (depending on your dataset) while still allowing up to 40 chars for the field itself. If the values are like 'INVALID', then 10 should be enough to index on.

  5. Maybe add a has_value field that is indexable, instead of searching for value IS NULL, since a TEXT column is not indexable. The flag needs to be assigned as you add/edit records.

  6. Does time_stamp really need to be a bigint? It may be more efficient to use the TIMESTAMP data type.

  7. Does it have to be ROW_FORMAT=COMPRESSED? Decompressing the data on every query sounds like it would slow things down.

So the suggested table structure may be like:

CREATE TABLE IF NOT EXISTS `events` (
  `pv_name` varchar(60) COLLATE utf8mb4_unicode_ci NOT NULL,
  `time_stamp` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  `event_type` varchar(40) COLLATE utf8mb4_unicode_ci NOT NULL,
  `has_value` int(11) NOT NULL DEFAULT '0',
  `value` text CHARACTER SET utf8mb4 COLLATE utf8mb4_bin,
  `value_type` varchar(40) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `value_count` bigint(20) DEFAULT NULL,
  `alarm_status` varchar(40) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `alarm_severity` char(40) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  PRIMARY KEY (`pv_name`,`time_stamp`),
  KEY `time_stamp` (`time_stamp`),
  KEY `alarm_severity` (`alarm_severity`(10)),
  KEY `has_value` (`has_value`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=COMPRESSED;
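With that structure, the original query might be rewritten to use the indexable flag (a sketch; it assumes has_value is kept in sync on every insert/update, and that t0_in/t1_in are converted to matching TIMESTAMP values):

SELECT DISTINCT pv_name
FROM events
WHERE time_stamp > t0_in AND time_stamp < t1_in
  AND (has_value = 0 OR alarm_severity = 'INVALID');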

It does come down a lot to disk space versus speed. You can also break your dataset into separate tables, one for a certain value or a certain alarm_severity, so each query can run on a smaller table.

Huy
  • Thank you. The time_stamp encodes GPS time in nanoseconds, so I think it has to be a bigint. I can probably change the `event_type`, `value_type`, `alarm_status` and `alarm_severity` to enumerated types. I was considering adding a boolean `is_valid` column to search on instead of checking the `value` and `alarm_severity` columns. I have been trying to formulate the query and indexes to use loose index scan optimization, but maybe this is not possible? – Patrick Mar 21 '16 at 23:00
  • Sorry, but I feel compelled to down-vote this. There are too many bits of mis-information. I have responded to most of them in the Rebuttal section of my Answer. – Rick James Mar 22 '16 at 18:24

Performance improvement

The "index scan" is against PRIMARY, so it is really a table scan, which is about the slowest possible way.

You need

INDEX(time_stamp)

The PRIMARY KEY(pv_name, time_stamp) is not useful because the leading field(s) (pv_name) are not helpful for the WHERE or GROUP BY or ORDER BY.

Caveat: If it fails to switch to the new index, you may need to use "prepare" in the SP.
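A sketch of that workaround inside the procedure, building the statement with CONCAT so the optimizer sees literal values (this matches the suggestion in the comments below):

SET @sql = CONCAT(
    'SELECT DISTINCT pv_name FROM events',
    ' WHERE time_stamp > ', t0_in,
    ' AND time_stamp < ', t1_in,
    ' AND (value IS NULL OR alarm_severity = ''INVALID'')');
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;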

An index on alarm_severity will not help because it is hidden in OR.

You could swap the order of the fields in the PRIMARY KEY, but this might hurt other queries, and it would take a long time to do the ALTER.

Cookbook on creating indexes.

Better improvement (except that it won't work)

Because of OR, this part of the WHERE cannot be optimized:

AND (events.value IS NULL OR
     events.alarm_severity = 'INVALID')

There is one hope: Turn the OR into UNION:

      ( SELECT  DISTINCT events.pv_name
            FROM  events
            WHERE  events.time_stamp > t0_in
              AND  events.time_stamp < t1_in
              AND  events.value IS NULL 
      )
    UNION  DISTINCT 
      ( SELECT  DISTINCT events.pv_name
            FROM  events
            WHERE  events.time_stamp > t0_in
              AND  events.time_stamp < t1_in
              AND  events.alarm_severity = 'INVALID' 
      );

and add

INDEX(alarm_severity, time_stamp) -- in that order
INDEX(value , time_stamp) -- in that order

BUT -- and it is a big BUT -- that won't work because value is TEXT. If value could be changed to VARCHAR(191), then it would work. Even better would be ENUM. (No, "prefix index" is not smart enough.)
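If shortening is acceptable, the change would be something along these lines (a sketch; safe only if no stored value exceeds 191 characters, the utf8mb4 limit for a 767-byte index):

ALTER TABLE events MODIFY `value` VARCHAR(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;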

Rebuttals

Yeah, indexes should fit in RAM. But often you don't have a choice.

PARTITIONing is rarely useful. I don't see it beneficial in this case.

I can probably change the event_type, value_type, alarm_status and alarm_severity to enumerated types.

Do it! Assuming this is a very large table, that would shrink the table size a lot, thereby making it faster -- especially if it is I/O-bound now.

PARTITIONs of different sizes -- That is nice, but there is a "problem" when you need to roll 4 weeks into 1 month (or whatever). It effectively blocks activity during the consolidation. And, since having no more than about 50 partitions is desirable for other (performance) reasons, rolling up will eventually be 'necessary'.

innodb_buffer_pool_size should be set to about 70% of available RAM. This is the most important tunable.

Nanosecond -- Check the data; I suspect you have dups. Granted that should be precise enough, but what is the algorithm that provides the clock? It could be allowing dups. (I'm less worried about the 8 bytes for it.)
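A quick sanity check of the clock's actual resolution (a sketch): if the source is truly nanosecond-precise, very few stored values should be exactly microsecond-aligned.

SELECT SUM(time_stamp % 1000 = 0) / COUNT(*) AS frac_usec_aligned
FROM events;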

For InnoDB, use BEGIN...COMMIT where appropriate for transactional integrity. Do not use LOCK TABLES.

Individual indexes on value and alarm_severity are useless for this query. (Yet time_stamp is useful.)

"change varchar(40) to char(40)" -- NO! There is almost no case where CHAR is better. And not in this case.

KEY alarm_severity (alarm_severity(10)) -- Prefix indexes are almost never beneficial. Especially when it is VARCHAR and the value is usually short anyway.

Rick James
  • I've updated the question to report on adding an index to `time_stamp`. Unfortunately it appears to run slower. Since it is looking for distinct values of `pv_name`, I would have guessed that an index on `pv_name` should be used? – Patrick Mar 23 '16 at 01:14
  • Indexes are used for filtering and/or ordering, not for values. – Rick James Mar 23 '16 at 05:15
  • Since it went slower, I suspect that the Optimizer failed to do the obvious and use that new index. Change from simply executing the `SELECT` into using `CONCAT` to construct the `SELECT` with the `t0_in` and `t1_in` values stitched in. Then use `prepare`. Alternatively, you could add `USE INDEX(time_stamp)` to the `SELECT`. – Rick James Mar 23 '16 at 05:18
  • I used force index and the explain said that it was using it. – Patrick Mar 23 '16 at 16:38
  • Ah. It is still scanning half the table. (That is, time_stamp is not much of a filter.) So, using the index is _not_ efficient. However, if the PK _started_ with time_stamp, it would be somewhat useful. _Or_ the `UNION` with the two new indexes will help. (Again, cardinality may get in the way of helping much.) How many rows do you expect? – Rick James Mar 23 '16 at 18:52
  • The most would be on the order of around 100,000 rows. – Patrick Mar 23 '16 at 19:13
  • 100K rows will take some time. Still, the advice so far should help significantly: `UNION` + indexes; ENUMs and/or normalization; `innodb_buffer_pool_size`; etc. (It's hard to say which will give you the most bang for the buck.) – Rick James Mar 23 '16 at 19:39
  • According to the estimates in the `EXPLAIN`, that 4-way `UNION` hit about 1.2M rows. And only hit them in the indexes, not the table itself. This is probably the minimum number of rows to hit. The sluggishness is possibly due to the tmp tables and de-dupping. – Rick James Mar 24 '16 at 05:28

The other approach to speeding up large queries on large tables is to build and maintain a "Summary table".

Let's say you typically want to look at "hours" (as opposed to days or months, etc.). The summary table for this query (and some other queries) would be something like:

CREATE TABLE foo (
    hr MEDIUMINT UNSIGNED NOT NULL,  -- derived from time_stamp; see below
    alarm_severity ...,  -- preferably an ENUM, not VARCHAR
    event_type ...,
    pv_name ...,
    ct INT UNSIGNED,  -- if you want to know how many
    PRIMARY KEY(hr, alarm_severity, event_type, pv_name)  -- pv_name included, since the hourly INSERT below groups by all four columns
) ENGINE=InnoDB;

After the end of each hour:

INSERT INTO foo
    SELECT FLOOR(time_stamp / 3600e9),
           alarm_severity, event_type, pv_name,
           COUNT(*)
        FROM events
        WHERE time_stamp >= ...  -- start of previous hour
          AND time_stamp < ...   -- and end
        GROUP BY 1,2,3,4;
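One way to automate the hourly load is a scheduled event (a sketch; this assumes the event scheduler is enabled, and it treats time_stamp as Unix-epoch nanoseconds for the hour arithmetic -- since the real column holds GPS time, the bounds would need the appropriate offset):

CREATE EVENT summarize_events_hourly
ON SCHEDULE EVERY 1 HOUR
DO
  INSERT INTO foo
    SELECT FLOOR(time_stamp / 3600e9),
           alarm_severity, event_type, pv_name,
           COUNT(*)
        FROM events
        -- the previous full hour: [hour_start - 1h, hour_start)
        WHERE time_stamp >= (FLOOR(UNIX_TIMESTAMP() / 3600) - 1) * 3600e9
          AND time_stamp <   FLOOR(UNIX_TIMESTAMP() / 3600) * 3600e9
        GROUP BY 1, 2, 3, 4;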

Then the original query becomes

SELECT  DISTINCT pv_name
    FROM  foo
    WHERE  hr >= t0_in / 3600e9
      AND  hr <  t1_in / 3600e9
      AND ( alarm_severity = 'INVALID'
       OR   event_type IN ('add', 'disconnect', 'remove')
          );

The final SELECT will easily run in under 1 second. But it requires that the data not change once it is inserted, etc.

You previously had AND value IS NULL. That test may be folded into the INSERT..SELECT, or you may need value_is_null as a true/false flag in foo and in its PK.

More on Summary tables.

Rick James