Efficient MySQL query for huge set of data

Question

Say i have a table like below:

CREATE TABLE `hadoop_apps` (
  `clusterId` smallint(5) unsigned NOT NULL,
  `appId` varchar(35) COLLATE utf8_unicode_ci NOT NULL,
  `user` varchar(64) COLLATE utf8_unicode_ci NOT NULL,
  `queue` varchar(35) COLLATE utf8_unicode_ci NOT NULL,
  `appName` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
  `submitTime` datetime NOT NULL COMMENT 'App submission time',
  `finishTime` datetime DEFAULT NULL COMMENT 'App completion time',
  `elapsedTime` int(11) DEFAULT NULL COMMENT 'App duration in milliseconds',
  PRIMARY KEY (`clusterId`,`appId`,`submitTime`),
  KEY `hadoop_apps_ibk_finish` (`finishTime`),
  KEY `hadoop_apps_ibk_queueCluster` (`queue`,`clusterId`),
  KEY `hadoop_apps_ibk_userCluster` (`user`(8),`clusterId`),
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

mysql> SELECT COUNT(*) FROM hadoop_apps;

This would return me a count 158593816

So I am trying to understand what is inefficient about the below query and how I can improve it.

mysql> SELECT * FROM hadoop_apps WHERE DATE(finishTime)='10-11-2013';

Also, what's the difference between these two queries?

mysql> SELECT * FROM hadoop_apps WHERE user='foobar';
mysql> SELECT * FROM hadoop_apps HAVING user='foobar';

score 3 · Accepted Answer · answered Sep 15 '17 at 22:26

WHERE DATE(finishTime)='10-11-2013';

This is a problem for the optimizer because anytime you put a column into a function like this, the optimizer doesn't know if the order of values returned by the function will be the same as the order of values input to the function. So it can't use an index to speed up lookups.

To solve this, refrain from putting the column inside a function call like that, if you want the lookup against that column to use an index.

Also, you should use MySQL standard date format: YYYY-MM-DD.

WHERE finishTime BETWEEN '2013-10-11 00:00:00' AND '2013-10-11 23:59:59'

What is the difference between [conditions in WHERE and HAVING clauses]?

The WHERE clause is for filtering rows.

The HAVING clause is for filtering results after applying GROUP BY.

See SQL - having VS where

Also one more thing, does WHERE and HAVING has an impact on the query that i have mentioned? For ones without any GROUP BY clause. — paulophoenix, Sep 15 '17 at 22:52
@paulophoenix, you should use WHERE for your query. Don't use HAVING unless you're filtering groups. — Bill Karwin, Sep 18 '17 at 02:26

score 1 · Answer 2 · answered Sep 15 '17 at 23:24

If WHERE works, it is preferred over HAVING. The former is done earlier in the processing, thereby cutting down on the amount of data to shovel through. OK, in your one example, there may be no difference between them.

I cringe whenever I see a DATETIME in a UNIQUE key (your PK). Can't the app have two rows in the same second? Is that a risk you want to take.

Even changing to DATETIME(6) (microseconds) could be risky.

Regardless of what you do in that area, I recommend this pattern for testing:

WHERE finishTime >= '2013-10-11'
  AND finishTime  < '2013-10-11' + INTERVAL 1 DAY

It works "correctly" for DATE, DATETIME, and DATETIME(6), etc. Other flavors add an extra midnight or miss parts of a second. And it avoids hassles with leapdays, etc, if the interval is more than a single day.

KEY `hadoop_apps_ibk_userCluster` (`user`(8),`clusterId`)

is bad. It won't get past user(8). And prefixing like that is often useless. Let's see the query that tempted you to build that key; we'll come up with a better one.

158M rows with 4 varchars. And they sound like values that don't have many distinct values? Build lookup tables and replace them with SMALLINT UNSIGNED (2 bytes, 0..64K range) or other small id. This will significantly shrink the table, thereby making it faster.

Efficient MySQL query for huge set of data

2 Answers2