0

I have a table process_execution_data ped and I would like to self-join it.

Basically, this table holds the key, value pairs of existing variables during a execution piid of a process process_id.

I've written two very similar queries to do it, their only difference is that one has an extra search condition in the ON for each INNER JOIN.

Query1 - with x.process_id = y.process_id in each ON/JOIN:

SELECT
  queue.value
FROM
  process_execution_data queue
INNER JOIN process_execution_data status_code
ON status_code.process_id = queue.process_id
AND status_code.piid = queue.piid
AND status_code.key_name = 'status_code'
INNER JOIN process_execution_data ref_num
ON ref_num.process_id = queue.process_id
AND ref_num.piid = queue.piid
AND ref_num.key_name = 'ref_num'
WHERE
  queue.process_id = 132
  AND queue.key_name = 'queue'
  AND queue.value = "type_a"
  AND queue.created BETWEEN FROM_UNIXTIME(x) AND FROM_UNIXTIME(y)

Query2

SELECT
  queue.value
FROM
  process_execution_data queue
INNER JOIN process_execution_data status_code
ON status_code.piid = queue.piid
AND status_code.key_name = 'status_code'
INNER JOIN process_execution_data ref_num
ON ref_num.piid = queue.piid
AND ref_num.key_name = 'ref_num'
WHERE
  queue.process_id = 132
  AND queue.key_name = 'queue'
  AND queue.value = "type_a"
  AND queue.created BETWEEN FROM_UNIXTIME(x) AND FROM_UNIXTIME(y)

Query1 runs in 80s~, meanwhile Query2 runs in 1.3s.

What gives such huge difference? Both are using indexes properly (albeit different indexes). But if anything, shouldn't Query1 run faster, since it filters more rows and as well as "lower cost" (as pointed out by the EXPLAIN)?


Additional ped table info

SHOW CREATE TABLE `process_execution_data`
>>>
 CREATE TABLE `process_execution_data` (
   `id` int(11) NOT NULL AUTO_INCREMENT,
   `key_name` varchar(512) NOT NULL,
   `value` longtext,
   `piid` varchar(512) NOT NULL,
   `created` datetime DEFAULT NULL,
   `modified` datetime DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
   `process_id` int(11) DEFAULT NULL,
   `flags` varchar(512) DEFAULT NULL,
   PRIMARY KEY (`id`),
   KEY `all` (`process_id`,`key_name`,`modified`,`created`),
   KEY `key_piid_modified_created` (`key_name`,`piid`,`modified`,`created`),
   KEY `processid_key` (`process_id`,`key_name`),
   KEY `processid` (`process_id`),
   KEY `key` (`key_name`),
   KEY `piid` (`piid`),
   KEY `created` (`created`),
   KEY `modified` (`modified`)
 ) ENGINE=InnoDB AUTO_INCREMENT=31134333 DEFAULT CHARSET=latin1
Query1 EXPLAIN FORMAT=JSON SELECT
>>>
 '{
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "722877.86"
    },
    "nested_loop": [
      {
        "table": {
          "table_name": "status_code",
          "access_type": "ref",
          "possible_keys": [
            "all",
            "key_piid_modified_created",
            "processid_key",
            "processid",
            "key",
            "piid"
          ],
          "key": "all",
          "used_key_parts": [
            "process_id",
            "key_name"
          ],
          "key_length": "519",
          "ref": [
            "const",
            "const"
          ],
          "rows_examined_per_scan": 265312,
          "rows_produced_per_join": 265312,
          "filtered": "100.00",
          "cost_info": {
            "read_cost": "265312.00",
            "eval_cost": "53062.40",
            "prefix_cost": "318374.40",
            "data_read_per_join": "528M"
          },
          "used_columns": [
            "key_name",
            "piid",
            "process_id"
          ]
        }
      },
      {
        "table": {
          "table_name": "queue",
          "access_type": "ref",
          "possible_keys": [
            "all",
            "key_piid_modified_created",
            "processid_key",
            "processid",
            "key",
            "piid",
            "created"
          ],
          "key": "key_piid_modified_created",
          "used_key_parts": [
            "key_name",
            "piid"
          ],
          "key_length": "1028",
          "ref": [
            "const",
            "brain.status_code.piid"
          ],
          "rows_examined_per_scan": 1,
          "rows_produced_per_join": 13265,
          "filtered": "4.13",
          "index_condition": "(`brain`.`queue`.`created` between <cache>(from_unixtime(1659841200)) and <cache>(from_unixtime(1660445999)))",
          "cost_info": {
            "read_cost": "321034.49",
            "eval_cost": "2653.12",
            "prefix_cost": "703615.79",
            "data_read_per_join": "26M"
          },
          "used_columns": [
            "key_name",
            "value",
            "piid",
            "created",
            "process_id"
          ],
          "attached_condition": "((`brain`.`queue`.`process_id` = `brain`.`status_code`.`process_id`) and (`brain`.`queue`.`value` = ''type_a''))"
        }
      },
      {
        "table": {
          "table_name": "ref_num",
          "access_type": "ref",
          "possible_keys": [
            "all",
            "key_piid_modified_created",
            "processid_key",
            "processid",
            "key",
            "piid"
          ],
          "key": "key_piid_modified_created",
          "used_key_parts": [
            "key_name",
            "piid"
          ],
          "key_length": "1028",
          "ref": [
            "const",
            "brain.status_code.piid"
          ],
          "rows_examined_per_scan": 1,
          "rows_produced_per_join": 5661,
          "filtered": "35.27",
          "cost_info": {
            "read_cost": "16051.73",
            "eval_cost": "1132.38",
            "prefix_cost": "722877.86",
            "data_read_per_join": "11M"
          },
          "used_columns": [
            "key_name",
            "piid",
            "process_id"
          ],
          "attached_condition": "(`brain`.`ref_num`.`process_id` = `brain`.`status_code`.`process_id`)"
        }
      }
    ]
  }
}'
Query2 EXPLAIN FORMAT=JSON SELECT
>>>
'{
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "974280.63"
    },
    "nested_loop": [
      {
        "table": {
          "table_name": "ref_num",
          "access_type": "ref",
          "possible_keys": [
            "key_piid_modified_created",
            "key",
            "piid"
          ],
          "key": "key_piid_modified_created",
          "used_key_parts": [
            "key_name"
          ],
          "key_length": "514",
          "ref": [
            "const"
          ],
          "rows_examined_per_scan": 529372,
          "rows_produced_per_join": 529372,
          "filtered": "100.00",
          "using_index": true,
          "cost_info": {
            "read_cost": "66172.38",
            "eval_cost": "105874.40",
            "prefix_cost": "172046.78",
            "data_read_per_join": "1G"
          },
          "used_columns": [
            "key_name",
            "piid"
          ]
        }
      },
      {
        "table": {
          "table_name": "queue",
          "access_type": "ref",
          "possible_keys": [
            "all",
            "key_piid_modified_created",
            "processid_key",
            "processid",
            "key",
            "piid",
            "created"
          ],
          "key": "key_piid_modified_created",
          "used_key_parts": [
            "key_name",
            "piid"
          ],
          "key_length": "1028",
          "ref": [
            "const",
            "brain.ref_num.piid"
          ],
          "rows_examined_per_scan": 1,
          "rows_produced_per_join": 26468,
          "filtered": "4.13",
          "index_condition": "(`brain`.`queue`.`created` between <cache>(from_unixtime(1659841200)) and <cache>(from_unixtime(1660445999)))",
          "cost_info": {
            "read_cost": "640554.02",
            "eval_cost": "5293.72",
            "prefix_cost": "940711.60",
            "data_read_per_join": "52M"
          },
          "used_columns": [
            "key_name",
            "value",
            "piid",
            "created",
            "process_id"
          ],
          "attached_condition": "((`brain`.`queue`.`value` = ''type_a'') and (`brain`.`queue`.`process_id` = 132))"
        }
      },
      {
        "table": {
          "table_name": "status_code",
          "access_type": "ref",
          "possible_keys": [
            "key_piid_modified_created",
            "key",
            "piid"
          ],
          "key": "key_piid_modified_created",
          "used_key_parts": [
            "key_name",
            "piid"
          ],
          "key_length": "1028",
          "ref": [
            "const",
            "brain.ref_num.piid"
          ],
          "rows_examined_per_scan": 1,
          "rows_produced_per_join": 32027,
          "filtered": "100.00",
          "using_index": true,
          "cost_info": {
            "read_cost": "27163.49",
            "eval_cost": "6405.54",
            "prefix_cost": "974280.63",
            "data_read_per_join": "63M"
          },
          "used_columns": [
            "key_name",
            "piid"
          ]
        }
      }
    ]
  }
}'
philipxy
  • 14,867
  • 6
  • 39
  • 83
  • Tiny mnemonic aliases make code much easier to read. Table format is much more readable then JSON. The SO editor has (besides 2 code block formats) a special table format. Although it has limited markdown in cells.) – philipxy Aug 18 '22 at 01:21
  • [How to Optimize Queries in a Database - The Basics](https://stackoverflow.com/q/3191623/3404097) [What are your most common sql optimizations?](https://stackoverflow.com/q/1332778/3404097) [When and why are database joins expensive?](https://stackoverflow.com/q/173726/3404097) [Tips for asking a good SQL question](https://meta.stackoverflow.com/q/271055/3404097) [Asking query performance questions](https://dba.meta.stackexchange.com/q/3034/43932) – philipxy Aug 18 '22 at 01:58
  • What?? `SELECT value ... WHERE value = '...'` -- What's the purpose of the query when you know the answer? (Well, there could be zero or more than 1 copy of "type_a" in the output, but wtf?) – Rick James Aug 18 '22 at 17:13

2 Answers2

1

To really have better optimization of your query, index should be more complete. You have partial vs all fields -- in order for your WHERE clause. I would suggest an index

ON ( Process_id, Key_Name, Value, Created)

This will optimize your primary FROM table regardless of the joins.

Your JOINs in the first query are using 3 columns, where your second query only 2 columns. One would probably think the second query processes slower as its possible to return multiple entries spanning multiple process IDs but having join on the PIID.

For optimization on your secondary JOIN tables, have an index on those criteria via ( Process_id, piid, key_name ).

And your second query, an index on ( key_name, piid ), so within each respective JOIN, all the same key_name index components are grouped (status_code vs ref_num), THEN the piids subsorted. I suspect this might give better performance.

I would also suggest removing redundant indexes. For example, you have indexes on

   KEY `all` (`process_id`,`key_name`,`modified`,`created`),
   KEY `key_piid_modified_created` (`key_name`,`piid`,`modified`,`created`),
   KEY `processid_key` (`process_id`,`key_name`),
   KEY `processid` (`process_id`),
   KEY `key` (`key_name`),
   KEY `piid` (`piid`),
   KEY `created` (`created`),
   KEY `modified` (`modified`)

In the above case, the index processid is redundant from the ALL and PROCESSID_KEY index.

ProcessID_Key is also redundant of the ALL index.

Having the stand-alone column indexes is not the greatest plan unless you are explicitly querying on ONLY that column and no other critical elements of importance. They just take up space and index pages of data.

With respect to order of columns in the index, it IS important. You are querying based on when records are CREATED, yet your ALL index has modified, THEN created column. If you think about it, if you are ever trying to filter based on a MODIFIED, it would have to be CREATED FIRST, and modified only on/after when created. So would it not make sense to order by the CREATED DATE first, then take into consideration modified.

Anyhow, just some observations and hopefully clarification on better indexing options for you.

Rick James
  • 135,179
  • 13
  • 127
  • 222
DRapp
  • 47,638
  • 12
  • 72
  • 142
  • Columns .created and .modified are interchangeable in this use case - good point. I made sure to use modified to use proper use of indexes. About ( Process_id, Key_Name, Value, Created) missing INDEX, at first the team responsible for the db didn't feel the need to INDEX value as it is a LONGTEXT, but I will bring this item once again to discussion and see the possibilities (as well figure out the proper key length/indexing for the column) (+https://stackoverflow.com/questions/1827063/mysql-error-key-specification-without-a-key-length). Thanks for the input so far DRapp, I will update as I can – Thiago Pestitschek Aug 18 '22 at 04:14
  • @ThiagoPestitschek, ok on VALUE column. Makes sense, but from sample text, would not have thought a LONGTEXT for the sample context in your question. Keeping the created in that index would still be better to keep. – DRapp Aug 18 '22 at 11:31
  • re: CREATED and MODIFIED -- When there are two _range_ tests, the first one may make use of the column in the index; the second one cannot. – Rick James Aug 18 '22 at 17:21
  • You cannot include any sized `TEXT` in an `INDEX`. And "index prefixing" (such as `... value(30), ...`) works, but won't help. – Rick James Aug 18 '22 at 17:23
  • @RickJames, thanks on that, and the edit update. I was not looking at the data type specifically when originally suggesting the index for that column. – DRapp Aug 18 '22 at 18:35
  • @RickJames Pardon if the question seems silly, but ["index prefixing" (such as ... value(30), ...) works, but won't help]. Why is that the case ("why won't it help?") ? – Thiago Pestitschek Aug 18 '22 at 20:16
  • @ThiagoPestitschek - The use of the index columns mostly stops when it hits either a column not being tested with `=`; that effectively includes the cases where you don't have the whole column in the `INDEX`. By "stops", I mean that it won't get to any columns after it – Rick James Aug 18 '22 at 20:53
  • @ThiagoPestitschek - I, too, almost recommended `INDEX(process_id, key_name, value, created, piid)` as beubg :oerfect: A way to count the row-hits is with "Handler" counts: ( http://mysql.rjweb.org/doc.php/index_cookbook_mysql#handler_counts ); I use it to demonstrate the failure to touch only the rows needed. – Rick James Aug 18 '22 at 20:58
  • @RickJames - [1/2] Still on the subject of INDEX prefixing *[Index stops when it hits =, that includes the cases where you don't have the whole column in the INDEX]*. Since the whole `LONGTEXT` column is not INDEXed (since it can't `LONGTEXT` and we'd have to index prefix, ..value(30) or any other key length). Does it mean the column somehow is no longer effectively tested with `=` in the inner workings of InnoDB **OR** is it the case only for the rows where `value` is longer than the size of the keylength (since the indexed prefix is too short to be able to recover/filter the rows properly)? – Thiago Pestitschek Aug 19 '22 at 05:38
  • @RickJames - [2/2] Asking because, for the use case of this question, the `queue.value = "xx"` is short and realistic - Most values for column `value` are indeed short (<30 length). In short, wondering whether the INDEX on value is **needed+helpful** (as suggested by @DRapp and you [https://stackoverflow.com/questions/73214971/query-runs-duration-fast-fetch-slow-but-group-by-or-any-aggregate-function] or **worthless/unnecessary** *([And "index prefixing" (such as ... value(30), ...) works, but won't help]*. Thank you (both!) so far for the immense feedback. – Thiago Pestitschek Aug 19 '22 at 05:38
  • @ThiagoPestitschek - I _suspect without proof_ that some newer versions of MySQL (or MariaDB) achieve the points you mention. I "believe" that old versions blindly punted when a prefix-index was encountered. Do some experimenting and write blog or Question or bug based on what you find. Meanwhile, I prefer to provide answers that work on all versions on this open forum. Some readers are still using 5.1. – Rick James Aug 19 '22 at 14:55
0

"Cost" is a good metric. But is it not perfect. It probably has no histogram of index distribution, especially with composite indexes. Nor does it take into account I/O needed, especially for the current state of the cache (buffer_pool).

These should be optimal for both queries:

queue:  INDEX(process_id, key_name, created,  piid)
status_code:  INDEX(piid, key_name, process_id)
ref_num:      INDEX(piid, key_name, process_id)

When adding a composite index, DROP index(es) with the same leading columns. That is, when you have both INDEX(a) and INDEX(a,b), toss the former.

EXPLAIN says that it stopped after 2 used_key_parts.

If value were not LONGTEXT, it could be included in the index. This is one of several serious problems with the EAV schema pattern.

Index Cookbook

Rick James
  • 135,179
  • 13
  • 127
  • 222