I'm looking for some advice on how I might better optimize this query.
For each _piece_detail
record that:
- Contains at least one matching
_scan
record on (zip, zip_4, zip_delivery_point, serial_number) - Belongs to a company from
mailing_groups
(through a chain of relationships) - Has either:
first_scan_date_time
that is greater than theMIN(scan_date_time)
of the related _scan recordslatest_scan_date_time
that is less than theMAX(scan_date_time)
of the related_scan
records
I will need to:
- Set
_piece_detail.first_scan_date_time
toMIN(_scan.scan_date_time)
- Set
_piece_detail.latest_scan_date_time
toMAX(_scan.scan_date_time)
Since I'm dealing with millions upon millions of records, I am trying to reduce the number of records that I actually have to search through. Here are some facts about the data:
- The _piece_details table is partitioned by
job_id
, so it seems to make the most sense to run through these checks in the order of_piece_detail.job_id
,_piece_detail.piece_id
. - The scan records table contains over 100,000,000 records right now and is partitioned by (zip, zip_4, zip_delivery_point,
serial_number, scan_date_time), which is the same key that is used
to match a _scan with a
_piece_detail
(aside from scan_date_time). - Only about 40% of the
_piece_detail
records belong to amailing_group
, but we don't know which ones these are until we run through the full relationship of joins. - Only about 30% of the _scan records belong to a
_piece_detail
with amailing_group
. - There are typically between 0 and 4
_scan
records per_piece_detail
.
Now, I am having a hell of a time finding a way to execute this in a decent way. I had originally started with something like this:
UPDATE _piece_detail
INNER JOIN (
SELECT _piece_detail.job_id, _piece_detail.piece_id, MIN(_scan.scan_date_time) as first_scan_date_time, MAX(_scan.scan_date_time) as latest_scan_date_time
FROM _piece_detail
INNER JOIN _container_quantity
ON _piece_detail.cqt_database_id = _container_quantity.cqt_database_id
AND _piece_detail.job_id = _container_quantity.job_id
INNER JOIN _container_summary
ON _container_quantity.container_id = _container_summary.container_id
AND _container_summary.job_id = _container_quantity.job_id
INNER JOIN _mail_piece_unit
ON _container_quantity.mpu_id = _mail_piece_unit.mpu_id
AND _container_quantity.job_id = _mail_piece_unit.job_id
INNER JOIN _header
ON _header.job_id = _piece_detail.job_id
INNER JOIN mailing_groups
ON _mail_piece_unit.mpu_company = mailing_groups.mpu_company
INNER JOIN _scan
ON _scan.zip = _piece_detail.zip
AND _scan.zip_4 = _piece_detail.zip_4
AND _scan.zip_delivery_point = _piece_detail.zip_delivery_point
AND _scan.serial_number = _piece_detail.serial_number
GROUP BY _piece_detail.job_id, _piece_detail.piece_id, _scan.zip, _scan.zip_4, _scan.zip_delivery_point, _scan.serial_number
) as t1 ON _piece_detail.job_id = t1.job_id AND _piece_detail.piece_id = t1.piece_id
SET _piece_detail.first_scan_date_time = t1.first_scan_date_time, _piece_detail.latest_scan_date_time = t1.latest_scan_date_time
WHERE _piece_detail.first_scan_date_time < t1.first_scan_date_time
OR _piece_detail.latest_scan_date_time > t1.latest_scan_date_time;
I thought that this may have been trying to load too much into memory at once and might not be using the indexes properly.
Then I thought that I might be able to avoid doing that huge joined subquery and add two leftjoin subqueries to get the min/max like so:
UPDATE _piece_detail
INNER JOIN _container_quantity
ON _piece_detail.cqt_database_id = _container_quantity.cqt_database_id
AND _piece_detail.job_id = _container_quantity.job_id
INNER JOIN _container_summary
ON _container_quantity.container_id = _container_summary.container_id
AND _container_summary.job_id = _container_quantity.job_id
INNER JOIN _mail_piece_unit
ON _container_quantity.mpu_id = _mail_piece_unit.mpu_id
AND _container_quantity.job_id = _mail_piece_unit.job_id
INNER JOIN _header
ON _header.job_id = _piece_detail.job_id
INNER JOIN mailing_groups
ON _mail_piece_unit.mpu_company = mailing_groups.mpu_company
LEFT JOIN _scan fs ON (fs.zip, fs.zip_4, fs.zip_delivery_point, fs.serial_number) = (
SELECT zip, zip_4, zip_delivery_point, serial_number
FROM _scan
WHERE zip = _piece_detail.zip
AND zip_4 = _piece_detail.zip_4
AND zip_delivery_point = _piece_detail.zip_delivery_point
AND serial_number = _piece_detail.serial_number
ORDER BY scan_date_time ASC
LIMIT 1
)
LEFT JOIN _scan ls ON (ls.zip, ls.zip_4, ls.zip_delivery_point, ls.serial_number) = (
SELECT zip, zip_4, zip_delivery_point, serial_number
FROM _scan
WHERE zip = _piece_detail.zip
AND zip_4 = _piece_detail.zip_4
AND zip_delivery_point = _piece_detail.zip_delivery_point
AND serial_number = _piece_detail.serial_number
ORDER BY scan_date_time DESC
LIMIT 1
)
SET _piece_detail.first_scan_date_time = fs.scan_date_time, _piece_detail.latest_scan_date_time = ls.scan_date_time
WHERE _piece_detail.first_scan_date_time < fs.scan_date_time
OR _piece_detail.latest_scan_date_time > ls.scan_date_time
These are the explains when I convert them to SELECT statements:
+----+-------------+---------------------+--------+----------------------------------------------------+---------------+---------+------------------------------------------------------------------------------------------------------------------------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------------+--------+----------------------------------------------------+---------------+---------+------------------------------------------------------------------------------------------------------------------------+--------+----------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 844161 | NULL |
| 1 | PRIMARY | _piece_detail | eq_ref | PRIMARY,first_scan_date_time,latest_scan_date_time | PRIMARY | 18 | t1.job_id,t1.piece_id | 1 | Using where |
| 2 | DERIVED | _header | index | PRIMARY | date_prepared | 3 | NULL | 87 | Using index; Using temporary; Using filesort |
| 2 | DERIVED | _piece_detail | ref | PRIMARY,cqt_database_id,zip | PRIMARY | 10 | odms._header.job_id | 9703 | NULL |
| 2 | DERIVED | _container_quantity | eq_ref | unique,mpu_id,job_id,job_id_container_quantity | unique | 14 | odms._header.job_id,odms._piece_detail.cqt_database_id | 1 | NULL |
| 2 | DERIVED | _mail_piece_unit | eq_ref | PRIMARY,company,job_id_mail_piece_unit | PRIMARY | 14 | odms._container_quantity.mpu_id,odms._header.job_id | 1 | Using where |
| 2 | DERIVED | mailing_groups | eq_ref | PRIMARY | PRIMARY | 27 | odms._mail_piece_unit.mpu_company | 1 | Using index |
| 2 | DERIVED | _container_summary | eq_ref | unique,container_id,job_id_container_summary | unique | 14 | odms._header.job_id,odms._container_quantity.container_id | 1 | Using index |
| 2 | DERIVED | _scan | ref | PRIMARY | PRIMARY | 28 | odms._piece_detail.zip,odms._piece_detail.zip_4,odms._piece_detail.zip_delivery_point,odms._piece_detail.serial_number | 1 | Using index |
+----+-------------+---------------------+--------+----------------------------------------------------+---------------+---------+------------------------------------------------------------------------------------------------------------------------+--------+----------------------------------------------+
+----+--------------------+---------------------+--------+--------------------------------------------------------------------+---------------+---------+------------------------------------------------------------------------------------------------------------------------+-----------+-----------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+---------------------+--------+--------------------------------------------------------------------+---------------+---------+------------------------------------------------------------------------------------------------------------------------+-----------+-----------------------------------------------------------------+
| 1 | PRIMARY | _header | index | PRIMARY | date_prepared | 3 | NULL | 87 | Using index |
| 1 | PRIMARY | _piece_detail | ref | PRIMARY,cqt_database_id,first_scan_date_time,latest_scan_date_time | PRIMARY | 10 | odms._header.job_id | 9703 | NULL |
| 1 | PRIMARY | _container_quantity | eq_ref | unique,mpu_id,job_id,job_id_container_quantity | unique | 14 | odms._header.job_id,odms._piece_detail.cqt_database_id | 1 | NULL |
| 1 | PRIMARY | _mail_piece_unit | eq_ref | PRIMARY,company,job_id_mail_piece_unit | PRIMARY | 14 | odms._container_quantity.mpu_id,odms._header.job_id | 1 | Using where |
| 1 | PRIMARY | mailing_groups | eq_ref | PRIMARY | PRIMARY | 27 | odms._mail_piece_unit.mpu_company | 1 | Using index |
| 1 | PRIMARY | _container_summary | eq_ref | unique,container_id,job_id_container_summary | unique | 14 | odms._header.job_id,odms._container_quantity.container_id | 1 | Using index |
| 1 | PRIMARY | fs | index | NULL | updated | 1 | NULL | 102462928 | Using where; Using index; Using join buffer (Block Nested Loop) |
| 1 | PRIMARY | ls | index | NULL | updated | 1 | NULL | 102462928 | Using where; Using index; Using join buffer (Block Nested Loop) |
| 3 | DEPENDENT SUBQUERY | _scan | ref | PRIMARY | PRIMARY | 28 | odms._piece_detail.zip,odms._piece_detail.zip_4,odms._piece_detail.zip_delivery_point,odms._piece_detail.serial_number | 1 | Using where; Using index; Using filesort |
| 2 | DEPENDENT SUBQUERY | _scan | ref | PRIMARY | PRIMARY | 28 | odms._piece_detail.zip,odms._piece_detail.zip_4,odms._piece_detail.zip_delivery_point,odms._piece_detail.serial_number | 1 | Using where; Using index; Using filesort |
+----+--------------------+---------------------+--------+--------------------------------------------------------------------+---------------+---------+------------------------------------------------------------------------------------------------------------------------+-----------+-----------------------------------------------------------------+
Now, looking at the explains generated by each, I really can't tell which is giving me the best bang for my buck. The first one shows fewer total rows when multiplying the rows column, but the second appears to execute a bit quicker.
Is there anything that I could do to achieve the same results while increasing performance through modifying the query structure?