My database is composed by individual job contracts. I am updating some information to enhance the quality of the data. More precisely, I am updating information regarding workers' residence codes. In the following image I am showing an example of my database in the following image (the .csv version could be found here).
While variables are explained here below.
id -----------> "Primary key" [indexed]
worker_id ----> "Id associated ot each individual/worker" [indexed]
dt_start -----> "Starting date of the job contract"
dt_end -------> "End date of the job contract"
cod_res ------> "Old residence code"
cod_res_rev --> "New residence code"
id_lag -------> "Previous id, if the 'worker_id' is the same" [indexed]
id_lead ------> "Subsequent id, it the 'worker_id' is the same" [indexed]
As you can notice, the column cod_res_rev
is characterized is full of NULL
values. This is because the reconstruction of the variable cod_res_rev
with the updated residence values it was based solely on specific contracts (those for which the worker had had an actual change of residence - but this is redundant for the purposes of my question). Therefore, my goal is to fill each NULL
value of the column cod_res_rev
with the previous one, if not missing, until the next non-empty value is reached and continue like this for each worker. The result should be something like this.
I attempted to achieve my goal through the following procedure.
-- The loop is performed based on the maximum number of entries per worker in the database identified through the table 'max_count'.
drop table if exists max_count;
create table max_count
as select worker_id, count(*) n
from ml_arm
group by worker_id;
alter table max_count add unique index (worker_id);
DROP PROCEDURE IF EXISTS doiterate;
delimiter //
CREATE PROCEDURE doiterate()
BEGIN
DECLARE total INT unsigned DEFAULT 0;
WHILE total <= (select MAX(n) from max_count) DO
update ml_arm a
left outer join ml_arm b on a.id_lag = b.id
set a.cod_res_rev =
case
when a.cod_res_rev is NULL and a.worker_id = b.worker_id and b.cod_res_rev is not NULL
then b.cod_res_rev
else a.cod_res_rev
end;
SET total = total + 1;
END WHILE;
END//
delimiter ;
CALL doiterate();
However, I do not believe this is the optimal way to update my table. In fact, by database if composed by about 25 million of rows and the value from select MAX(n) from max_count
is about 4,000. I kindly ask you for any suggestion on faster approaches to update my table. I am using MySQL 8.0.22. Thank you in advance.
Eventually, here below there is a command to create a sample table of my database with a bunch of entries.
drop table if exists ml_arm;
create table ml_arm (
id MEDIUMINT NOT NULL AUTO_INCREMENT,
worker_id int,
dt_start date,
dt_end date,
cod_res varchar(50),
cod_res_rev varchar(50),
id_lag int,
id_lead int,
PRIMARY KEY (id)
);
insert into
ml_arm(id, worker_id, dt_start, dt_end, cod_res, cod_res_rev, id_lag, id_lead)
values
('12', '20', '2014-05-02', '2014-07-08', '', 'I040', NULL, '13'),
('13', '20', '2017-01-14', '2017-01-31', '', NULL, '12', '14'),
('14', '20', '2017-11-06', '2017-12-15', 'I040', NULL, '13', NULL),
('20', '29', '2014-11-24', '2017-02-11', '', 'N.D.', NULL, NULL),
('21', '42', '2016-01-22', '2016-05-05', 'G582', 'G582', NULL, NULL),
('23', '45', '2013-08-07', '2014-04-06', 'G582', 'G582', NULL, '24'),
('24', '45', '2014-05-07', '2014-05-10', 'G582', NULL, '23', NULL),
('25', '48', '2012-08-11', '2012-08-31', 'G582', 'G582', NULL, '26'),
('26', '48', '2013-08-10', '2013-08-31', 'G582', NULL, '25', NULL),
('53', '71', '2016-12-01', '2017-05-31', '', 'N.D.', NULL, '54'),
('54', '71', '2017-06-01', '2020-05-29', '', NULL, '53', '55'),
('55', '71', '2020-06-01', '2099-01-01', '', NULL, '54', NULL)
;