How to speed up loop in MySQL, fill lines based on previous row value

Question

My database consists of individual job contracts. I am updating some information to improve the quality of the data. More precisely, I am updating the workers' residence codes. The following image shows an example of my database (a .csv version can be found here).

The variables are explained below.

id -----------> "Primary key" [indexed]
worker_id ----> "Id associated with each individual/worker" [indexed]
dt_start -----> "Starting date of the job contract"
dt_end -------> "End date of the job contract"
cod_res ------> "Old residence code"
cod_res_rev --> "New residence code"
id_lag -------> "Previous id, if the 'worker_id' is the same" [indexed]
id_lead ------> "Subsequent id, if the 'worker_id' is the same" [indexed]

As you can see, the column cod_res_rev is full of NULL values. This is because the reconstruction of cod_res_rev with the updated residence values was based solely on specific contracts (those for which the worker had an actual change of residence, although this detail is not essential to my question). Therefore, my goal is to fill each NULL value of cod_res_rev with the previous non-missing value, carrying it forward until the next non-empty value is reached, and to do this for each worker. The result should be something like this.

I attempted to achieve my goal through the following procedure.

-- The loop is performed based on the maximum number of entries per worker in the database identified through the table 'max_count'.
drop table if exists max_count;
create table max_count 
as select worker_id, count(*) n 
from ml_arm
group by worker_id;
alter table max_count add unique index (worker_id); 

DROP PROCEDURE IF EXISTS doiterate;
delimiter //

CREATE PROCEDURE doiterate()
BEGIN
  DECLARE total INT unsigned DEFAULT 0;
  WHILE total <= (select MAX(n) from max_count) DO

    update ml_arm a
    left outer join ml_arm b on a.id_lag = b.id
    set a.cod_res_rev =
        case
        when a.cod_res_rev is NULL and a.worker_id = b.worker_id and b.cod_res_rev is not NULL
        then b.cod_res_rev
        else a.cod_res_rev
        end;

    SET total = total + 1;
  END WHILE;
END//  

delimiter ;

CALL doiterate();

However, I do not believe this is the optimal way to update my table. In fact, my database is composed of about 25 million rows, and the value of select MAX(n) from max_count is about 4,000. I would appreciate any suggestions on faster approaches to update my table. I am using MySQL 8.0.22. Thank you in advance.
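For reference, the loop above always runs MAX(n) + 1 passes, and each pass joins over every row. A minimal sketch of a cheaper variant of the same idea follows; it only touches rows that are still NULL and stops as soon as a pass changes nothing. The procedure name fill_until_done and the variable changed are hypothetical, and this has not been timed on the full 25-million-row table.

```sql
DROP PROCEDURE IF EXISTS fill_until_done;  -- hypothetical name
DELIMITER //

CREATE PROCEDURE fill_until_done()
BEGIN
  -- Number of rows changed by the last pass; start above 0 to enter the loop.
  DECLARE changed INT DEFAULT 1;

  WHILE changed > 0 DO
    -- Copy the previous contract's code into rows that are still NULL.
    UPDATE ml_arm a
    JOIN ml_arm b ON a.id_lag = b.id
    SET a.cod_res_rev = b.cod_res_rev
    WHERE a.cod_res_rev IS NULL
      AND b.cod_res_rev IS NOT NULL
      AND a.worker_id = b.worker_id;

    -- ROW_COUNT() reports how many rows the UPDATE actually changed.
    SET changed = ROW_COUNT();
  END WHILE;
END//

DELIMITER ;

CALL fill_until_done();
```

The number of passes is then bounded by the longest run of consecutive NULL rows for any worker rather than by the largest number of contracts per worker.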

Finally, below is a command to create a sample table from my database with a handful of entries.

drop table if exists ml_arm;
create table ml_arm (
    id MEDIUMINT NOT NULL AUTO_INCREMENT,
    worker_id int,
    dt_start date,
    dt_end date,
    cod_res varchar(50),
    cod_res_rev varchar(50),
    id_lag int,
    id_lead int,
    PRIMARY KEY (id)
);

insert into 
    ml_arm(id, worker_id, dt_start, dt_end, cod_res, cod_res_rev, id_lag, id_lead)
values
    ('12', '20', '2014-05-02', '2014-07-08', '', 'I040', NULL, '13'),
    ('13', '20', '2017-01-14', '2017-01-31', '', NULL, '12', '14'),
    ('14', '20', '2017-11-06', '2017-12-15', 'I040', NULL, '13', NULL),
    ('20', '29', '2014-11-24', '2017-02-11', '', 'N.D.', NULL, NULL),
    ('21', '42', '2016-01-22', '2016-05-05', 'G582', 'G582', NULL, NULL),
    ('23', '45', '2013-08-07', '2014-04-06', 'G582', 'G582', NULL, '24'),
    ('24', '45', '2014-05-07', '2014-05-10', 'G582', NULL, '23', NULL),
    ('25', '48', '2012-08-11', '2012-08-31', 'G582', 'G582', NULL, '26'),
    ('26', '48', '2013-08-10', '2013-08-31', 'G582', NULL, '25', NULL),
    ('53', '71', '2016-12-01', '2017-05-31', '', 'N.D.', NULL, '54'),
    ('54', '71', '2017-06-01', '2020-05-29', '', NULL, '53', '55'),
    ('55', '71', '2020-06-01', '2099-01-01', '', NULL, '54', NULL)
    ;

Provide sample data as CREATE TABLE + INSERT INTO (2 users, 5-7 rows each, is enough). Specify your MySQL version. — Akina, Jul 25 '22 at 07:46
UDV usage (inline assignment) is deprecated. If you are on version 8+, I'd recommend using a lateral subquery instead. — Akina, Jul 25 '22 at 07:48
Thank you Akina. I am using MySQL 8.0.22. Furthermore, at the end of my question I have added sample data with CREATE TABLE and INSERT INTO. — Nicola Caravaggio, Jul 25 '22 at 08:08
Since you are using 8.0, have you experimented with `lag` and `LEAD(1) OVER ... PARTITION BY ...`? That would get rid of the id_lead/lag columns. But is it fast enough? — Rick James, Jul 26 '22 at 01:08
@RickJames, the column `cod_res` is actually useless; I just needed it to check differences, but it is unnecessary for the update/loop. — Nicola Caravaggio, Jul 26 '22 at 04:49
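The lateral-subquery route suggested in the comments could look roughly like the sketch below. It is a read-only preview, not an update, and it assumes that within each worker_id the id values increase in contract order (as they do in the sample data). The aliases p and cod_res_rev_filled are illustrative.

```sql
-- For each row, look up the closest earlier row of the same worker
-- that has a non-NULL cod_res_rev, and use it when the row's own value is NULL.
SELECT a.id,
       a.worker_id,
       COALESCE(a.cod_res_rev, p.cod_res_rev) AS cod_res_rev_filled
FROM ml_arm AS a
LEFT JOIN LATERAL (
    SELECT b.cod_res_rev
    FROM ml_arm AS b
    WHERE b.worker_id = a.worker_id
      AND b.id < a.id
      AND b.cod_res_rev IS NOT NULL
    ORDER BY b.id DESC
    LIMIT 1
) AS p ON TRUE
ORDER BY a.id;
```

On a large table this would likely need a composite index on (worker_id, id) to avoid rescanning each worker's rows; that index is a suggestion and is not part of the original schema.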

Nicola Caravaggio · Answer 1 · 2022-07-25T08:07:33.603

0

By following Gordon Linoff's answer on this thread it seems to work:

UPDATE ml_arm t1
  JOIN (
    SELECT id, @s:=IF(cod_res_rev IS NULL, @s, cod_res_rev) cod_res_rev
      FROM (SELECT * FROM ml_arm ORDER BY id) r,
    (SELECT @s:= NULL) t
  ) t2
  ON t1.id = t2.id
SET t1.cod_res_rev = t2.cod_res_rev;
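Since assigning user variables inside expressions is deprecated in MySQL 8.0 (as noted in the comments), a window-function version may be worth trying. The following is a minimal sketch, assuming that id order within a worker_id matches the id_lag chain. MySQL 8.0 does not support IGNORE NULLS, so the sketch counts non-NULL values to label each "fill group" and then takes the single non-NULL value of that group. The aliases g, grp, filled and t2 are illustrative.

```sql
UPDATE ml_arm AS t1
JOIN (
    SELECT id,
           -- Each group holds at most one non-NULL value (its first row),
           -- so MAX() simply returns that value for the whole group.
           MAX(cod_res_rev) OVER (PARTITION BY worker_id, grp) AS filled
    FROM (
        SELECT id,
               worker_id,
               cod_res_rev,
               -- Running count of non-NULL values per worker: it increases
               -- only on rows that carry a value, so following NULL rows
               -- share the same group number.
               COUNT(cod_res_rev) OVER (PARTITION BY worker_id ORDER BY id) AS grp
        FROM ml_arm
    ) AS g
) AS t2 ON t1.id = t2.id
SET t1.cod_res_rev = t2.filled
WHERE t1.cod_res_rev IS NULL;
```

Unlike the user-variable query, this partitions by worker_id, so a value is never carried over from one worker to the next. On the sample data it should fill ids 13 and 14 with I040, ids 24 and 26 with G582, and ids 54 and 55 with N.D.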

edited Jul 25 '22 at 08:07

answered Jul 25 '22 at 07:49

Nicola Caravaggio
