2

I have a table "data" which holds around 100,000,000 records. I have added a new column to it "batch_id" (Integer).

On the application layer, I'm updating the batch_id in batches of 10,000 records for each of the 100,000,000 records (the batch_id is always the same for 10k).

I'm doing something like this (application layer pseudo code):

loop {
  $batch_id = $batch_id + 1;
  mysql.query("UPDATE data SET batch_id='$batch_id' WHERE batch_id IS NULL LIMIT 10000");
}

I have an index on the batch_id column.

In the beginning, this update statement took ~30 seconds. I'm now halfway through the Table and it's getting slower and slower. At the moment the same statement takes around 10 minutes(!). It reached a point where this is no longer feasible as it would take over a month to update the whole table at the current speed.

What could I do to speed it up, and why is MySQL Getting slower towards the end of the table? Could an index on the primary key help?

J.Doe
  • 61
  • 3

2 Answers2

2

Is the primary key automatically indexed in MySQL? The answer is Yes

So instead one index for batch_id will help.

The problem is without index the engine do a full table scan. At first is easy find 10k with null values, but when more and more records are updated the engine have to scan much more to find those nulls.

But should be easier create batch_id as an autonumeric column

OTHER OPTION: Create a new table and then add the index and replace old table.

CREATE newTable as 
    SELECT IF(@newID := @newID + 1,
              @newID DIV 10000,
              @newID DIV 10000) as batch_id,
           <other fields>
    FROM YourTable         
    CROSS JOIN (SELECT @newID :=0 ) as v

Insert auto increment primary key to existing table

Juan Carlos Oropeza
  • 47,252
  • 12
  • 78
  • 118
  • Just to clear up potential confusion: 1) I already have an index on batch_id 2) batch_id is not the primary key (id is the primary key) Regarding "could an index on the primary key help?", I was wondering if maybe the query "falls back" on the id/primary key (according to my knowledge mysql still has to scan the whole table on primary key indexes as opposed to normal indexes) – J.Doe Aug 14 '17 at 15:56
  • Then Im not sure what is the problem. Maybe is updating the index is taking long time. You should compare [ANALYZE](http://dba.stackexchange.com/questions/15371/how-do-i-get-the-execution-plan-for-a-view) from early fast updates and the later slower updates. Other option is create a new table and then replace the old one. – Juan Carlos Oropeza Aug 14 '17 at 16:18
0

Do you have a monotonically increasing id in the table? And all rows for a "batch" have 'consecutive' ids? Then don't add batch_id to the table, instead, create another table Batches with one row per batch: (batch_id (PK), id_start, id_end, start_time, end_time, etc).

If you stick to exact chunks of 10K, then don't even materialize batch_id. Instead, compute it from id DIV 10000 whenever you need it.

If you want to discuss this further, please provide SHOW CREATE TABLE for the existing table, and explain what you will be doing with the "batches".

To answer your question about "slow near the end": It is having to scan farther and farther in the table to find the NULLs. You would be better to walk through the table once, fiddling with each 10K chunk as you go. Do this using the PRIMARY KEY, whatever it is. (That is, even if it is not AUTO_INCREMENT.) More Details .

Rick James
  • 135,179
  • 13
  • 127
  • 222