
My question is not specific; it is rather abstract. I often deal with background DB processing in my projects, but I still have no good strategy for doing it properly.

For example, we have a table with a lot of records, several hundred thousand or even millions. A background daemon constantly modifies the data in this table: it updates and inserts records based on data received from outside. To save time I use multi-row INSERTs, because I can't even imagine how long updating the records one by one would take. Also, since the background process runs in multiple threads, I am forced to take a WRITE LOCK on that table. If I don't, I receive a lot of deadlock errors, because each multi-insert locks gaps and the auto-increment in the table, so the inserts cannot execute independently. The WRITE LOCK puts all queries into a queue and they are executed sequentially; the next query waits until the previous one completes. Maybe this is my first mistake, but I have found no other effective way to run multi-inserts from multiple threads. A simple schema of the daemon looks like this:

MAIN PROCESS
-- WORKER 1: Receiving data over HTTP --> INSERT table VALUES (..., ..., ...)
-- WORKER 2: Receiving data over HTTP --> INSERT table VALUES (..., ..., ...)
-- WORKER 3: Receiving data over HTTP --> INSERT table VALUES (..., ..., ...)
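
Roughly, each worker currently does something like this (table and column names are just placeholders); the explicit table lock is what serializes all the writers, and the readers as well:

```sql
LOCK TABLES mytable WRITE;          -- blocks readers too, not just other writers
INSERT INTO mytable (col1, col2, col3) VALUES
    ('a1', 'a2', 'a3'),
    ('b1', 'b2', 'b3'),
    ('c1', 'c2', 'c3');
UNLOCK TABLES;
```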

On the other side we have a user interface that displays data from the table; it performs SELECT queries against it. But as I said, while the background process is running the WRITE lock is active, which means READ operations also have to wait until the lock is released. To the user it looks like the whole server has hung.

That is why I am writing this question: are there any strategies for modifying large amounts of data while still being able to read it?

WindBridges
  • The workers for data ingest insert **without** locks into a worktable. In your info above, those 3 workers independently do those inserts, either onesy-twosey or in bulk. An event ([001](http://stackoverflow.com/a/37901661/1816093), [Evt Overview](http://stackoverflow.com/a/32508935), [Debugging/Performance monitoring](http://stackoverflow.com/a/38022108)) is crafted (inserts, inserts with a join pattern, insert on duplicate key update, you name it) ---> into the real table. As the rows in the worktable are processed, they are flagged as done. Cleaned up at the end of the event. – Drew Jul 11 '16 at 20:43
  • So the Event or Events run on the schedule of your choosing -- no cron, etc. They have the data needed. In theory the whole locking aspect is either eliminated or seriously reduced in scope and impact. – Drew Jul 11 '16 at 20:47
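
For reference, a minimal sketch of the worktable-plus-event idea from these comments, with made-up names: a `staging` worktable with an auto-increment `id` and a `done` flag, and a real `main_table` with a unique key on `ext_key`:

```sql
-- Requires the scheduler:  SET GLOBAL event_scheduler = ON;
DELIMITER $$
CREATE EVENT ev_flush_staging
ON SCHEDULE EVERY 1 MINUTE
DO
BEGIN
    -- Work only on rows that existed when the event started, so rows
    -- arriving mid-run are left for the next run.
    SET @upto = (SELECT COALESCE(MAX(id), 0) FROM staging WHERE done = 0);

    -- Move pending rows into the real table, resolving duplicates.
    INSERT INTO main_table (ext_key, col1, col2)
        SELECT ext_key, col1, col2
        FROM   staging
        WHERE  done = 0 AND id <= @upto
    ON DUPLICATE KEY UPDATE col1 = VALUES(col1), col2 = VALUES(col2);

    -- Flag the processed rows; clean them up at the end of the event
    -- (or DELETE them here instead).
    UPDATE staging SET done = 1 WHERE done = 0 AND id <= @upto;
END$$
DELIMITER ;
```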

1 Answer


Instead of WRITE LOCK, let's minimize the deadlocks and deal with those that still happen.

  • Sort the data that you are batching (multi-inserting or updating), so that concurrent batches touch rows and gaps in a consistent order (see the sketch after this list).
  • Don't batch "too many". Beyond about 100 rows, a batch insert runs into diminishing returns; that is, it is not worth batching 1000 at a time. If 100 still gives you too many deadlocks, pick a lower number.
  • When a deadlock happens, replay the transaction.
  • Collect statistics. How many rows, how many transactions, how many deadlocks, etc. These metrics will help you tune the batching.
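
A minimal sketch of such a batch, with made-up table and column names (`mytable`, `ext_key`, `col1`, `col2`); the important parts are the pre-sorted rows, the modest batch size, and replaying on error 1213:

```sql
-- Rows are pre-sorted by the unique key (ext_key) so that concurrent
-- batches acquire their index/gap locks in the same order.
START TRANSACTION;
INSERT INTO mytable (ext_key, col1, col2) VALUES
    (1001, 'a', 'x'),
    (1005, 'b', 'y'),
    (1009, 'c', 'z')              -- keep batches to roughly 100 rows
ON DUPLICATE KEY UPDATE col1 = VALUES(col1), col2 = VALUES(col2);
COMMIT;
-- If the statement fails with error 1213 (ER_LOCK_DEADLOCK), ROLLBACK,
-- replay the whole transaction from the application, and count the
-- retry in your statistics.
```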

Staging table. Another approach is to funnel things through a staging table (worktable) to a single processor; multiple sources can be feeding it. That is, have many connections inserting into a single table with no indexes and as little overhead as possible. Have a process that is working on another table; flip-flop the tables when finished. See high speed ingestion.

Note that normalization and other messy processes can be handled while the data is in the staging table, thereby not bothering the main table and the reads.
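
A minimal sketch of the flip-flop, assuming two hypothetical worktables `staging` (the one the workers insert into) and `staging_work` (the one the single processor reads), plus a real `main_table`:

```sql
-- 1. Atomically swap the two tables; the workers immediately start
--    filling a fresh, empty `staging`.  (staging_work must already be
--    empty here, i.e. truncated at the end of the previous cycle.)
RENAME TABLE staging      TO staging_tmp,
             staging_work TO staging,
             staging_tmp  TO staging_work;

-- 2. Normalize / transform while the data sits in staging_work, then
--    move it into the real table in one pass.
INSERT INTO main_table (ext_key, col1, col2)
    SELECT ext_key, col1, col2
    FROM   staging_work
ON DUPLICATE KEY UPDATE col1 = VALUES(col1), col2 = VALUES(col2);

-- 3. Empty the processed table so it is ready for the next swap.
TRUNCATE TABLE staging_work;
```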

What I just described is similar to what @Drew mentioned in the comments. However, I suggest continually processing and flip-flopping -- this avoids the messiness of deletes, etc.

On the read side, make sure that the queries are well optimized.

Rick James