
I have this table `documents` based on a client's data set. The only key the client provides is documentId, and it's non-unique (rowId is mine, not the client's).

This table also shows which documents have been processed and which have not, plus some additional data. isProcessed is a generated column (`isProcessed BOOL AS (processed IS NOT NULL)`), added for indexing purposes.
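
For reference, a minimal sketch of the schema as described (column types and the index are my guesses; the authoritative schema is in the fiddle linked near the end of the question):

CREATE TABLE documents (
    rowId       INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    documentId  INT UNSIGNED NOT NULL,
    received    DATETIME NOT NULL,
    processed   DATETIME NULL,
    user        VARCHAR(32) NULL,
    -- virtual generated column; InnoDB allows a secondary index on it
    isProcessed BOOL AS (processed IS NOT NULL),
    KEY idx_documentId_isProcessed (documentId, isProcessed)
);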

SELECT * FROM documents;

| rowId | documentId | received            | processed           | user  | isProcessed |
| ----- | ---------- | ------------------- | ------------------- | ----- | ----------- |
| 1     | 10020      | 2020-11-27 15:01:00 | 2020-11-27 17:17:17 | user1 | 1           |
| 2     | 10020      | 2020-11-27 15:02:00 | 2020-11-27 17:17:17 | user1 | 1           |
| 3     | 10021      | 2020-11-27 15:03:00 | 2020-11-27 17:17:17 | user1 | 1           |
| 4     | 10022      | 2020-11-27 15:04:00 | 2020-11-27 17:17:17 | user1 | 1           |
| 5     | 10022      | 2020-11-27 15:05:00 | 2020-11-27 17:17:17 | user1 | 1           |
| 6     | 10022      | 2020-11-27 15:06:00 |                     |       | 0           |
| 7     | 10022      | 2020-11-27 15:07:00 |                     |       | 0           |
| 8     | 10022      | 2020-11-27 15:08:00 |                     |       | 0           |
| 9     | 10023      | 2020-11-27 15:09:00 |                     |       | 0           |
| 10    | 10024      | 2020-11-27 15:10:00 |                     |       | 0           |
| 11    | 10024      | 2020-11-27 15:11:00 |                     |       | 0           |
| 12    | 10025      | 2020-11-27 15:12:00 |                     |       | 0           |
| 13    | 10025      | 2020-11-27 15:13:00 |                     |       | 0           |
| 14    | 10025      | 2020-11-27 15:14:00 |                     |       | 0           |
| 15    | 10025      | 2020-11-27 15:15:00 |                     |       | 0           |

Then, when employees process some documents, they pass the documentIds into an app, which generates JSON like this:

{
  "user": "user2",
  "documents": [
      10022,
      10022,
      10023,
      10024,
      10025,
      10025,
      10025
  ]
}

Now I need to figure out how to:

a) update `documents`, setting processed and user on the first matching unprocessed rows for each listed documentId, but no more than that!

That's 2 x 10022 (rows 6, 7), 1 x 10023 (row 9), 1 x 10024 (row 10), 3 x 10025 (rows 12, 13, 14).

What I've got so far is:

SET @processedTS = NOW();

UPDATE documents SET user = 'user2', processed = @processedTS WHERE documentId = 10022 AND !isProcessed ORDER BY rowId LIMIT 2;
UPDATE documents SET user = 'user2', processed = @processedTS WHERE documentId = 10023 AND !isProcessed ORDER BY rowId LIMIT 1;
UPDATE documents SET user = 'user2', processed = @processedTS WHERE documentId = 10024 AND !isProcessed ORDER BY rowId LIMIT 1;
UPDATE documents SET user = 'user2', processed = @processedTS WHERE documentId = 10025 AND !isProcessed ORDER BY rowId LIMIT 3;

Is there a smarter way to approach this than generate multiple update statements?

b) select the updated documents to get their additional information and make a results file to send to the client

SELECT * FROM documents WHERE processed = @processedTS;

Is there a smarter way to select the updated data?

Expected results

a) table documents after updates

| rowId | documentId | received            | processed           | user  | isProcessed |
| ----- | ---------- | ------------------- | ------------------- | ----- | ----------- |
| 1     | 10020      | 2020-11-27 15:01:00 | 2020-11-27 17:17:17 | user1 | 1           |
| 2     | 10020      | 2020-11-27 15:02:00 | 2020-11-27 17:17:17 | user1 | 1           |
| 3     | 10021      | 2020-11-27 15:03:00 | 2020-11-27 17:17:17 | user1 | 1           |
| 4     | 10022      | 2020-11-27 15:04:00 | 2020-11-27 17:17:17 | user1 | 1           |
| 5     | 10022      | 2020-11-27 15:05:00 | 2020-11-27 17:17:17 | user1 | 1           |
| 6     | 10022      | 2020-11-27 15:06:00 | 2020-11-27 19:34:58 | user2 | 1           |
| 7     | 10022      | 2020-11-27 15:07:00 | 2020-11-27 19:34:58 | user2 | 1           |
| 8     | 10022      | 2020-11-27 15:08:00 |                     |       | 0           |
| 9     | 10023      | 2020-11-27 15:09:00 | 2020-11-27 19:34:58 | user2 | 1           |
| 10    | 10024      | 2020-11-27 15:10:00 | 2020-11-27 19:34:58 | user2 | 1           |
| 11    | 10024      | 2020-11-27 15:11:00 |                     |       | 0           |
| 12    | 10025      | 2020-11-27 15:12:00 | 2020-11-27 19:34:58 | user2 | 1           |
| 13    | 10025      | 2020-11-27 15:13:00 | 2020-11-27 19:34:58 | user2 | 1           |
| 14    | 10025      | 2020-11-27 15:14:00 | 2020-11-27 19:34:58 | user2 | 1           |
| 15    | 10025      | 2020-11-27 15:15:00 |                     |       | 0           |

b) results file

[
    { "documentId": 10022, "received": "2020-11-27 15:06:00", "processed": "2020-11-27 19:34:58", "user": "user2"},
    { "documentId": 10022, "received": "2020-11-27 15:07:00", "processed": "2020-11-27 19:34:58", "user": "user2"},
    { "documentId": 10023, "received": "2020-11-27 15:09:00", "processed": "2020-11-27 19:34:58", "user": "user2"},
    { "documentId": 10024, "received": "2020-11-27 15:10:00", "processed": "2020-11-27 19:34:58", "user": "user2"},
    { "documentId": 10025, "received": "2020-11-27 15:12:00", "processed": "2020-11-27 19:34:58", "user": "user2"},
    { "documentId": 10025, "received": "2020-11-27 15:13:00", "processed": "2020-11-27 19:34:58", "user": "user2"},
    { "documentId": 10025, "received": "2020-11-27 15:14:00", "processed": "2020-11-27 19:34:58", "user": "user2"}
]

(I know how to make that JSON, I'm looking for a smart way to select the updated data)

My question is...

In production, documents will have some 300k+ rows, and each save run will affect hundreds of rows.

Is there a smarter way to update in bulk & select updated data?

Ideally, I'm looking for a way to pass [10022, 10022, 10023, 10024, 10025, 10025, 10025] into the update in a way that respects FIFO order and matches only as many rows per documentId as are listed, no more. Running hundreds of UPDATE statements does not seem ideal.

I'm using MySQL 8.0. The app is written in JavaScript (TypeScript) and runs on Node.js v12. I'm primarily looking for a smarter MySQL solution, but for suggestions with some logic in the app layer, the answer doesn't have to be in JS; any pseudocode is appreciated too.

Fiddle w/ table schema here: https://www.db-fiddle.com/f/ecYLwEGpwZpPtec6VwzMBF/1

Thank you very much.

ΔO 'delta zero'
  • Answer to part a: `SELECT * FROM documents WHERE isProcessed=0`?, so `UPDATE documents SET processed=..., user=... WHERE isProcessed=0 AND ...` – Luuk Nov 27 '20 at 20:05
  • Not ideal, since I only need 2 rows of documentId=10022 (the first 2 where isProcessed=0), but this would update all 3 occurrences. And so on... – ΔO 'delta zero' Nov 27 '20 at 20:14
  • That's why I said `and ...`, to filter the records which need to be updated. – Luuk Nov 27 '20 at 20:37
  • Any idea how to filter them the way I need on the db layer? I could `SELECT * FROM documents WHERE isProcessed=0` and filter them in the app layer, but that means fetching hundreds of thousands of to-be-filtered-out records from the db to the app. – ΔO 'delta zero' Nov 27 '20 at 20:44
  • The filter should get its input from the JSON you gave, but my knowledge of JSON is too limited to help you with that. – Luuk Nov 28 '20 at 08:32
  • Yeah, that's what I'm trying to figure out... – ΔO 'delta zero' Nov 28 '20 at 19:09
  • I think the single quotes should be double quotes, see [this fiddle](https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=0982e7515effbb37b746fba3f8150c07) – Luuk Nov 28 '20 at 20:07
  • Indeed, you need double quotes, see: https://stackoverflow.com/questions/14355655/jquery-parsejson-single-quote-vs-double-quote – Luuk Nov 28 '20 at 20:14
  • Fiddle looks good. Any idea how to map it to the `documents` table to match only that certain amount of rows in FIFO order? (Double quotes all the way; I've posted a JS object, not proper JSON, sorry about that) – ΔO 'delta zero' Nov 28 '20 at 20:21

2 Answers


A way to find the records in table documents which need an update:

set @json='{ "user": "user2", "documents": [ 10022, 10022, 10023, 10024, 10025, 10025, 10025 ]}';

WITH RECURSIVE json as (
   -- index generator 0..10; must cover the documents[] array, e.g. use
   -- WHERE i < JSON_LENGTH(@json, '$.documents') - 1 in production
   SELECT 0 as i
   UNION ALL
   SELECT i+1 FROM json WHERE i<10
),
to_update as (
   -- how many rows each documentId should claim, per the JSON
   SELECT json_extract(@json,CONCAT("$.documents[",i,"]")) as document, count(*) as count
   FROM json
   WHERE not json_extract(@json,CONCAT("$.documents[",i,"]")) is null
   group by document
)
SELECT * FROM (
   -- number the unprocessed rows per documentId in FIFO order (by rowId)
   SELECT documents.*, row_number() over (partition by documents.documentId order by documents.rowId) nr, to_update.count count
   FROM documents
   INNER JOIN to_update ON to_update.document = documents.documentId
   WHERE NOT documents.isProcessed
   ) x WHERE nr<=count
;

This will output:

+-------+------------+---------------------+-----------+------+-------------+----+-------+
| rowId | documentId | received            | processed | user | isProcessed | nr | count |
+-------+------------+---------------------+-----------+------+-------------+----+-------+
|     6 |      10022 | 2020-11-27 15:06:00 | NULL      | NULL |           0 |  1 |     2 |
|     7 |      10022 | 2020-11-27 15:07:00 | NULL      | NULL |           0 |  2 |     2 |
|     9 |      10023 | 2020-11-27 15:09:00 | NULL      | NULL |           0 |  1 |     1 |
|    10 |      10024 | 2020-11-27 15:10:00 | NULL      | NULL |           0 |  1 |     1 |
|    12 |      10025 | 2020-11-27 15:12:00 | NULL      | NULL |           0 |  1 |     3 |
|    13 |      10025 | 2020-11-27 15:13:00 | NULL      | NULL |           0 |  2 |     3 |
|    14 |      10025 | 2020-11-27 15:14:00 | NULL      | NULL |           0 |  3 |     3 |
+-------+------------+---------------------+-----------+------+-------------+----+-------+
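
To fold the update itself into one statement, a possible (untested) follow-up sketch: join documents against the same selection wrapped in a derived table; MySQL materializes the derived table, which avoids the "can't specify target table for update in FROM clause" error (1093). For brevity it counts the JSON array with JSON_TABLE (the simplification mentioned in the comments below) so the whole thing is self-contained:

UPDATE documents d
INNER JOIN (
   SELECT rowId FROM (
      -- same FIFO numbering as above, per documentId
      SELECT doc.rowId,
             row_number() over (partition by doc.documentId order by doc.rowId) nr,
             tu.count count
      FROM documents doc
      INNER JOIN (
         -- JSON_TABLE unnests $.documents[] into one row per element
         SELECT jt.document, count(*) as count
         FROM JSON_TABLE(@json, '$.documents[*]' COLUMNS (document INT PATH '$')) jt
         GROUP BY jt.document
      ) tu ON tu.document = doc.documentId
      WHERE NOT doc.isProcessed
   ) numbered
   WHERE nr <= count
) hit ON hit.rowId = d.rowId
SET d.processed = NOW(), d.user = JSON_UNQUOTE(JSON_EXTRACT(@json, '$.user'));

Selecting the rows for part b afterwards would still need a handle like @processedTS, which is one reason the select-then-update approach in the other answer tracks the affected rowIds instead.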
Luuk
  • You're a genius, man! Thank you so much for your time! – ΔO 'delta zero' Nov 29 '20 at 15:38
  • PS: I've simplified the query using JSON_TABLE instead of a recursive expression, see [the fiddle](https://www.db-fiddle.com/f/hLZCuvqUNL4GzGeKvkmRX6/1). That `row_number() over (partition...` is simply genius :-) – ΔO 'delta zero' Nov 29 '20 at 16:00
  • @ΔO'deltazero': I am just learning/experimenting with those JSON functions in MySQL. – Luuk Nov 29 '20 at 16:06

Meanwhile, I've found a better, though still not entirely satisfying, way to approach this. Instead of "update then select", I "select then update":

(SELECT rowId, documentId, received FROM documents WHERE documentId = 10022 AND !isProcessed ORDER BY rowId LIMIT 2)
  UNION
(SELECT rowId, documentId, received FROM documents WHERE documentId = 10023 AND !isProcessed ORDER BY rowId LIMIT 1)
  UNION
(SELECT rowId, documentId, received FROM documents WHERE documentId = 10024 AND !isProcessed ORDER BY rowId LIMIT 1)
  UNION
(SELECT rowId, documentId, received FROM documents WHERE documentId = 10025 AND !isProcessed ORDER BY rowId LIMIT 3);

Then, in the app, I let `rowIds = result.map(r => r.rowId)` and update those rowIds:

UPDATE documents SET processed = NOW(), user = 'user2' WHERE rowId IN (6, 7, 9, 10, 12, 13, 14);

Fiddle here: https://www.db-fiddle.com/f/uKPUdc1HoJMGej8JdhW3Tp/2
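
A minimal TypeScript sketch of this select-then-update flow, assuming the mysql2/promise driver (the function name and connection handling are mine, for illustration):

// Select-then-update: one UNION ALL of LIMITed SELECTs to claim rowIds
// in FIFO order, then a single UPDATE keyed on those concrete rowIds.
import mysql from 'mysql2/promise';

interface Claim {
  user: string;
  documents: number[];
}

async function processDocuments(conn: mysql.Connection, claim: Claim): Promise<number[]> {
  // Per-documentId quota, e.g. [10022, 10022, 10023] -> { 10022: 2, 10023: 1 }
  const quota = new Map<number, number>();
  for (const id of claim.documents) quota.set(id, (quota.get(id) ?? 0) + 1);

  // One parenthesized, LIMITed SELECT per distinct documentId
  const selects: string[] = [];
  const params: number[] = [];
  for (const [docId, n] of quota) {
    selects.push('(SELECT rowId FROM documents WHERE documentId = ? AND NOT isProcessed ORDER BY rowId LIMIT ?)');
    params.push(docId, n);
  }

  const [rows] = await conn.query(selects.join(' UNION ALL '), params);
  const rowIds = (rows as Array<{ rowId: number }>).map((r) => r.rowId);
  if (rowIds.length === 0) return rowIds;

  // mysql2 expands an array bound to IN (?) into a comma-separated list
  await conn.query(
    'UPDATE documents SET processed = NOW(), user = ? WHERE rowId IN (?)',
    [claim.user, rowIds]
  );
  return rowIds; // re-select exactly these rows to build the client results file
}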

This works better because the relation between the select & the update is based on actual row ids instead of a timestamp. And I suppose running a UNION of a few hundred SELECTs will be more efficient than a few hundred UPDATEs.

However, a generated query of a couple hundred SELECTs in a UNION still doesn't appeal to me very much.

Still looking for better options...

ΔO 'delta zero'