
I have multiple threads that save files on disk and put that information into the DB.

On the other side of the app I have multiple threads that read this information from the DB and process the mentioned files one at a time, sorted by file_id:

SELECT * FROM files_to_process ORDER BY file_id

What I've invented is to add a PROCESSING_STATUS column which has four statuses: NEW, PROCESSING, FAILED, SUCCESS.

Every worker is supposed to read ONLY one row from the DB (the lowest file_id with status NEW) and immediately update it to status PROCESSING, so the other workers won't process the same file.

But, something tells me that I might end up with some race condition.

Will transactions solve this problem?

Unfortunately I can't do the whole operation inside one transaction, since processing a file takes a long time and the connection pool would be exhausted, so I have to use two transactions in the following order:

  1. [In Transaction] Fetch row and update to status PROCESSING
  2. [No Transaction] Process File
  3. [In Transaction] Update final state to SUCCESS or FAILED depending on result
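The flow above can be sketched like this, with Python's sqlite3 standing in for the real database (table and column names as above; the re-check of `processing_status = 'NEW'` inside the UPDATE is a compare-and-swap guard that keeps two workers from claiming the same row):

```python
import sqlite3

def claim_next(conn):
    """Transaction 1: atomically flip the lowest NEW row to PROCESSING."""
    while True:
        row = conn.execute(
            "SELECT file_id FROM files_to_process "
            "WHERE processing_status = 'NEW' ORDER BY file_id LIMIT 1").fetchone()
        if row is None:
            return None  # nothing left to claim
        cur = conn.execute(
            "UPDATE files_to_process SET processing_status = 'PROCESSING' "
            "WHERE file_id = ? AND processing_status = 'NEW'", (row[0],))
        conn.commit()
        if cur.rowcount == 1:
            return row[0]  # we won the race for this row
        # another worker claimed it between our SELECT and UPDATE; retry

def finalize(conn, file_id, ok):
    """Transaction 2: record the outcome after the (long) processing step."""
    conn.execute(
        "UPDATE files_to_process SET processing_status = ? WHERE file_id = ?",
        ("SUCCESS" if ok else "FAILED", file_id))
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files_to_process "
             "(file_id INTEGER PRIMARY KEY, processing_status TEXT)")
conn.executemany("INSERT INTO files_to_process VALUES (?, 'NEW')",
                 [(1,), (2,), (3,)])
conn.commit()

fid = claim_next(conn)   # claims file_id 1
# ... process the file here, outside any transaction ...
finalize(conn, fid, ok=True)
```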
ashur
  • Maybe I'm way off, but could you just use a mutex on the NEW-row lookup, as it sounds like that's a very quick operation? – Pearl Jun 11 '20 at 11:35
  • @Pearl Could you elaborate your suggestion? I'm not sure I understand, though I have a basic idea of what a mutex is. Could you give a simplistic example? – ashur Jun 11 '20 at 11:38
  • When a thread needs to access the database for a new record and update it to PROCESSING, use a mutex on that code portion so only one thread does it at a time. As long as there is only one instance of your app running at a time, this should work? – Pearl Jun 11 '20 at 11:40

2 Answers


Quite annoyingly, UPDATE does not take a LIMIT in PostgreSQL.

You can do something like this:

UPDATE files_to_process SET processing_status = 'PROCESSING' WHERE file_id = (
    SELECT file_id FROM files_to_process
      WHERE processing_status = 'NEW'
      ORDER BY file_id LIMIT 1
      FOR UPDATE SKIP LOCKED
) RETURNING *;

With this formulation, there should be no race conditions. You would run this in a transaction by itself (or under autocommit, just run the statement and it will automatically form its own transaction).

But rather than using just 'PROCESSING', I would probably make it 'PROCESSING by machine worker7 PID 19345' or something like that. Otherwise, how will you know that processing failed if it fails in an unclean way? (That is the nice thing about doing it all in one transaction: failures roll back by themselves.)
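A sketch of that idea in the same sqlite3 stand-in (the `claimed_by`/`claimed_at` columns, the reaper query, and the timeout are all assumptions, not part of the original schema):

```python
import os
import socket
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE files_to_process (
                  file_id INTEGER PRIMARY KEY,
                  processing_status TEXT,
                  claimed_by TEXT,
                  claimed_at REAL)""")
conn.execute("INSERT INTO files_to_process (file_id, processing_status) "
             "VALUES (1, 'NEW')")

# e.g. 'worker7 PID 19345'
worker_id = f"{socket.gethostname()} PID {os.getpid()}"

# When claiming, record who took the row and when.
conn.execute(
    "UPDATE files_to_process "
    "SET processing_status = 'PROCESSING', claimed_by = ?, claimed_at = ? "
    "WHERE file_id = 1 AND processing_status = 'NEW'",
    (worker_id, time.time()))
conn.commit()

# A periodic "reaper" can then requeue rows stuck in PROCESSING,
# e.g. because the claiming worker crashed without updating the status.
STALE_AFTER = 3600  # seconds; an assumed timeout, tune to your workload
conn.execute(
    "UPDATE files_to_process "
    "SET processing_status = 'NEW', claimed_by = NULL, claimed_at = NULL "
    "WHERE processing_status = 'PROCESSING' AND claimed_at < ?",
    (time.time() - STALE_AFTER,))
conn.commit()
```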

Unfortunately I can't do the whole operation inside one transaction, since processing a file takes a long time and the connection pool would be exhausted

But you should never have more outstanding transactions than you have CPUs available to do the work. Unless you have a very large compute farm, you should be able to make the pool large enough. The bigger problem with the single-transaction approach is that you have no visibility into what is happening: the status change isn't visible to other sessions until the transaction commits.

For the two transaction approach, for performance you will probably want to make a partial index:

CREATE INDEX ON files_to_process (file_id) WHERE processing_status = 'NEW';

Otherwise you will have to dig through all of the completed ones with low file_id to find the next NEW one, and eventually that will get slow. You might also need to VACUUM the table more aggressively than the default.
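To see the effect, here is a small sqlite3 sketch of the same partial index (sqlite3 standing in for Postgres; the index name `files_new_idx` is made up, while the Postgres statement above lets the server auto-generate one):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files_to_process (file_id BIGINT, processing_status TEXT)")
# Partial index: only NEW rows are indexed, so it stays small no matter how
# many SUCCESS/FAILED rows pile up below the next NEW file_id.
conn.execute("CREATE INDEX files_new_idx ON files_to_process (file_id) "
             "WHERE processing_status = 'NEW'")
conn.executemany("INSERT INTO files_to_process VALUES (?, ?)",
                 [(i, "SUCCESS") for i in range(1, 1000)] + [(1000, "NEW")])
conn.commit()

# The claim query only has to look at the indexed (NEW) rows,
# not at the 999 completed ones with lower file_id.
next_new = conn.execute(
    "SELECT file_id FROM files_to_process "
    "WHERE processing_status = 'NEW' ORDER BY file_id LIMIT 1").fetchone()
```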

jjanes
  • I think you solution works, however I had to modify it a little, since I needed to make a join to get field from another table. Do you think this should be fine? Sorry for formatting. `UPDATE file.thumbnails SET status='NEW' FROM (SELECT ft.id, ff.stored_path FROM file.files ff JOIN file.thumbnails ft ON ff.id = ft.file_id WHERE ft.status = 'NEW' ORDER BY ff.id FOR UPDATE SKIP LOCKED LIMIT 1) sub WHERE file_id = sub.id RETURNING sub.*;` – ashur Jun 12 '20 at 12:28
  • Regarding your suggestion: `But you should never have more outstanding transactions than you have CPUs available to do work.` You suggest that it is acceptable to have about 3 to 5 workers with long running operations in transaction and other to perform and serve requests from front-end? – ashur Jun 12 '20 at 12:49
  • Whether 3 to 5 workers is a problem would depend on how long running they are (minutes? hours? days?), and what the other non-queue work is. If they pose a problem, the problem would be with vacuuming, not with the connection pool. If there are only 3 to 5, I wouldn't use a connection pool for them in the first place. Either not use one at all, or bypass it for the queue workers. – jjanes Jun 12 '20 at 13:32

Try a mutex, simplistic example:

import java.util.concurrent.Semaphore;

// One permit = mutual exclusion; share this one instance across all worker threads.
final Semaphore mutex = new Semaphore(1);

try {
  mutex.acquire();
  try {
    // fetch the next NEW record and update it to PROCESSING
  } finally {
    mutex.release();
  }
} catch (InterruptedException ie) {
  Thread.currentThread().interrupt(); // restore the interrupt flag
}

Depending on your code you may lock it various ways, see: Is there a Mutex in Java?

EDIT:

Sorry, I thought this was a C++ question; this is the Java version.

Pearl