
Our application has a table called cargo_items, which can be seen as a kind of queue for items that are processed later. Initially there was a single job which took 3000 entries and processed them one after another. Later on, somebody decided to start 3 more instances of the same job. What happened is quite obvious: many of the items were processed twice.

My task is to make these processes work correctly when many instances are running at the same time. The solution I am going for now is to mark 3000 entries in the database with a job_id, later fetch exactly those entries, and process them in isolation from the other processes.

My current approach for flagging these rows is the following:

UPDATE cargo_item item
SET job_id = 'SOME_UUID', job_ts = now()
FROM  ( 
   SELECT id
   FROM   cargo_item
   WHERE  state='NEW' AND job_id is null 
   LIMIT  3000
   FOR UPDATE
   ) sub
WHERE  item.id = sub.id;

Basically this approach locks the 3000 rows for an update. I am not sure whether it is a good approach, though.
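
Afterwards, each job would fetch its own batch along these lines (a sketch only, using the same columns as above; 'SOME_UUID' stands for the job's own id):

SELECT *
FROM   cargo_item
WHERE  job_id = 'SOME_UUID'
AND    state  = 'NEW';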

On another thread I read about using advisory locks for this scenario.

What do you think about the current approach, and about using an advisory lock instead?

UPDATE

As suggested, I'd adapt the update statement like this:

UPDATE cargo_item item
SET job_id = 'SOME_UUID', job_ts = now()
FROM  ( 
   SELECT id
   FROM   cargo_item
   WHERE  state='NEW' AND job_id is null 
   ORDER  BY id
   LIMIT  3000
   FOR UPDATE
   ) sub
WHERE  item.id = sub.id;

Thanks Erwin and Tometzky for the hint. Nevertheless, I wonder whether the way I am trying to solve the problem is a good one. Are there different approaches you would think of?

u6f6o
  • The answer would need some numbers to be complete. How many rows are there? Are your actual operations as simple (and fast) as what you display here? How many calls per time unit do you plan for? – Erwin Brandstetter Dec 13 '14 at 01:10
  • The table is not that big. It's approx. 500k rows but gets cleaned up from time to time. Currently 4 jobs are started with cron at the same time. Happens every hour. The problem is, these jobs use a loop to mark the next 3000 entries for processing until no remaining entries are left. – u6f6o Dec 13 '14 at 22:46
  • For a simple update as in your example, you are probably better off with a single update for the whole table. If you have to protect against concurrent write operations, take out a table lock at the beginning of the transaction. With "just" 500k rows, the update should be a matter of seconds at most. There are ways to optimize updating the whole table. [Example.](http://stackoverflow.com/questions/10412078/add-new-column-without-table-lock/10412790#10412790) [Example with more details.](http://dba.stackexchange.com/questions/52517/best-way-to-populate-a-new-column-in-a-large-table/52531#52531) – Erwin Brandstetter Dec 14 '14 at 15:06
  • Not sure how I can achieve this with one single update. The job_id differs for each job instance running. The idea is to let each job flag a certain subset of all the entries with state 'NEW' to get an isolated set of cargo_items to process. Assuming I'd have 4 jobs, the first job would mark the first 3000 entries with UUID_1, the other jobs with UUID_2, UUID_3 etc. – u6f6o Dec 14 '14 at 18:00
  • I added a solution for that, too. – Erwin Brandstetter Dec 14 '14 at 18:37
  • I really like the single update approach but it would need some more effort to achieve it on the job/client side. Currently each job gets a fresh uuid on startup, meaning there is no fixed relation between job instance and job_id, thus I'd have to adapt this. Besides, this single update should be executed only by the first job I suppose, meaning that I'd have to create a job that runs before the others and prepares the batches, or alternatively the first job blocks the others until the update is done (jobs are java command-line jobs). Please correct me if I got something wrong. – u6f6o Dec 14 '14 at 18:59
  • Think I'll try out the initial solution first with the added ORDER BY on id and check how it behaves. If it does not work out so well, I'll go for the single-update solution. – u6f6o Dec 14 '14 at 19:00

3 Answers


You will have deadlocks with this approach. You can avoid them by simply adding ORDER BY id to the subquery.

But it will prevent any concurrent running of these queries, as each concurrent query will first try to mark the lowest free id and block until the first client commits. I don't think this is a problem if you process, say, less than one batch per second.

You don't need advisory locks. Avoid them if you can.

Tometzky

In the related answer you are referring to:

The objective is to lock one row at a time. This works fine with or without advisory locks, because there is no chance for a deadlock - as long as you don't try to lock more rows in the same transaction.

Your example is different in that you want to lock 3000 rows at a time. There is potential for deadlock, unless all concurrent write operations lock rows in the same consistent order. Per the documentation:

The best defense against deadlocks is generally to avoid them by being certain that all applications using a database acquire locks on multiple objects in a consistent order.

Implement that with an ORDER BY in your subquery.

UPDATE cargo_item item
SET job_id = 'SOME_UUID', job_ts = now()
FROM  ( 
   SELECT id
   FROM   cargo_item
   WHERE  state='NEW' AND job_id is null 
   ORDER  BY id
   LIMIT  3000
   FOR UPDATE
   ) sub
WHERE  item.id = sub.id;

This is safe and reliable, as long as all transactions acquire locks in the same order and concurrent updates of the ordering columns are not to be expected. (Read the yellow "CAUTION" box at the end of this chapter in the manual.) So this should be safe in your case, since you are not going to update the id column.

Effectively only one client at a time can manipulate rows this way. Concurrent transactions would try to lock the same (locked) rows and wait for the first transaction to finish.

Advisory locks are useful if you have many or very long-running concurrent transactions (it doesn't seem you do). With only a few, it will be cheaper overall to just use the above query and have concurrent transactions wait their turn.

All in one UPDATE

It seems concurrent access isn't a problem per se in your setup. Concurrency is an issue created by your current solution.

Instead, do it all in a single UPDATE. Assign batches of n rows (3000 in the example) to each UUID and update them all at once. That should be fastest.

UPDATE cargo_item c
SET    job_id = u.uuid_col
     , job_ts = now()
FROM  (
   SELECT row_number() OVER () AS rn, uuid_col
   FROM   uuid_tbl WHERE  <some_criteria>  -- or see below
   ) u
JOIN (
   SELECT (row_number() OVER () - 1) / 3000 + 1 AS rn, id
   FROM  (
      SELECT id
      FROM   cargo_item
      WHERE  state = 'NEW' AND job_id IS NULL
      FOR    UPDATE   -- just to be sure
      ) locked
   ) c2 USING (rn)
WHERE  c2.id = c.id;

Major points

  • Integer division truncates. You get 1 for the first 3000 rows, 2 for the next 3000 rows, etc.

  • I pick rows arbitrarily; you could apply ORDER BY in the window for row_number() to assign particular rows.

  • If you don't have a table of UUIDs to dispatch (uuid_tbl), use a VALUES expression to supply them; see the sketch after this list.

  • You get batches of 3000 rows. The last batch will have fewer than 3000 rows unless the total number of rows is a multiple of 3000.
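
For illustration, a variant without uuid_tbl might look roughly like this (a sketch only; the quoted values are made-up placeholders for the UUIDs the jobs actually use):

UPDATE cargo_item c
SET    job_id = u.uuid_col
     , job_ts = now()
FROM  (
   SELECT row_number() OVER () AS rn, uuid_col
   FROM  (
      VALUES ('UUID_1'), ('UUID_2'), ('UUID_3'), ('UUID_4')  -- placeholders: substitute the jobs' real UUIDs
      ) v(uuid_col)
   ) u
JOIN (
   SELECT (row_number() OVER () - 1) / 3000 + 1 AS rn, id
   FROM  (
      SELECT id
      FROM   cargo_item
      WHERE  state = 'NEW' AND job_id IS NULL
      FOR    UPDATE
      ) locked
   ) c2 USING (rn)
WHERE  c2.id = c.id;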

Erwin Brandstetter

What you need is advisory locks.

   SELECT id
   FROM   cargo_item
   WHERE  pg_try_advisory_lock(id)
   LIMIT  3000
   FOR UPDATE

This will place an advisory lock on the rows, and other processes will not see those rows as long as they use the same pg_try_advisory_lock(id) call in their WHERE clause. Remember to unlock the rows afterwards with pg_advisory_unlock.
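
Put together with the update from the question, the flow might look roughly like this (a sketch only; the state/job_id filter and the unlock step are assumptions based on the question's schema, not part of the snippet above):

UPDATE cargo_item item
SET    job_id = 'SOME_UUID', job_ts = now()
FROM  (
   SELECT id
   FROM   cargo_item
   WHERE  state = 'NEW' AND job_id IS NULL
   AND    pg_try_advisory_lock(id)  -- rows already locked by other workers are skipped
   LIMIT  3000
   ) sub
WHERE  item.id = sub.id;

-- ... process the batch ...

-- release the session-level advisory locks when done
SELECT pg_advisory_unlock(id)
FROM   cargo_item
WHERE  job_id = 'SOME_UUID';
-- or simply: SELECT pg_advisory_unlock_all();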

Joe Love