
I have a simple procedure where I need to process records of a table, and ideally run multiple instances of the process without processing the same record. The way I've done this with MySQL is fairly common (although I perceive the token field to be more of a hack):

Adding a couple of fields to the table:

CREATE TABLE records (
    id INTEGER PRIMARY KEY AUTO_INCREMENT,
    ...actual fields...

    processed_at DATETIME DEFAULT NULL,
    process_token TEXT DEFAULT NULL
);

And then a simple processing script:

process_salt = md5(rand()) # or something like a process id

def get_record():
    token = md5(microtime() + process_salt)
    # Claim one row that is neither processed nor already claimed
    db.exec("UPDATE records SET process_token = ?
             WHERE processed_at IS NULL AND process_token IS NULL
             LIMIT 1", token)
    return db.exec("SELECT * FROM records WHERE process_token = ?", token)

while (row = get_record()) is valid:
    # ...do processing on row...

    db.exec("UPDATE records SET processed_at = NOW(), process_token = NULL
             WHERE id = ?", row.id)

I'm implementing such a process in a system which uses a PostgreSQL database. I know Pg could be considered more mature than MySQL with regards to locking thanks to MVCC - can I use row-locking or some other feature in Pg instead of the token field?
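To make the pattern above concrete, here is a self-contained sketch of the token-claim approach using Python's sqlite3 as a stand-in database (the real system would use MySQL or Pg; the `payload` column and `uuid` token are just illustrative choices, and sqlite3 lacks `UPDATE ... LIMIT`, so the claim uses an `id = (SELECT ...)` subquery instead):

```python
# Sketch of the token-claim pattern, using sqlite3 as a stand-in
# database so it runs self-contained.
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE records (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    payload TEXT,
    processed_at TEXT DEFAULT NULL,
    process_token TEXT DEFAULT NULL)""")
db.executemany("INSERT INTO records (payload) VALUES (?)",
               [(f"job-{n}",) for n in range(5)])

def get_record():
    # Claim one unclaimed, unprocessed row by stamping it with a unique token.
    token = uuid.uuid4().hex
    db.execute("""UPDATE records SET process_token = ?
                  WHERE id = (SELECT id FROM records
                              WHERE processed_at IS NULL
                                AND process_token IS NULL
                              LIMIT 1)""", (token,))
    return db.execute("SELECT id, payload FROM records WHERE process_token = ?",
                      (token,)).fetchone()

processed = []
while (row := get_record()) is not None:
    processed.append(row[1])          # ...do processing on row...
    db.execute("""UPDATE records
                  SET processed_at = datetime('now'), process_token = NULL
                  WHERE id = ?""", (row[0],))

print(processed)   # every job claimed exactly once
```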

Ross

2 Answers


This approach will work with PostgreSQL, but it tends to be pretty inefficient because you update each row twice: once to claim it and once to mark it done, so every record costs two transactions and two commits. The cost can be mitigated somewhat by setting a commit_delay and possibly disabling synchronous_commit, but it's still not going to be fast unless you have a non-volatile write-back cache on your storage subsystem.
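For reference, both of those knobs live in postgresql.conf; the values below are illustrative placeholders, not recommendations:

```ini
# postgresql.conf (sketch; tune for your workload)
synchronous_commit = off   # commit returns before WAL is flushed to disk; a
                           # crash can lose the latest commits, but won't
                           # corrupt the database
commit_delay = 10000       # microseconds to wait before flushing, so several
                           # concurrent commits can share one WAL flush
```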

More importantly, because you're committing the first update, there is no way to tell the difference between a worker that's still working on a job and a worker that has crashed. You could probably set the token to the worker's process ID if all workers are on the local machine, then scan for missing PIDs occasionally, but that's cumbersome and race-condition prone, not to mention the problems with PID re-use.

I would recommend that you adopt a real queuing solution that is designed to solve these problems, like ActiveMQ, RabbitMQ, ZeroMQ, etc. PGQ may also be of significant interest.

Doing queue processing in a transactional relational database should be easy, but in practice it's ridiculously hard to do well and get right. Most of the "solutions" that look sensible at a glance turn out to actually serialize all work (so only one of many queue workers is doing anything at any given time) when examined in detail.
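As a sketch of that trap (column names assumed from the question, `payload` is hypothetical): a pattern that looks sensible is for each worker to lock the next job with `SELECT ... FOR UPDATE`, but a plain `FOR UPDATE` blocks, so while one worker holds the lock on the "next" row, every other worker just waits on it, and the workers end up taking turns:

```sql
-- Each worker tries to claim the next job:
BEGIN;
SELECT id, payload
FROM   records
WHERE  processed_at IS NULL
ORDER  BY id
LIMIT  1
FOR UPDATE;    -- blocks until the row lock is released, so other
               -- workers queue up behind whoever got here first

-- ...do the work, then:
UPDATE records SET processed_at = now() WHERE id = ...;
COMMIT;
```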

Craig Ringer
  • Yeah, this approach always felt like hammering a nail with a spanner - I've been meaning to look at ZeroMQ for a long time so I'll take that route. Thanks for your insight! – Ross Apr 11 '13 at 20:08

You can use SELECT ... FOR UPDATE NOWAIT which will obtain an exclusive lock on the row, or report an error if it is already locked.
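For example, against the question's table (a sketch, not tested against your schema):

```sql
BEGIN;
-- Try to lock one unprocessed row; instead of blocking, this raises
-- "could not obtain lock on row" immediately if another worker
-- already holds the lock.
SELECT *
FROM   records
WHERE  processed_at IS NULL
LIMIT  1
FOR UPDATE NOWAIT;

-- ...process the row, then mark it done:
UPDATE records SET processed_at = now() WHERE id = ...;
COMMIT;
```

Note that on a lock conflict the statement errors out and aborts the transaction, so the worker has to catch the error and retry; and since `LIMIT 1` tends to pick the same first row for every worker, retries can collide repeatedly, which is part of why these schemes are tricky to get right.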

Dondi Michael Stroma