
I'm implementing an importer that uses a pool of workers (powered by Celery and RabbitMQ). The imported data includes user data, identified by a unique user ID, and is stored in a PostgreSQL database.

I usually receive a lot of import files, and those files are processed concurrently. Several files can contain data for the same user ID, which leads to the situation where two imports try to process data for the same user ID at the same time.

If data for that user ID already exists, there's no issue: the existing row is locked with SELECT ... FOR UPDATE. However, if there's no row for that ID yet, both imports can run into the troublesome situation where both try to INSERT a row with the same user ID.

As the user ID column has a UNIQUE constraint, one of the transactions fails.
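For illustration, here is a stripped-down sketch of the current approach using the Django ORM (the `UserData` model and its fields are placeholders, not the actual code):

```python
# Minimal sketch of the pattern described above; model and field names are made up.
from django.db import transaction
from .models import UserData  # hypothetical model with a UNIQUE user_id column

def import_user_data(user_id, payload):
    with transaction.atomic():
        try:
            # Existing rows are serialized via SELECT ... FOR UPDATE, so this case is fine.
            row = UserData.objects.select_for_update().get(user_id=user_id)
            row.payload = payload
            row.save()
        except UserData.DoesNotExist:
            # Race window: two workers can both reach this branch and both
            # attempt the INSERT; the UNIQUE constraint makes one of them fail.
            UserData.objects.create(user_id=user_id, payload=payload)
```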

How can I prevent this from happening? Are there options other than locking the full table?

stschindler
  • Something like this? http://stackoverflow.com/questions/1109061/insert-on-duplicate-update-in-postgresql – xQbert Jun 02 '16 at 13:03
  • @xQbert Not really, my issue is that two concurrent transactions try to `INSERT` both *at the same time*, where one will of course always fail. This happens because I don't know of any reliable way of locking this operation, like: *Lock any future `INSERT` operations that try to insert with user ID = 1234*. – stschindler Jun 02 '16 at 13:13
  • But an upsert as @xQbert has suggested really does work here. – e4c5 Jun 02 '16 at 13:35
  • Right, I got the other question wrong, sorry. Too bad Django doesn't support that. Thanks! – stschindler Jun 02 '16 at 15:23
  • Use `update_or_create()` (or `get_or_create()`). If you have a uniqueness constraint on the user ID this will be atomic. See [this answer](http://stackoverflow.com/a/22095136/2395796) for a good description. – Kevin Christopher Henry Jun 02 '16 at 16:53
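For reference, the upsert suggested in the comments could look roughly like the sketch below. The table and column names (`app_userdata`, `user_id`, `payload`) are assumptions, and the raw `ON CONFLICT` form requires PostgreSQL 9.5+:

```python
# Two ways to realize the upsert suggested in the comments.
# Table/column names and the UserData model are assumptions, not from the question.
from django.db import connection
from .models import UserData

def upsert_raw(user_id, payload):
    # Native PostgreSQL upsert (9.5+): the conflicting INSERT simply
    # becomes an UPDATE, so concurrent workers never hit a UNIQUE violation.
    with connection.cursor() as cursor:
        cursor.execute(
            """
            INSERT INTO app_userdata (user_id, payload)
            VALUES (%s, %s)
            ON CONFLICT (user_id) DO UPDATE SET payload = EXCLUDED.payload
            """,
            [user_id, payload],
        )

def upsert_orm(user_id, payload):
    # get_or_create() catches the IntegrityError raised by a concurrent
    # INSERT and re-fetches the existing row, so neither worker fails.
    row, created = UserData.objects.get_or_create(
        user_id=user_id, defaults={"payload": payload}
    )
    if not created:
        row.payload = payload
        row.save()
```

With either variant, the concurrent INSERT on the same user ID no longer aborts a transaction: it is either converted into an UPDATE by PostgreSQL or caught and turned into a fetch of the existing row.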

0 Answers