PostgreSQL concurrent transaction issues

Question

I'm currently building a crawler. Multiple crawling workers access the same PostgreSQL database. Sadly I'm encountering issues with the main transaction presented here:

BEGIN ISOLATION LEVEL SERIALIZABLE;
    UPDATE webpages
    SET locked = TRUE
    WHERE url IN 
        (
            SELECT DISTINCT ON (source) url
            FROM webpages
            WHERE
                (
                    last IS NULL
                    OR
                    last < refreshFrequency
                )
                AND
                locked = FALSE
            LIMIT limit
        )
    RETURNING *;
COMMIT;

url is a URL (String)
source is a domain name (String)
last is the last time a page was crawled (Date)
locked is a boolean that is set to indicate that a webpage is currently being crawled (Boolean)

I tried two different transaction isolation levels:

ISOLATION LEVEL SERIALIZABLE, I get errors like could not serialize access due to concurrent update
ISOLATION LEVEL READ COMMITTED, I get duplicate urls from concurrent transactions due to the data being "frozen" from the time the transaction was first committed (I think)

I'm fairly new to PostgreSQL and SQL in general so I'm really not sure what I could do to fix this issue.

Update:
PostgreSQL version is 9.2.x.
webpage table definition:

CREATE TABLE webpages (
  last timestamp with time zone,
  locked boolean DEFAULT false,
  url text NOT NULL,
  source character varying(255) PRIMARY KEY
);

@ErwinBrandstetter: I described the issues with the different isolation levels as well as the different column types and descriptions below the query. I'm using PostgreSQL 9.2. — m_vdbeek, Apr 22 '15 at 20:05
There are clean solutions, depending on exact requirements, which are not clear. In particular, this makes no sense without explanation: `DISTINCT ON (source) url` , and no `ORDER BY`. What is it supposed to do? Do you want to retrieve *one* (arbitrary) `url` per `source`? — Erwin Brandstetter, Apr 22 '15 at 22:42
I want to spread out the requests I make by source so as to not flood the same hosts with requests. The reason I'm not using a ORDER BY is that I need to find a different source for each url and the reason I did not use GROUP BY was because I need the url to be returned but it would not have been possible because if you GROUP BY source then the url is not in the SELECT. — m_vdbeek, Apr 22 '15 at 22:51
The actual table definition showing all constraints would help (`\d webpages` in psql). The explanation should also be in the question. Please edit to give a complete picture. You can either solve this with `FOR UPDATE` row level locks or with advisory locks, depending on the whole picture ... — Erwin Brandstetter, Apr 22 '15 at 22:55

score 3 · Accepted Answer · edited May 23 '17 at 11:46

Clarification

The question leaves room for interpretation. This is how I understand the task:

Lock a maximum of limit URLs which fulfill some criteria and are not locked, yet. To spread out the load on sources, every URL should come from a different source.

DB design

Assuming a separate table source: this makes the job faster and easier. If you don't have such a table, create it, it's the proper design anyway:

CREATE TABLE source (
  source_id serial NOT NULL PRIMARY KEY
, source    text NOT NULL
);

CREATE TABLE webpage (
  source_id int NOT NULL REFERENCES source
  url       text NOT NULL PRIMARY KEY
  locked    boolean NOT NULL DEFAULT false,        -- may not be needed
  last      timestamp NOT NULL DEFAULT '-infinity' -- makes query simpler
);

Alternatively you can use a recursive CTE efficiently:

Optimize GROUP BY query to retrieve latest record per user

Basic solution with advisory locks

I am using advisory locks to make this safe and cheap even in default read committed isolation level:

UPDATE webpage w
SET    locked = TRUE
FROM  (
   SELECT (SELECT url
           FROM   webpage
           WHERE  source_id = s.source_id
           AND   (last >= refreshFrequency) IS NOT TRUE
           AND    locked = FALSE
           AND    pg_try_advisory_xact_lock(url)  -- only true is free
           LIMIT  1     -- get 1 URL per source
          ) AS url
   FROM  (
      SELECT source_id  -- the FK column in webpage
      FROM   source
      ORDER  BY random()
      LIMIT  limit      --  random selection of "limit" sources
      ) s
   FOR    UPDATE
   ) l
WHERE  w.url = l.url
RETURNING *;

Alternatively, you could work with only advisory locks and not use the table column locked at all. Basically just run the the SELECT statement. Locks are kept until the end of the transaction. You can use pg_try_advisory_lock() instead to keep the locks till the end of the session. Only UPDATE once to set last when done (and possible release the advisory lock).

Other major points

In Postgres 9.3 or later you would use a LATERAL join instead of the correlated subquery.
I chose pg_try_advisory_xact_lock() because the lock can (and should) be released at the end of the transaction. Detailed explanation for advisory locks:
- Postgres UPDATE ... LIMIT 1
You get less than limit rows if some sources have no more URL to crawl.
The random selection of sources is my wild but educated guess, since information is not available. If your source table is big, there are faster ways:
- Best way to select random rows PostgreSQL
refreshFrequency should really be called something like lastest_last, since it's not a "frequency", but a timestamp or date.

Recursive alternatve

To get the full limit number of rows if available, use a RECURSIVE CTE and iterate all sources until you found enough or no more can be found.

As I mentioned above, you may not need the column locked at all and operate with advisory locks only (cheaper). Just set last at the end of the transaction, before you start the next round.

WITH RECURSIVE s AS (
   SELECT source_id, row_number() OVER (ORDER BY random()) AS rn
   FROM source  -- you might exclude "empty" sources early ...
   )
, page(source_id, rn, ct, url) AS (
   SELECT 0, 0, 0, ''::text   -- dummy init row
   UNION ALL
   SELECT s.source_id, s.rn
        , CASE WHEN t.url <> ''
               THEN p.ct + 1
               ELSE p.ct END  -- only inc. if url found last round
        , (SELECT url
           FROM   webpage
           WHERE  source_id = t.source_id
           AND   (last >= refreshFrequency) IS NOT TRUE
           AND    locked = FALSE  -- may not be needed
           AND    pg_try_advisory_xact_lock(url)  -- only true is free
           LIMIT  1           -- get 1 URL per source
          ) AS url            -- try, may come up empty
   FROM   page p
   JOIN   s ON s.rn = p.rn + 1
   WHERE  CASE WHEN p.url <> ''
               THEN p.ct + 1
               ELSE p.ct END < limit  -- your limit here
   )
SELECT url
FROM   page
WHERE  url <> '';             -- exclude '' and NULL

Alternatively, if you need to manage locked, too, use this query with the above UPDATE.

Currently, source is just a simple column with a string type that is repeated for each webpage record. — m_vdbeek, Apr 23 '15 at 12:36
In this two table (one for sources, one for webpages) architecture, wouldn't there be an issue when the ```ORDER BY random()``` returns sources for which no page matches the ```(last >= refreshFrequency) IS NOT TRUE``` condition ? — m_vdbeek, Apr 23 '15 at 13:04
I also added the ```webpages``` table to my original question post. — m_vdbeek, Apr 23 '15 at 13:08
@m_vdbeek: I added an alternative solution and a lot more information. — Erwin Brandstetter, Apr 24 '15 at 03:54

score 0 · Answer 2 · answered Apr 22 '15 at 19:55

First try:

UPDATE webpages
SET locked = TRUE
WHERE url IN 
    (
        SELECT DISTINCT ON (source) url
        FROM webpages
        WHERE
            (
                last IS NULL
                OR
                last < refreshFrequency
            )
            AND
            locked = FALSE
        LIMIT limit
    )
    WHERE
       (
           last IS NULL
           OR
           last < refreshFrequency
        )
        AND
        locked = FALSE

You are trying to update only records with locked = FALSE.
Imagine that there are the following records in the table:

URL       locked
----------------
A         false
A         true

The subquery in your update will retrun A.
Then the outer update will perform:

   UPDATE webpages
    SET locked = TRUE
    WHERE url IN ( 'A' )

and in effect all records in the table containing url=A will be updated,
regardess of their values in locked column.

You need to apply to the outer update the same WHERE condition as in the subquery.

Should it still be wrapped in a transaction ? If yes, at which isolation level ? — m_vdbeek, Apr 22 '15 at 20:01