
Here is my structure (with values):

user_eval_history table

 user_eval_id | user_id | is_good_eval                             
--------------+---------+--------------
            1 |       1 |            t
            2 |       1 |            t
            3 |       1 |            f
            4 |       2 |            t

user_metrics table

 user_metrics_id | user_id | nb_good_eval | nb_bad_eval
-----------------+---------+--------------+-------------
               1 |       1 |            2 |           1
               2 |       2 |            1 |           0

For access time (performance) reasons I want to avoid recomputing user evaluations from the history again and again. I would like to store/update the sums of evaluations (for a given user) every time a new evaluation is given to the user (meaning every time there is an INSERT into the user_eval_history table I want to update the user_metrics table for the corresponding user_id).

I feel like I can achieve this with a trigger and a stored procedure but I'm not able to find the correct syntax for this.

I think I need to do what follows:

1. Create a trigger on user_eval_history:

CREATE TRIGGER update_user_metrics_trigger AFTER INSERT
    ON user_eval_history
        FOR EACH ROW
    EXECUTE PROCEDURE update_user_metrics('user_id');

2. Create a stored procedure update_user_metrics that

2.1 Computes the metrics from the user_eval_history table for user_id

SELECT 
  user_id,
  SUM(CASE WHEN is_good_eval = 't' THEN 1 ELSE 0 END) AS nb_good_eval,
  SUM(CASE WHEN is_good_eval = 'f' THEN 1 ELSE 0 END) AS nb_bad_eval
FROM user_eval_history
WHERE user_id = 'user_id' -- don't know the syntax here
GROUP BY user_id

2.2.1 Creates the entry into user_metrics if not already existing

INSERT INTO user_metrics
  (user_id, nb_good_eval, nb_bad_eval) VALUES
  (user_id, nb_good_eval, nb_bad_eval) -- Syntax?????

2.2.2 Updates the user_metrics entry if already existing

UPDATE user_metrics SET
  (user_id, nb_good_eval, nb_bad_eval) = (user_id, nb_good_eval, nb_bad_eval)

I think I'm close to what is needed, but I don't know how to achieve this. In particular, I don't know the exact syntax.

Any idea?

Note: Please, no "RTFM" answers; I have looked for hours and found nothing but trivial examples.

Erwin Brandstetter
Nitseg
  • Actually a bit of an RTFM sort of answer, but you probably don't know where to look. Here: http://stackoverflow.com/questions/1109061/insert-on-duplicate-update-in-postgresql – sn00k4h Dec 23 '14 at 22:52
  • Thanks. Reuse of the SELECT result in the update/insert statement is not answered but I got a part of the answer. I had seen this example from the official doc. Does that mean that we can have a race condition on a table within a function? I thought PostgreSQL was using transactions transparently with functions. – Nitseg Dec 23 '14 at 23:38
  • You can actually join your SELECT query with your UPDATE query. Look at the bottom of the page here: http://www.postgresql.org/docs/9.1/static/sql-update.html . Alternatively you can also loop through the result as such: http://www.postgresql.org/docs/8.0/static/plpgsql-control-structures.html#PLPGSQL-RECORDS-ITERATING . You are correct that the whole function will be executed as 1 transaction. – sn00k4h Dec 24 '14 at 00:00
  • Thanks a lot. Well I'm a little confused about transactions. In the manual example there is an attempt to update the record. If not possible we try to insert. If there is an exception, that means an insert has been done in the meantime by another transaction. So we go back to update. So that means the table wasn't protected from INSERT by another transaction. So the solution in the doc doesn't guarantee data integrity. In my example that means if two transactions are inserting user_eval record then there is no guarantee I will be updating the user_metrics with up-to-date evals, right? – Nitseg Dec 24 '14 at 00:26
  • No, if 2 transactions try to insert a row with the same unique key (which I assume you will have), one of them will fail and throw the exception. In the sample code it will then retry the function which will first re-attempt the UPDATE query. To make sure you do the UPDATE with the correct, updated value, you have to make sure your SELECT query is inside the LOOP (or is a part of the UPDATE query). – sn00k4h Dec 24 '14 at 01:00
  • According to this: http://stackoverflow.com/questions/15939902/is-select-or-insert-in-a-function-prone-to-race-conditions/15950324#15950324 "The time window between the SELECT and the INSERT within one query is super tiny. If you don't have heavy concurrent load, or if you can live with an exception once a year, you could just ignore the case". I need to avoid errors. I think row-level lock could be the perfect answer for me. Meaning, SELECT ... FROM user_metrics WHERE user_id = whatever FOR UPDATE. This way no other transaction can update this row until this current one is done. – Nitseg Dec 24 '14 at 01:28
  • Well, if you do the SELECT as part of the UPDATE query then you wouldn't need to worry about doing explicit locking as an implicit row-level lock will already be done for you, and the values you get from the SELECT will be the "current" values as of the time you run the UPDATE. – sn00k4h Dec 24 '14 at 01:37
  • Actual table definitions (what you get with `\d tbl` in psql) would be useful, showing data types and constraints and everything. – Erwin Brandstetter Dec 24 '14 at 01:54

1 Answer


First, revisit the assumption that maintaining an always current materialized view is a significant performance gain. You add a lot of overhead and make writes to user_eval_history a lot more expensive. The approach only makes sense if writes are rare while reads are more common. Else, consider a VIEW instead, which is more expensive for reads, but always current. With appropriate indexes on user_eval_history this may be cheaper overall.
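A plain view for this could be sketched as follows (the name user_metrics_v and the index are illustrative, not from the question):

```sql
-- Hypothetical plain VIEW: always current, recomputed on every read
CREATE VIEW user_metrics_v AS
SELECT user_id
     , sum(CASE WHEN is_good_eval THEN 1 ELSE 0 END) AS nb_good_eval
     , sum(CASE WHEN is_good_eval THEN 0 ELSE 1 END) AS nb_bad_eval
FROM   user_eval_history
GROUP  BY user_id;

-- Supporting index so per-user reads don't scan the whole history
CREATE INDEX user_eval_history_user_id_idx ON user_eval_history (user_id);
```

Reads pay the aggregation cost, but there is no trigger overhead on writes and no risk of the counters drifting out of sync.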

Next, consider an actual MATERIALIZED VIEW (Postgres 9.3+) for user_metrics instead of keeping it up to date manually, especially if write operations to user_eval_history are very rare. The tricky part is when to refresh the MV.
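A minimal sketch (the name user_metrics_mv is illustrative):

```sql
-- Hypothetical MATERIALIZED VIEW (Postgres 9.3+): stored result, manual refresh
CREATE MATERIALIZED VIEW user_metrics_mv AS
SELECT user_id
     , sum(CASE WHEN is_good_eval THEN 1 ELSE 0 END) AS nb_good_eval
     , sum(CASE WHEN is_good_eval THEN 0 ELSE 1 END) AS nb_bad_eval
FROM   user_eval_history
GROUP  BY user_id;

-- Re-run after batch loads (or from cron); plain REFRESH locks out reads.
-- REFRESH ... CONCURRENTLY needs Postgres 9.4+ and a unique index on the MV.
REFRESH MATERIALIZED VIEW user_metrics_mv;
```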

Your approach makes sense if you are somewhere in between, user_eval_history has a non-trivial size and you need user_metrics to reflect the current state exactly and close to real-time.

Still on board? OK. First you need to define exactly what's allowed / possible and what's not. Can rows in user_eval_history be deleted? Can the last row of a user in user_eval_history be deleted? Probably yes, even if you would answer "No". Can rows in user_eval_history be updated? Can user_id be changed? Can is_good_eval be changed? If yes, you need to prepare for each of these cases.
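If DELETEs are possible, for instance, you would need a companion trigger that decrements the counters. An untested sketch, assuming the same table definitions as in the code below:

```sql
-- Hypothetical companion trigger for DELETEs: undo the increment
CREATE OR REPLACE FUNCTION trg_user_eval_history_delaft()
   RETURNS trigger
   LANGUAGE plpgsql AS
$func$
BEGIN
   IF OLD.is_good_eval THEN      -- OLD holds the deleted row
      UPDATE user_metrics
      SET    nb_good_eval = nb_good_eval - 1
      WHERE  user_id = OLD.user_id;
   ELSE
      UPDATE user_metrics
      SET    nb_bad_eval = nb_bad_eval - 1
      WHERE  user_id = OLD.user_id;
   END IF;
   RETURN NULL;  -- NULL for AFTER trigger
END
$func$;

CREATE TRIGGER delaft_update_user_metrics
AFTER DELETE ON user_eval_history
FOR EACH ROW EXECUTE PROCEDURE trg_user_eval_history_delaft();
```

An UPDATE that changes user_id or is_good_eval would need similar handling (decrement for OLD, increment for NEW).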

Assuming the trivial case: INSERT only. No UPDATE, no DELETE. There is still the possible race condition you have been discussing with @sn00k4h. You found an answer to that, but that's really for INSERT or SELECT, while you have a classical UPSERT problem: INSERT or UPDATE:

FOR UPDATE, like you considered in the comments, is not the silver bullet here. UPDATE user_metrics ... locks the row it updates anyway. The problematic case is when two INSERTs try to create a row for a new user_id concurrently. In Postgres, you cannot lock key values that are not yet present in the unique index, so FOR UPDATE can't help. You need to prepare for a possible unique violation and retry.
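On Postgres 9.5 or later (not yet released when this was written), the retry loop can be replaced by a native UPSERT. A sketch, with literal values standing in for the trigger's NEW row:

```sql
-- Postgres 9.5+: INSERT ... ON CONFLICT handles the race atomically
INSERT INTO user_metrics (user_id, nb_good_eval, nb_bad_eval)
VALUES (1, 1, 0)  -- in a trigger: NEW.user_id and the appropriate increment
ON     CONFLICT (user_id) DO UPDATE
SET    nb_good_eval = user_metrics.nb_good_eval + EXCLUDED.nb_good_eval
     , nb_bad_eval  = user_metrics.nb_bad_eval  + EXCLUDED.nb_bad_eval;
```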

Code

Assuming these table definitions:

CREATE TABLE user_eval_history (
   user_eval_id serial PRIMARY KEY
 , user_id int NOT NULL
 , is_good_eval boolean NOT NULL
);

CREATE TABLE user_metrics (
   user_metrics_id serial  -- seems useless
 , user_id int PRIMARY KEY
 , nb_good_eval int NOT NULL DEFAULT 0
 , nb_bad_eval int NOT NULL DEFAULT 0
);

First, you need a trigger function before you can create a trigger.

CREATE OR REPLACE FUNCTION trg_user_eval_history_upaft()
   RETURNS trigger
   LANGUAGE plpgsql AS
$func$
BEGIN
LOOP
   IF NEW.is_good_eval THEN
      UPDATE user_metrics
      SET    nb_good_eval = nb_good_eval + 1
      WHERE  user_id = NEW.user_id;
   ELSE
      UPDATE user_metrics
      SET    nb_bad_eval = nb_bad_eval + 1
      WHERE  user_id = NEW.user_id;
   END IF;
   EXIT WHEN FOUND;

   BEGIN  -- enter block with exception handling
      IF NEW.is_good_eval THEN
         INSERT INTO user_metrics (user_id, nb_good_eval)
         VALUES (NEW.user_id, 1);
      ELSE
         INSERT INTO user_metrics (user_id, nb_bad_eval)
         VALUES (NEW.user_id, 1);
      END IF;
      RETURN NULL;  -- returns from function, NULL for AFTER trigger

   EXCEPTION WHEN UNIQUE_VIOLATION THEN     -- user_metrics.user_id is UNIQUE
      RAISE NOTICE 'It actually happened!'; -- hardly ever happens
   END;    
END LOOP;
RETURN NULL;  -- NULL for AFTER trigger
END
$func$;

In particular, you don't pass user_id as a parameter to the trigger function. The special variable NEW holds the values of the triggering row automatically. Details in the manual.

Trigger:

CREATE TRIGGER upaft_update_user_metrics
AFTER INSERT ON user_eval_history
FOR EACH ROW EXECUTE PROCEDURE trg_user_eval_history_upaft();
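A quick smoke test with the sample values from the question, to verify the trigger maintains the counters:

```sql
-- The trigger keeps user_metrics current as rows arrive
INSERT INTO user_eval_history (user_id, is_good_eval)
VALUES (1, true), (1, true), (1, false), (2, true);

TABLE user_metrics;
-- expected: user 1 with nb_good_eval = 2, nb_bad_eval = 1; user 2 with 1 / 0
```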
Erwin Brandstetter
  • Wooooow, amazing answer. Very clear, very detailed. Thanks a lot. Well, given what you are saying I think my approach fully makes sense: intensive user_metrics read operations, quite rare user_eval write operations, need for user_metrics values to fully reflect user_eval values in real time. To avoid the UPSERT scenario I can create an "all to zero values" user_metrics record along with user_account creation. This way there won't be any user_metrics insert when inserting user_eval records, only user_metrics updates. I tried to answer this to sn00k4h but wasn't allowed to by SO. – Nitseg Dec 24 '14 at 09:06
  • @user2383627: Avoiding the UPSERT this way would be a good solution. – Erwin Brandstetter Dec 24 '14 at 13:36
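The idea from the comments could be sketched like this (user_account is a hypothetical table name with a user_id primary key; not part of the question):

```sql
-- Hypothetical: create the zero-valued metrics row together with the user,
-- so the eval trigger only ever UPDATEs and the UPSERT race disappears
CREATE OR REPLACE FUNCTION trg_user_account_insaft()
   RETURNS trigger
   LANGUAGE plpgsql AS
$func$
BEGIN
   INSERT INTO user_metrics (user_id)  -- counters default to 0
   VALUES (NEW.user_id);
   RETURN NULL;  -- NULL for AFTER trigger
END
$func$;

CREATE TRIGGER insaft_create_user_metrics
AFTER INSERT ON user_account
FOR EACH ROW EXECUTE PROCEDURE trg_user_account_insaft();
```

With this in place, the LOOP and the exception block in the main trigger function could be dropped entirely.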