2

I am writing a PostgreSQL trigger function that will allow me to do an ETL transformation on a data. I have a table with fields (id, price, EFF_FROM, EFF_TO), and if I only insert (id, price, EFF_FROM), EFF_TO needs to be calculated automatically as the next EFF_FROM of this id (-1 day) if it exists, and "5999-12-31" if it does not.

To update the table, I am using UPDATE ... FROM inside of a trigger function, but it seems to update all EFF_TO values in the table with the last given EFF_FROM value.

The trigger function that I use looks like this:

CREATE OR REPLACE FUNCTION prices_schema.prices_etl() RETURNS TRIGGER AS $$
BEGIN
    UPDATE prices_schema.prices
    SET EFF_TO = subquery.next_eff_from
    FROM (
        SELECT COALESCE(
            LEAD(EFF_FROM, 1)
            OVER (
                ORDER BY EFF_FROM ASC
            ),
            TO_DATE('6000-00-00', 'YYYY-MM-DD')
        ) - 1 AS next_eff_from FROM prices_schema.prices
    ) AS subquery;
    
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE TRIGGER after_insert_prices
AFTER INSERT ON prices_schema.prices
FOR EACH ROW EXECUTE PROCEDURE prices_schema.prices_etl();

Load some test data:

INSERT INTO prices_schema.prices(id, Price, EFF_FROM)
    VALUES(1, 100, '2017-01-12');
    
INSERT INTO prices_schema.prices(id, Price, EFF_FROM)
    VALUES(1, 150, '2017-02-09');
    
INSERT INTO prices_schema.prices(id, Price, EFF_FROM)
    VALUES(1, 125, '2017-01-27');

Execute:

SELECT * FROM prices_schema.prices
    ORDER BY EFF_FROM ASC;

Result:

id price EFF_FROM EFF_TO
1 100 2017-01-12 2017-01-26
1 125 2017-01-27 2017-01-26
1 150 2017-02-09 2017-01-26

EFF_TO obviously doesn't look as expected to. What amazes me more is that calling this query:

SELECT eff_from, COALESCE(
            LEAD(EFF_FROM, 1)
            OVER (
                ORDER BY EFF_FROM ASC
            ),
            TO_DATE('6000-00-00', 'YYYY-MM-DD')
        ) - 1 AS next_eff_from FROM prices_schema.prices;

Gives me the correct result:

EFF_FROM next_eff_from
2017-01-12 2017-01-26
2017-01-27 2017-02-08
2017-02-09 5999-12-31

I bet this has something to do with how triggers & trigger functions work, but I wasn't able to find the right thing. I would like to be pointed out in the right direction.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • Why the odd spelling for your column names? `id` but `EFF_FROM`? Do you use double-quoted identifiers to preserve upper-case letters? Also, please start any such question with your version of Postgres. `SELECT version();` helps. – Erwin Brandstetter May 21 '23 at 22:22

2 Answers2

2

Problems

Your trigger does something else entirely from what you describe.

Firstly, subquery in your UPDATE computes a new value for every single row in the table. Expensive, and unnecessary. We just need the one.

Next, there is no WHERE clause to constrain the UPDATE to the row at hand. So it updates every row in the table - multiple times. That's hugely expensive nonsense.

Multi-row INSERT

A statement-level AFTER trigger using a transition table achieves what you were aiming for: all rows within the same multi-row INSERT are adapted while seeing all other rows of the same bulk insert.

CREATE OR REPLACE FUNCTION prices_schema.prices_set_all_eff_to()
  RETURNS trigger
  LANGUAGE plpgsql AS
$func$
BEGIN
   UPDATE prices_schema.prices p
   SET    eff_to = COALESCE(p1.next_eff_to, 'infinity')
   FROM   news n
   LEFT   JOIN LATERAL (
      SELECT p1.eff_from - 1 AS next_eff_to
      FROM   prices_schema.prices p1
      WHERE  p1.id = n.id
      AND    p1.eff_from > n.eff_from
      ORDER  BY p1.eff_from
      LIMIT  1
      ) p1 ON true
   WHERE  n.eff_to IS NULL
   AND    p.prices_id = n.prices_id;

   RETURN NULL;  -- for statement-level trigger
END
$func$;


CREATE TRIGGER after_insert_prices
AFTER INSERT ON prices_schema.prices
REFERENCING NEW TABLE AS news  -- transition table
FOR EACH STATEMENT
EXECUTE FUNCTION prices_set_all_eff_to();

fiddle

EXECUTE FUNCTION instead of EXECUTE PROCEDURE requires Postgres 11 or later. See:

Note that RETURNING appended to the same INSERT does not show effects of the trigger, yet!

Postgres date/time types allow the special value 'infinity'. I suggest to use that instead of your arbitrary fallback "5999-12-31" (which is not the same as TO_DATE('6000-00-00', 'YYYY-MM-DD').

Note: While this achieves what you ask for, pre-existing rows are not updated to fit in with newly inserted rows.

Only sinlge-row INSERT

CREATE OR REPLACE FUNCTION prices_schema.prices_etl()
  RETURNS trigger
  LANGUAGE plpgsql AS
$func$
BEGIN
   SELECT INTO NEW.eff_to
          p.eff_from - 1
   FROM   prices_schema.prices P
   WHERE  p.id = NEW.id
   AND    p.eff_from > NEW.eff_from
   ORDER  BY p.eff_from
   LIMIT  1;
   
   IF NOT FOUND THEN
      NEW.eff_to = 'infinity';  -- better than arbitrary year 
   END IF;
   
   RETURN NEW;  -- !
END
$func$;


CREATE OR REPLACE TRIGGER before_insert_prices
BEFORE INSERT ON prices_schema.prices
FOR EACH ROW
WHEN (NEW.eff_to IS NULL)            
EXECUTE FUNCTION prices_schema.prices_etl();

fiddle

This simpler BEFORE trigger is very cheap!

It works per row - as indicated by FOR EACH ROW - and only sees rows already inserted earlier (including rows inserted in the same statement). Does not see later rows in a bulk INSERT like the first solution.

The added WHEN clause to the trigger makes sure it only even fires when eff_to IS NULL. Related:

RETURNING appended to the same INSERT already includes all effects of the trigger - as opposed to the above AFTER trigger!

I do not use COALESCE here. If the query finds no row, COALESCE is never executed. That would require additional overhead. See:

Instead, I make use of the special variable FOUND. See:

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • Thank you. I do agree with you saying my trigger is highly inefficient. However, I tried running the one you sent, and it doesn't work as expected either. Let alone inability to process bulk inserts, like you mentioned. Also, the trigger *does* need to process multiple rows, not just one - like on the insertion of price 125 in my test data. – mrseemsgood May 21 '23 at 23:26
  • @mrseemsgood: You question shows single-row inserts. Those are covered by my trigger. I added a solution for multi-row inserts now. But that may not cover all cases. See note above. – Erwin Brandstetter May 22 '23 at 03:57
0

Turns out, I was only 2 lines away from one:

CREATE OR REPLACE FUNCTION prices_schema.prices_etl() RETURNS TRIGGER AS $$
BEGIN
    UPDATE prices_schema.prices p0
    SET eff_to = subquery.next_eff_from
    FROM (
        SELECT id, price, COALESCE(
            LEAD(eff_from, 1)
            OVER (
                PARTITION BY id
                ORDER BY eff_from ASC
            ),
            TO_DATE('6000-00-00', 'YYYY-MM-DD')
        ) - 1 AS next_eff_from FROM prices_schema.prices
    ) AS subquery
    WHERE subquery.id = p0.id AND subquery.price = p0.price;
    
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE TRIGGER after_insert_prices
AFTER INSERT ON prices_schema.prices
FOR EACH ROW EXECUTE PROCEDURE prices_schema.prices_etl();

This now works as intended, even if multiple different id values are inserted. Execute:

INSERT INTO prices_schema.prices(id, Price, EFF_FROM)
    VALUES(1, 100, '2017-01-12');
    
INSERT INTO prices_schema.prices(id, Price, EFF_FROM)
    VALUES(1, 150, '2017-02-09');
    
INSERT INTO prices_schema.prices(id, Price, EFF_FROM)
    VALUES(1, 125, '2017-01-27');
    
INSERT INTO prices_schema.prices(id, Price, EFF_FROM)
    VALUES (2, 900, '2016-05-05');
    
SELECT * FROM prices_schema.prices;

Result:

id price eff_from eff_to
1 100 2017-01-12 2017-01-26
1 150 2017-02-09 5999-12-31
1 125 2017-01-27 2017-02-08
2 900 2016-05-05 5999-12-31

Yes, I know it probably is still heavily unefficient, but it works. It seems to work for bulk inserts, too, but it wasn't my primary focus anyway since I'm inserting data via psycopg2, parsing .json files.

What's more important is that it correctly updates two rows, if two are affected by an insert. Ex.: when a new row with price = 125 is inserted, eff_from fields need to be updated both for this row and for row with price = 100, since we're inserting in between the time intervals.

lemon
  • 14,875
  • 6
  • 18
  • 38