
I'll need to invoke REFRESH MATERIALIZED VIEW on each change to the tables involved, right? I'm surprised not to find much discussion of this on the web.

How should I go about doing this?

I think the top half of the answer here is what I'm looking for: https://stackoverflow.com/a/23963969/168143

Are there any dangers to this? If refreshing the view fails, will the transaction around the invoking UPDATE, INSERT, etc. be rolled back? (This is what I want... I think.)

John Bachir
  • I'm unclear what answer you are looking for which is not already covered in the post you linked to. Is your situation different in some fundamental way? – IMSoP Apr 03 '15 at 18:29
  • 1) The answer did not satisfy the asker of that question. 2) I am asking an at least somewhat different question, so I do not know whether that answer applies perfectly to mine. – John Bachir Apr 03 '15 at 22:58
  • In Oracle you would use ‘ON COMMIT FAST REFRESH’ along with ‘CREATE MATERIALIZED VIEW LOG’. I wonder if there is something similar for PG. – vaughan Apr 20 '21 at 22:34
  • @vaughan there is an extension that tries to mimic that Oracle feature; see https://stackoverflow.com/questions/29437650/how-can-i-ensure-that-a-materialized-view-is-always-up-to-date/75123352#75123352 – buncis Jan 15 '23 at 06:35

3 Answers

I'll need to invoke REFRESH MATERIALIZED VIEW on each change to the tables involved, right?

Yes, PostgreSQL by itself will never call it automatically; you need to trigger the refresh in some way.

How should I go about doing this?

There are many ways to achieve this. Before giving some examples, keep in mind that the REFRESH MATERIALIZED VIEW command locks the view in ACCESS EXCLUSIVE mode, so while it is running you cannot even SELECT from the view.

However, if you are on version 9.4 or newer, you can use the CONCURRENTLY option:

REFRESH MATERIALIZED VIEW CONCURRENTLY my_mv;

This acquires an EXCLUSIVE lock and does not block SELECT queries, but it may have a bigger overhead (it depends on the amount of data changed; if few rows have changed, it might be faster). Note that you still cannot run two REFRESH commands on the same view concurrently.
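
Note that CONCURRENTLY requires at least one UNIQUE index on the materialized view that covers all rows (no expressions, no WHERE clause). A minimal sketch, assuming my_mv has a hypothetical column id that uniquely identifies each row:

-- Required once, before the first concurrent refresh:
CREATE UNIQUE INDEX my_mv_id_idx ON my_mv (id);

REFRESH MATERIALIZED VIEW CONCURRENTLY my_mv;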

Refresh manually

This is an option to consider, especially for data loading or batch updates (e.g. a system that only loads large amounts of data at long intervals). Such jobs commonly end with operations that modify or process the data, so you can simply include a REFRESH at the end of the job.
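
For example, a nightly load could simply end with an explicit refresh (a sketch; the table, file, and view names are placeholders):

-- Do the heavy batch writes first...
BEGIN;
COPY staging_table FROM '/path/to/nightly_load.csv' WITH (FORMAT csv);
INSERT INTO main_table SELECT * FROM staging_table;
COMMIT;

-- ...and refresh the materialized view once, at the end of the job.
REFRESH MATERIALIZED VIEW CONCURRENTLY my_mv;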

Scheduling the REFRESH operation

A simple and widely used option is to use some scheduling system to invoke the refresh. For instance, you could configure it in a cron job:

*/30 * * * * psql -d your_database -c "REFRESH MATERIALIZED VIEW CONCURRENTLY my_mv"

Your materialized view will then be refreshed every 30 minutes.

Considerations

This option is really good, especially with the CONCURRENTLY option, but only if you can accept the data not being 100% up to date all the time. Keep in mind that, with or without CONCURRENTLY, the REFRESH command needs to run the view's entire underlying query, so you should measure how long that query takes before deciding how often to schedule the REFRESH.
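
To get a feel for that cost, you can look up the view's definition and time a manual refresh from psql (a quick sketch; my_mv is the example view):

-- The underlying query of the materialized view:
SELECT definition FROM pg_matviews WHERE matviewname = 'my_mv';

-- Time an actual refresh before choosing the cron interval:
\timing on
REFRESH MATERIALIZED VIEW CONCURRENTLY my_mv;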

Refreshing with a trigger

Another option is to call the REFRESH MATERIALIZED VIEW in a trigger function, like this:

CREATE OR REPLACE FUNCTION tg_refresh_my_mv()
RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
    REFRESH MATERIALIZED VIEW CONCURRENTLY my_mv;
    RETURN NULL;
END;
$$;

Then, on every table whose changes affect the view, you do:

CREATE TRIGGER tg_refresh_my_mv AFTER INSERT OR UPDATE OR DELETE
ON table_name
FOR EACH STATEMENT EXECUTE PROCEDURE tg_refresh_my_mv();
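
As far as I can tell, this also answers the rollback part of the question: the trigger runs in the same transaction as the triggering statement, so if the REFRESH fails, the invoking INSERT/UPDATE/DELETE fails and rolls back with it. A minimal sketch, assuming a hypothetical id column on table_name:

BEGIN;
-- Fires tg_refresh_my_mv after the statement; if the REFRESH inside the
-- trigger raises an error, this INSERT is rolled back along with it.
INSERT INTO table_name (id) VALUES (1);
COMMIT;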

Considerations

It has some critical pitfalls for performance and concurrency:

  1. Any INSERT/UPDATE/DELETE operation will have to execute the view's query (which is probably slow, given that you are considering an MV in the first place);
  2. Even with CONCURRENTLY, one REFRESH still blocks another one, so any INSERT/UPDATE/DELETE on the involved tables will be serialized.

The only situation where I can see this being a good idea is if the changes are really rare.

Refresh using LISTEN/NOTIFY

The problem with the previous option is that it is synchronous and imposes a big overhead on each operation. To ameliorate that, you can use a trigger as before, but one that only issues a NOTIFY:

CREATE OR REPLACE FUNCTION tg_refresh_my_mv()
RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
    NOTIFY refresh_mv, 'my_mv';
    RETURN NULL;
END;
$$;
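
Before building a dedicated listener, you can check that the notifications arrive from a plain psql session (a quick test sketch; refresh_mv is the channel used above):

-- In a separate psql session: subscribe to the channel.
LISTEN refresh_mv;

-- After a write on one of the triggered tables, psql prints something like
--   Asynchronous notification "refresh_mv" with payload "my_mv" received ...
-- the next time it completes a command, e.g. this no-op:
SELECT 1;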

You can then build an application that stays connected and uses LISTEN to know when it needs to call REFRESH. One nice project that you can use to test this is pgsidekick; with it you can LISTEN from a shell script, so you can run the REFRESH as:

pglisten --listen=refresh_mv --print0 | xargs -0 -n1 -I? psql -d your_database -c "REFRESH MATERIALIZED VIEW CONCURRENTLY ?;"

Or use pglater (also part of pgsidekick) to make sure you don't call REFRESH too often. For example, you can use the following trigger to request a REFRESH, but at most once per minute (60 seconds):

CREATE OR REPLACE FUNCTION tg_refresh_my_mv()
RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
    NOTIFY refresh_mv, '60 REFRESH MATERIALIZED VIEW CONCURRENTLY my_mv';
    RETURN NULL;
END;
$$;

It will then never run REFRESH less than 60 seconds apart, and if you NOTIFY many times within that 60-second window, the REFRESH will be triggered only once.

Considerations

Like the cron option, this one is also good only if you can bear slightly stale data, but it has the advantage that the REFRESH is called only when really needed, so you have less overhead and the data is updated closer to when it is needed.

Note: I haven't actually tried all of the code and examples, so if someone finds a mistake or typo, or tries them and they work (or don't), please let me know.

MatheusOl
  • Epic answer! I see that updating on each operation won't be an option for me. I was hoping to use the materialized view as a fancy complicated index, but that doesn't go along with the other expectations a programmer has about db usage. So I guess this is why this "auto update" functionality hasn't been built into the feature. – John Bachir Apr 05 '15 at 17:19
  • "*if few rows have changed, then it might be faster*" - no it won't. Postgres always rebuilds the entire MVIEW; there is no incremental update, so the number of *changed* rows does not affect the refresh time. – Apr 06 '15 at 17:13
  • @a_horse_with_no_name, I think you are mistaken: if you use `CONCURRENTLY` it `DELETE`s old/changed rows and `INSERT`s new/changed rows, but it doesn't touch unchanged rows. See [refresh_by_match_merge function in PG's source code](http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/commands/matview.c;hb=HEAD#l512). In any case it does execute the entire underlying query. – MatheusOl Apr 06 '15 at 18:13
  • @MatheusOl a_horse_with_no_name is correct. The entire materialized view is recalculated when using CONCURRENTLY. Postgres then uses the unique index on the materialized view to find which rows have changed and only updates those rows from the recalculated values. This prevents blocking of the entire view, but it does not speed up calculation time. – sage88 May 01 '17 at 23:13
  • @sage88 I think you just said the same as I did. Notice that I said it *might* be faster; I do agree that it rarely will be. It may be in a few cases where the "write" part is slow, since it also avoids excessive writes. – MatheusOl May 02 '17 at 17:20
  • In fact it is probably slower in most cases, as it needs more work than TRUNCATE+INSERT. Its huge advantage is avoiding the AccessExclusive lock. There is no real "incremental update", but there is a "diff" process that always runs the query and saves the result into a temp table, then inserts new tuples, deletes removed ones and updates changed ones. Did I say it avoids running the whole query? I don't see where I'm wrong. – MatheusOl May 02 '17 at 17:35
  • @JohnBachir, it's not a fancy complicated index; it's a cached view. If your view is expensive but the results are not huge, it's perfect. If your results are huge, a materialized view, as far as I know, doesn't solve anything (I haven't verified completely, but I don't *think* you can add indexes to a materialized view). – Wildcard Oct 17 '17 at 02:29
  • @Wildcard Materialized views do allow indexes, and `CONCURRENTLY` actually requires a `UNIQUE` index: "This option is only allowed if there is at least one UNIQUE index on the materialized view which uses only column names and includes all rows; that is, it must not index on any expressions nor include a WHERE clause." https://www.postgresql.org/docs/current/static/sql-refreshmaterializedview.html#idm46428673274272 – msand Dec 05 '17 at 10:17
  • Building pgsidekick on Ubuntu (16.04) is very smooth. Thank you. – Park JongBum Jan 17 '18 at 02:08
  • Upvoted. Let's say you had a table called user_likes with the uuid of the post, the user id as an integer, and a vote of true for like and false for dislike, where the primary key is a composite of the post uuid and the user's id. If you wanted a materialized view that has aggregate likes and dislikes for each post, will the addition of a new row in this table RECALCULATE the entire view? – PirateApp Feb 24 '20 at 11:52
  • All I wanted/needed to know about Postgres' materialized views in one answer! – Luis Sisamon Nov 11 '20 at 05:53

Now there is a PostgreSQL extension to keep materialized views updated: pg_ivm.

It only computes and applies the incremental changes, rather than recomputing the contents fully as REFRESH MATERIALIZED VIEW does. It has 2 approaches, IMMEDIATE and DEFERRED:

  • For IMMEDIATE, the view is updated in the same transaction in which its base table is modified.

  • For DEFERRED, the view is updated after the transaction is committed.

Version 1.0 was released on 2022-04-28.
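
For reference, with pg_ivm the view is created through the extension's create_immv function instead of CREATE MATERIALIZED VIEW (a sketch; the table and column names are placeholders, and the exact API may differ between pg_ivm versions):

CREATE EXTENSION pg_ivm;

-- Creates an incrementally maintained materialized view; from then on,
-- changes to base_table are propagated to my_imv automatically.
SELECT create_immv('my_imv', 'SELECT id, name FROM base_table');

SELECT * FROM my_imv;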

buncis

Let me point out three things about the pglater tool from MatheusOl's answer above.

  1. The last element of the long_options array should be a "{0, 0, 0, 0}" entry, as noted at https://linux.die.net/man/3/getopt_long: "The last element of the array has to be filled with zeros." So it should read:

    static struct option long_options[] = {
        //......
        {"help", no_argument, NULL, '?'},
        {0, 0, 0, 0}
    };
    
  2. On the malloc/free side, one free() (for char *listen = malloc(...);) is missing. In any case, malloc caused the pglater process to crash on CentOS (but not on Ubuntu; I don't know why). So I recommend using char arrays and assigning the array name to the char pointer (both char * and char **). You may need to force a type conversion (a cast) when you do that pointer assignment.

    char block4[100];
    ...
    password_prompt = block4;
    ...
    char block1[500];
    const char **keywords = (const char **)&block1;
    ...
    char block3[300];
    char *listen = block3;
    sprintf(listen, "listen %s", id);
    PQfreemem(id);
    res = PQexec(db, listen);
    
  3. Use the table below to calculate the timeout, where md is mature_duration, the time elapsed between the latest refresh (lr) time point and the current time:

     when md >= callback_delay (cd)  ==> timeout: 0

     when md + PING_INTERVAL >= cd   ==> timeout: cd - md [= cd - (now - lr)]

     when md + PING_INTERVAL < cd    ==> timeout: PING_INTERVAL (PI)

To implement this algorithm (the 3rd point), you should initialize 'lr' as follows:

res = PQexec(db, command);
latest_refresh = time(0);
if (PQresultStatus(res) == PGRES_COMMAND_OK) {
Park JongBum