
Below is a great function that returns the real row count of every table in a PostgreSQL database. I found it here.

From my local test, it seems that the function returns all results only after it has finished counting all 100 tables.

I am trying to make it more practical. If we could save the result for each table as soon as its count finishes, then we could check the progress of the counting jobs instead of waiting for the end.

I think that if this function could UPDATE a result table immediately after it finishes counting the first table, that would cover my requirement.

Can you let me know how I can write the result into a table as soon as the function finishes counting the first table?

CREATE FUNCTION rowcount_all(schema_name text default 'public')
  RETURNS table(table_name text, cnt bigint) as
$$
declare
 table_name text;
begin
  for table_name in SELECT c.relname FROM pg_class c
    JOIN pg_namespace s ON (c.relnamespace=s.oid)
    WHERE c.relkind = 'r' AND s.nspname=schema_name
    ORDER BY c.relname
  LOOP
    RETURN QUERY EXECUTE format('select cast(%L as text), count(*) from %I.%I',
       table_name, schema_name, table_name);
  END LOOP;
end
$$ language plpgsql;

-- Query

WITH rc(schema_name,tbl) AS (
  select s.n,rowcount_all(s.n) from (values ('schema1'),('schema2')) as s(n)
)
SELECT schema_name,(tbl).* FROM rc;

Updated

I have decided to use a shell script to run the function as a background process. The function will generate a processing log file so that I can check the current progress.
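A minimal sketch of such a wrapper script, assuming a database named `mydb` and the `rowcount_all` function above (the database name and log-file naming are assumptions, not from the original post):

```shell
# Hypothetical wrapper: run the counting function in the background
# and capture psql's output as a progress log.
DB="mydb"                                   # assumption: target database
LOG="rowcount_$(date +%Y%m%d_%H%M).log"     # assumption: log naming scheme
CMD="SELECT * FROM rowcount_all('public');"
echo "would run: psql -d $DB -c \"$CMD\" > $LOG"
# nohup psql -d "$DB" -c "$CMD" > "$LOG" 2>&1 &   # uncomment on a real server
```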

Sigularity

1 Answer


I think your idea is good, but I also don't think it will work "out of the box" in PostgreSQL. I'm by no means an expert on this, but the way MVCC works in PostgreSQL, a function does all of its DML in what can best be understood as temporary space, and only if and when everything works as expected does it all become visible at the end, when the surrounding transaction commits.

This has a lot of advantages, most notably that when someone is updating tables it doesn't prevent others from querying from those same tables.

If this were Oracle, I think you could accomplish this within the stored proc by using commit, but this isn't Oracle. And to be fair, Oracle doesn't allow truncates to be rolled back within a stored proc the way PostgreSQL does, so there are gives and takes.

Again, I'm not the expert, so if I've messed up a detail or two, feel free to correct me.

So, back to the solution. One way you COULD accomplish this is to set up your server as a remote server. Something like this would work:

CREATE SERVER pgprod
FOREIGN DATA WRAPPER dblink_fdw
OPTIONS (dbname 'postgres', host 'localhost', port '5432');
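One detail the snippet above glosses over: the dblink extension must be installed, and dblink_exec against a named server authenticates through a user mapping. A minimal setup might look like this (the role and password are placeholders, not values from the answer):

```sql
CREATE EXTENSION IF NOT EXISTS dblink;

CREATE USER MAPPING FOR CURRENT_USER
SERVER pgprod
OPTIONS (user 'postgres', password 'secret');
```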

Assuming you have a table that stores the tables and counts:

create table table_counts (
   table_name text not null,
   record_count bigint,
   constraint table_counts_pk primary key (table_name)
);

Were it not for your desire to see these results as they occur, something like this would work, for a single schema. It's easy enough to make this all schemas, so this is for illustration:

CREATE or replace FUNCTION rowcount_all(schema_name text)
  returns void as
$$
declare
 rowcount bigint;
 tablename text;
begin
  for tablename in SELECT c.relname FROM pg_class c
    JOIN pg_namespace s ON (c.relnamespace=s.oid)
    WHERE c.relkind = 'r' AND s.nspname=schema_name
    ORDER BY c.relname
  LOOP
    EXECUTE format('select count(*) from %I.%I', schema_name, tablename) into rowcount;
    insert into table_counts values (schema_name || '.' || tablename, rowcount)
    on conflict (table_name) do
    update set record_count = excluded.record_count;
  END LOOP;
end
$$ language plpgsql;

(this presupposes 9.5 or greater -- if not, hand-roll your own upsert).
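For what it's worth, on pre-9.5 servers the hand-rolled upsert inside the loop could be sketched like this (same variables as the function above; not safe against concurrent writers without extra locking):

```sql
update table_counts
   set record_count = rowcount
 where table_name = schema_name || '.' || tablename;
if not found then
  insert into table_counts values (schema_name || '.' || tablename, rowcount);
end if;
```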

However, since you want real-time updates to the table, you could then put that same upsert into a dblink expression:

    perform dblink_exec('pgprod', '
         << your upsert statement here >>
    ');

Of course the formatting of the SQL within the DBlink is now a little extra tricky, but the upside is once you nail it, you can run the function in the background and query the table while it's running to see the dynamic results.
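As a sketch of what that could look like with the upsert from above embedded via format() (server name and table follow the earlier examples; untested):

```sql
perform dblink_exec('pgprod', format(
  'insert into table_counts values (%L, %s)
   on conflict (table_name) do update set record_count = excluded.record_count',
  schema_name || '.' || tablename, rowcount));
```

The %L placeholder quotes the table name as a literal and %s splices in the numeric count, which keeps the nested quoting manageable.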

I'd weigh that against the need to really have the information real-time.

Hambone
  • Wonderful approach. You mean that without DBlink implementation, I can't query table_counts dynamically as it would commit after all counting jobs are finished? – Sigularity Aug 01 '16 at 05:46
  • My external application could query table_counts tables every 10 seconds, so I guess no need to implement DBlink? – Sigularity Aug 01 '16 at 05:54
  • I might be asking about 'autonomous transaction' in the function. I can't query the table while function is working. – Sigularity Aug 01 '16 at 06:29
  • 1
    The only thing the DBlink adds is the notion of an immediate transaction that occurs outside of the transaction that is represented by the function... it's basically a hack to the inability to do mini-commits within the function. I've used it, but I would use it sparingly. Maybe try the function as-is and see if it meets your needs without the dblink? – Hambone Aug 01 '16 at 11:58
  • Yes, without DBlink, I am thinking of running the function as a background process though a shell script, and it generates count results so that I can find the current one. Tons of help. Thank you so much!!! – Sigularity Aug 02 '16 at 00:50