
The following procedure is run on a table with 20M records. It stopped at ~560,000 and I don't know why. I'm still learning DB monitoring/troubleshooting. This is on Aurora Postgres Serverless v2 with 6 ACUs (12 GB memory).

The content is HTML in a `text` column.

create or replace procedure schema_v2_update_href_again()
language plpgsql
as $$
declare
  max_id bigint;
begin

  select max(version_id) 
  from versions
  into max_id;
  
  for id in 0..max_id loop
    update versions
    set content = REGEXP_REPLACE(content, 'href="./US/eng/1/article/1/', 'href="./US/eng/1/article/', 'g')
    where version_id = id;
    
    if id % 10000 = 0 then
      raise notice 'id: %', id;
      commit;
    end if;

  end loop;

end; $$;

Should I be looking at the DB monitoring tools or could this be something that I've missed in the plpgsql?

The goal of the query is to go through each record and replace a url pattern in the html. There is no performance requirement. I'm not trying to be fast or efficient, I just wanted a simple procedure. Is this method naive?

Update #1: schema for versions table

create table versions(
  version_id bigserial primary key,
  group_address entity_address not null default (0, 0),
  part_id bigint not null references parts(part_id),
  created_at timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  content_type smallint not null,
  content text not null,
  root_entity_id bigint,
  root_entity_table table_name
);

create index on versions using hash (content);
create index on versions (part_id);
Michael220
  • The loop is based on the assumption that `max(version_id)` = total # of rows. Have you verified that is the case? – Adrian Klaver Jan 24 '23 at 17:50
  • Is the intent to update the whole table without locking it? Please show us your schema, and indexes. – Schwern Jan 24 '23 at 17:57
  • @AdrianKlaver good question, actually it is a safe assumption. no deletes on the `versions` table. @Schwern schema and indexes added. Yes, we don't want to lock the table, but its ok for many rows to be locked at a time as there are so many and are rarely accessed by users. – Michael220 Jan 24 '23 at 21:00
  • The thing is a `serial` type can get gaps in it without deletes. All you have to do is have failed inserts, as the sequence behind the `serial` will increment regardless. And given the `cache` value of the sequence this may jump by more than one. I would still verify the actual number of rows. – Adrian Klaver Jan 25 '23 at 00:25
  • @AdrianKlaver good point. will do. thx. it wouldn't cause problems though, right? – Michael220 Jan 25 '23 at 00:49
  • Other than doing `Update ... version_id = id` for non-existent `id`s. – Adrian Klaver Jan 25 '23 at 01:02
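
The verification Adrian Klaver suggests can be done with a single query (a sketch, assuming the `versions` table from Update #1):

    -- Compare the row count with the sequence high-water mark;
    -- any difference means version_id has gaps.
    select count(*) as row_count,
           max(version_id) as max_id
    from versions;

If `row_count` is lower than `max_id`, the loop will issue updates for IDs that match no row.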

3 Answers


If the table is large, issuing one update per row like this will be extremely slow.

If the intent is to update the whole table without locking it, you can do this by updating in batches. One update per range of 10,000 IDs.

declare
  id bigint := 1;
  batch_size bigint := 10000;
  max_id bigint;
begin
  select max(version_id) into max_id from versions;

  while id <= max_id loop
    update versions
    set content = REGEXP_REPLACE(content, 'href="./US/eng/1/article/1/', 'href="./US/eng/1/article/', 'g')
    where version_id between id and id + batch_size - 1;

    raise notice 'id: %', id;
    commit;
    id := id + batch_size;
  end loop;

Something like that, the syntax might not be totally correct.

Schwern
  • Hey @Schwern, thank you for the tips, they did indeed make the proc run faster, though it still hit some sort of wall and stopped. It ran until 8,090,000 and simply stopped processing more records without exiting the proc. Monitoring showed CPU and all other usage just drop. I must have some sort of memory thing going on, right? – Michael220 Jan 24 '23 at 20:51
  • @Michael220 That's very strange. There's nothing in that function which would be consuming memory. My only thought is that somehow you have an infinite loop, but I don't see one. What's your max_id? – Schwern Jan 24 '23 at 23:45
  • @Michael220 Something might be holding a lock on rows in that table and blocking the update. You might be deadlocked. See https://stackoverflow.com/questions/26596566/postgres-database-locked-queries-running-forever and related answers for how to check. – Schwern Jan 24 '23 at 23:49
  • thanks @Schwern. current max is 20,828,841. Would I run into issues with indexing the content column and not hosting the db on a large enough instance? When the loop stops I end the session and start it again just fine. It feels like a memory leak. I'm content now with just running the proc in batches to get the job done. Thanks for the link. – Michael220 Jan 24 '23 at 23:56
  • I will say I have checked locks and nothing comes up. Sessions aren't flagged red in pgAdmin. No locks on the table in the lock list. – Michael220 Jan 24 '23 at 23:58
  • and there is currently no other activity in the db – Michael220 Jan 24 '23 at 23:59
  • @Michael220 20 million rows is not a lot. I don't see how this could be a memory issue unless you see the memory usage on your Postgres server spiking. If you've ruled out locks or an infinite loop, I'm out of ideas. Maybe changing so many rows is tripping a poorly configured autovacuum that's locking things? – Schwern Jan 25 '23 at 01:11
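
The lock check discussed above can be run against the standard system views (a sketch; these catalogs exist in any recent Postgres):

    -- Sessions that are currently waiting on a lock
    select pid, state, wait_event_type, wait_event, query
    from pg_stat_activity
    where wait_event_type = 'Lock';

    -- Lock requests that have not been granted, with the relation involved
    select locktype, relation::regclass as relation, mode, pid, granted
    from pg_locks
    where not granted;

If both queries come back empty while the procedure is stalled, a lock conflict is unlikely to be the cause.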

Not sure yet if this is the fix, but I noticed that a triggered procedure that ran after the one I posted would raise a notice hundreds of times per insert. I then remembered that the "Data Output" section of pgAdmin showed an EOF line that didn't make sense to me (I didn't think to mention that, whoops). I now suspect the driver received so many messages back from the call that it quit. Or maybe it happened on the server. I'm too new to Postgres to reason through this part, so maybe someone knows whether there is a limit on notice output when running a procedure.
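
If the flood of notices is indeed the problem, one way to test that theory (an assumption, not a confirmed fix) is to raise the client's message threshold for the session before running the call, so `NOTICE`-level messages are never sent to the client:

    -- Suppress NOTICE-level messages for this session
    set client_min_messages = warning;
    call schema_v2_update_href_again();

If the procedure then runs to completion, the message volume was the culprit.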

Michael220

Do you really need to raise notices?

If not, you could do the job with a basic SQL statement, getting better performance while avoiding errors due to the loop, couldn't you?

update versions
set content = REGEXP_REPLACE(content, 'href="./US/eng/1/article/1/', 'href="./US/eng/1/article/', 'g')
where version_id >= 0

or even embed this statement in a SQL procedure:

create or replace procedure schema_v2_update_href_again()
language sql
as $$
    update versions
    set content = REGEXP_REPLACE(content, 'href="./US/eng/1/article/1/', 'href="./US/eng/1/article/', 'g')
    where version_id >= 0 ;
$$;
Edouard