Only update one row per group of peers

Question

I have got problem with sub-query plan in PostgreSQL. It seems that in my query sub-query is evaluated only once at the beginning of execution. I have got a simple table with name column where I would like to execute a simple UPDATE.

Main task is to get rid of diacritic marks and spaces. However, multiple versions of the same word can exist. For example: 'poleć' with one space at the end and with two spaces or three spaces sometimes. I would like to change only one of them in case more version of this same words exists. When I run this query:

update newtable
set name = translate(trim(regexp_replace(name, '\s+', ' ', 'g')), 'ąćęłńóśźżĄĆĘŁŃÓŚŹŻ', 'acelnoszzACELNOSZZ' )
where translate(trim(regexp_replace(name, '\s+', ' ', 'g')), 'ąćęłńóśźżĄĆĘŁŃÓŚŹŻ', 'acelnoszzACELNOSZZ' )
not in (select sis.name from newtable as sis where sis."name" is not null)

all variants of this same word are changed, even though I try to avoid it in sub-query. Is there a way to re-evaluate a sub-query once per row.

Structure of a table is very simple.

ID	NAME
1	poleć
2	poleć

Erwin Brandstetter · Accepted Answer · 2022-07-31T00:00:18.657

Is there a way to re-evaluate a sub-query once per row.

Yes, you can do that with a subquery expression in the assignment - a "correlated subquery". Like:

UPDATE t SET col = (SELECT col FROM t2);

But that's not going to solve your problem at all, as every re-evaluation still sees the same snapshot of the underlying tables from the start of the UPDATE command. You would have to update one row at a time, which is comparatively expensive.

This does something like you ask for:

UPDATE newtable t
SET    name = u.new_name
FROM  (
   SELECT DISTINCT ON (new_name) *  -- one per group, unchanged first
   FROM  (
      SELECT id, name
           , translate(regexp_replace(trim(name), '\s\s+', ' ', 'g')
                     , 'ąćęłńóśźżĄĆĘŁŃÓŚŹŻ'
                     , 'acelnoszzACELNOSZZ') AS new_name
      FROM   newtable
      ) sub
   ORDER  BY new_name, name <> new_name, id
   ) u
WHERE u.id = t.id
AND   u.name <> u.new_name;  -- only if actual change

db<>fiddle here

The subquery u picks one row from every set that would end up with the same name, using DISTINCT ON. See:

Select first row in each GROUP BY group?

Suspecting a UNIQUE constraint on name, I prioritize rows that don't change at all. That's done by name <> newname, where false sorts before true. See:

Time based priority in Active Record Query

Only the one row per group is updated - and only if it actually changes. This way, name stays UNIQUE - as long as there are no concurrent writes that might interfere.

It's unclear if that's what you actually need, I filled in gaps in the task description with educated guesses.

While being at it, I made the string expression a lot cheaper.

Only update one row per group of peers

1 Answers1