
In my Postgres db, I have a table where one column B has values as follows:

A      B
a1     b0, b1, b1
a2     b2, b3
a3     b4, b4, b5, b5

My expected output should be like this:

A      B
a1     b0, b1
a2     b2, b3
a3     b4, b5

Is there a simple UPDATE in Postgres to do this?

  • What is the type of column `B`? Is it a text field, json, or an array of something? The update function exists, but depends on the type of `B`. – Haleemur Ali May 29 '19 at 01:10
  • You shouldn't be storing comma-separated values in the first place. That is a really bad design. – May 29 '19 at 01:10
  • 1
    Any such question *needs* to provide actual table definition showing data types and constraints and the Postgres version in use. Also, why the redundant format of `b0, b1, b1` instead of just `0, 1, 1`? And why not an array or a normalized DB design? And: fold only consecutive duplicates or all? Preserve order of elements? – Erwin Brandstetter May 29 '19 at 01:24

3 Answers


There are two ways you could achieve this:

An UPDATE statement that unnests the elements and aggregates them back while removing duplicates:

update bad_design
  set b = t.b
from (
  -- split each list, trim the elements, then aggregate
  -- the distinct values back into a single string
  select a, string_agg(distinct trim(x.w), ', ') as b
  from bad_design
    cross join unnest(string_to_array(b, ',')) as x(w)
  group by a
) t
where t.a = bad_design.a;

The other option, if you need to do this often, is to create a function that removes the duplicates and use that in an UPDATE:

create or replace function cleanup(p_list text)
  returns text
as
$$
  -- split the comma-separated list, trim each element,
  -- and aggregate the distinct values back into one string
  select string_agg(distinct trim(t.w), ', ')
  from unnest(string_to_array(p_list, ',')) as t(w);
$$
language sql;

Then you can use that like this:

update bad_design
  set b = cleanup(b);
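
Before running that UPDATE, you can preview what the function would change with a plain SELECT (a quick sanity check, not part of the original answer):

select a, b as old_b, cleanup(b) as new_b
from bad_design
where cleanup(b) is distinct from b;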

The reason this is so complicated is that the data model is not properly normalized. With a properly normalized model, you could simply create a UNIQUE constraint and prevent duplicates from being inserted in the first place.
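
For illustration, a minimal normalized sketch (table and column names are assumptions, not taken from the question):

create table main_table (
  a text primary key
);

create table b_values (
  a       text not null references main_table (a),
  b_value text not null,
  unique (a, b_value)  -- rejects duplicates like ('a1', 'b1') stored twice
);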

2

The specific update statement depends on the type of column b, but there are really only three different ways this data could be stored: as a delimited string, as a text array, or as a JSON array.

The update statement for a comma-separated text field would be:

update mytable
set b = array_to_string(array(select distinct unnest(string_to_array(b, ', '))), ', ');

If b is a text array, then:

update mytable
set b = array(select distinct unnest(b));

If b is a json array, then:

update mytable
set b = array_to_json(array(select distinct value from json_array_elements_text(b)));
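
If the column is jsonb rather than json (an assumption; the question does not say), the same idea works with the jsonb counterparts:

update mytable
set b = to_jsonb(array(select distinct value from jsonb_array_elements_text(b)));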

As you can see, the cleanest statement in this case results from the data being stored as a text array. If you must store an array of values in one column, do it using an array type.
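
For example, a comma-separated text column could be converted in place (a sketch, assuming the ', ' separator used above):

alter table mytable
  alter column b type text[]
  using string_to_array(b, ', ');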

However, I would also recommend normalizing your data.

The statements above will update all rows in the table, thus incurring a higher execution cost. I'll illustrate a way to reduce the number of updates using the text array variant (as that requires the shortest SQL query):

update mytable
set b = array(select distinct unnest(b))
-- only touch rows whose array actually contains duplicates
where array_length(b, 1) != (select count(distinct c) from unnest(b) c);
Haleemur Ali

Assuming: current version of Postgres, data type of B is text, separator in your list is ', ', remove all duplicates, preserve original order of elements, most or all rows have duplicates:

Then this should be simplest and fastest:

UPDATE tbl
SET    B = array_to_string( ARRAY(
         SELECT elem
         FROM   unnest(string_to_array(B, ', ')) WITH ORDINALITY x(elem, ord)
         GROUP  BY 1          -- fold duplicates
         ORDER  BY min(ord)   -- keep order of first occurrence
         ), ', ');

The drawback: all rows are updated, even if nothing changes. If that affects more than a few rows, it can be avoided ...
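
One way to avoid that (a sketch under the same assumptions, not part of the original answer) is a WHERE clause that only matches rows whose list actually contains duplicates:

UPDATE tbl
SET    B = array_to_string( ARRAY(
         SELECT elem
         FROM   unnest(string_to_array(B, ', ')) WITH ORDINALITY x(elem, ord)
         GROUP  BY 1
         ORDER  BY min(ord)
         ), ', ')
WHERE  cardinality(string_to_array(B, ', '))
     > (SELECT count(DISTINCT elem) FROM unnest(string_to_array(B, ', ')) elem);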

If any assumptions do not hold, there may be a different / better / faster solution.


Erwin Brandstetter