FUNCTION vs. PROCEDURE
RETURNING * indicates you want to return a set of rows - SETOF tag, to be precise. The manual:
Procedures do not return a function value; hence CREATE PROCEDURE lacks a RETURNS clause. However, procedures can instead return data to their callers via output parameters.
Output parameters won't cut it. A PROCEDURE is the wrong choice to begin with. Use a FUNCTION instead. See:
Implementation
There are many possible ways to format input data.
There are many possible ways to treat conflicts exactly.
There are many possible ways to return data.
We can make almost anything work. We can even make much of it dynamic and/or generic to work with varying tables / columns. (Think of possible future changes to the table ...) See:
The best solution depends on what exactly you need. I'll demonstrate two implementations.
Implementation 1: Passing two separate Postgres arrays
For just two columns, it may be convenient to pass two separate arrays. Postgres has a dedicated variant of unnest() to unnest arrays in parallel. See:
So:
CREATE OR REPLACE FUNCTION public.f_upsert_tags1(_tag_slugs text[], _tag_names text[])
RETURNS SETOF public.tag
LANGUAGE sql AS
$func$
INSERT INTO public.tag AS t (tag_slug, tag_name)
SELECT *
FROM unnest($1, $2)
ON CONFLICT (tag_slug) DO UPDATE
SET tag_name = EXCLUDED.tag_name
-- WHERE t.tag_name <> EXCLUDED.tag_name -- see below
RETURNING *
$func$;
Call:
SELECT *
FROM public.f_upsert_tags1('{first-tag, second-tag}'
, '{First Tag, Second Tag}');
Returning SETOF public.tag returns complete resulting rows, like your original. It introduces a dependency on the row type of the table, which has pros and cons ...
About passing arrays to a function:
Implementation 2: Passing a JSON array of objects
You mentioned JavaScript, so it may be convenient to pass a JSON array of objects, which we then decompose with json_populate_recordset(). (There are other options like json_to_recordset() ...)
CREATE OR REPLACE FUNCTION public.f_upsert_tags2(_tags json)
RETURNS SETOF public.tag
LANGUAGE sql AS
$func$
INSERT INTO public.tag AS t (tag_slug, tag_name)
SELECT i.tag_slug, i.tag_name
FROM json_populate_recordset(null::public.tag, $1) i
ON CONFLICT (tag_slug) DO UPDATE
SET tag_name = EXCLUDED.tag_name
-- WHERE t.tag_name <> EXCLUDED.tag_name -- see below
RETURNING *
$func$;
Call:
SELECT *
FROM public.f_upsert_tags2('[{"tag_slug":"first-tag", "tag_name":"First Tag2"}
, {"tag_slug":"second-tag", "tag_name":"Second Tag2"}]');
fiddle
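On the client side, building that JSON payload is straightforward. A minimal JavaScript sketch, assuming a node-postgres (pg) client (the `client` variable and its setup are assumptions, not from the question; the function name f_upsert_tags2 is from above):

```javascript
// Hypothetical client-side sketch: build the JSON payload for f_upsert_tags2().
// Each object mirrors the columns of public.tag (tag_slug, tag_name).
const tags = [
  { tag_slug: 'first-tag', tag_name: 'First Tag2' },
  { tag_slug: 'second-tag', tag_name: 'Second Tag2' },
];

// Serialize once; pass the string as a single query parameter.
const payload = JSON.stringify(tags);

// With node-postgres (pg), the call would look like (assumed setup, not run here):
// const { rows } = await client.query(
//   'SELECT * FROM public.f_upsert_tags2($1)', [payload]
// );
```

Passing one serialized parameter avoids any SQL string concatenation and lets the driver handle quoting.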
Concurrency? Performance?
If the function may be called from multiple transactions concurrently (or there are concurrent, competing writes on the same table in any way), there are intricate race conditions.
Either of these solutions (including your original) overwrites conflicting rows even if tag_name does not change, which adds cost while doing nothing useful. That matters if it happens a lot. We can skip such empty updates, but then you may still want a complete set of output rows matching the input.
For either of these issues see:
If you also want to know whether each row was inserted or updated, see:
Final implementation
Building on the solution you added in the comments. For your particular case, this implementation makes sense:
- "many conflicts but few actual updates"
- concurrent writes can actually compete for same rows
- "return value only needs tag_id" for the subsequent write to a through table recording a many-to-many relationship
CREATE FUNCTION upsert_tags(_tags json)
RETURNS TABLE (tag_slug text, tag_id text)
LANGUAGE sql AS
$func$
INSERT INTO tag AS t (tag_slug, tag_name)
SELECT i.tag_slug, i.tag_name
FROM json_populate_recordset(null::public.tag, _tags) i
ON CONFLICT (tag_slug) DO UPDATE
SET tag_name = EXCLUDED.tag_name
WHERE t.tag_name <> EXCLUDED.tag_name; -- omit empty updates
-- New statement to include previously invisible, concurrently written rows
SELECT t.tag_slug, t.tag_id
FROM json_populate_recordset(NULL::public.tag, _tags)
JOIN public.tag t USING (tag_slug); -- JOIN beats IN without need for removing dupes
$func$;
fiddle
I added tag_slug to the result rows to allow linking back.
All of this assumes there are no duplicates within your input. Else you need to do more. Related:
And since there are no duplicates in the input, a plain JOIN performs better than IN in the final SELECT. IN would also try to remove duplicates on the right side ...
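If duplicates by tag_slug can sneak into your input, one option is to deduplicate on the client before calling the function - Postgres rejects an INSERT ... ON CONFLICT DO UPDATE that would affect the same row twice. A hedged JavaScript sketch (the helper name is mine, not from the question):

```javascript
// Deduplicate by tag_slug before sending. Without this, Postgres raises
// "ON CONFLICT DO UPDATE command cannot affect row a second time"
// when the same key appears twice in one INSERT.
function dedupeBySlug(tags) {
  const seen = new Map();
  for (const t of tags) seen.set(t.tag_slug, t); // last entry per slug wins
  return [...seen.values()];
}

const clean = dedupeBySlug([
  { tag_slug: 'a-tag', tag_name: 'A Tag' },
  { tag_slug: 'a-tag', tag_name: 'A Tag (updated)' },
  { tag_slug: 'b-tag', tag_name: 'B Tag' },
]);
// clean holds two entries; 'a-tag' keeps the later name
```

Deduplicating in SQL (e.g. with DISTINCT ON inside the function) is possible too, but doing it before the round trip keeps the function simple.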
Finally, about that subsequent write to a through table: You might integrate that into the same function to optimize performance. Related:
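To illustrate why tag_slug in the result helps with that: a hypothetical JavaScript sketch mapping the returned (tag_slug, tag_id) rows to rows for a through table (the names post_tag and post_id and the sample values are assumptions, not from your schema):

```javascript
// Hypothetical rows as returned by upsert_tags(), plus the post to link.
const returnedRows = [
  { tag_slug: 'first-tag', tag_id: 't1' },
  { tag_slug: 'second-tag', tag_id: 't2' },
];
const postId = 'p9';

// Build (post_id, tag_id) pairs for the many-to-many through table.
const throughRows = returnedRows.map(r => ({ post_id: postId, tag_id: r.tag_id }));
```

Doing this join-back inside the same SQL function instead would save a round trip, as noted above.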