
I am reading through a large list of data and adding it to a PostgreSQL database. The problem is, sometimes there are duplicate values in the data I'm reading in, but a later duplicate can fill in data that was missing earlier. To remedy this, I have added the following to my script, but it is very ugly:

INSERT INTO tab(id,col1,col2,col3,...) VALUES (i,v1,v2,v3,...)
ON CONFLICT (id)
    DO UPDATE 
        SET 
            (col1,col2,col3, ...)=(
                COALESCE(tab.col1, EXCLUDED.col1),
                COALESCE(tab.col2, EXCLUDED.col2),
                COALESCE(tab.col3, EXCLUDED.col3),
                ...
             );

I'm hoping there's a more elegant solution than manually writing out every column in the table. I have a few more tables I need to write these for so I'd prefer if there was a more generic way to do this.

EDIT: I'm pretty new to this, and reading the docs, this may be a dumb way to do this in the first place. Please let me know if I should even be using the INSERT command for this, it looks like maybe just UPDATE or some form of JOIN could accomplish the same thing?

Postgres version: psql (PostgreSQL) 12.14 (Ubuntu 12.14-0ubuntu0.20.04.1)

Zaya

  • No, spelling out the columns that you want to insert/update is good practice and is the way to go. – Bergi Mar 30 '23 at 22:47
  • You lead with *"I am reading through a large list of data"*, but then you show a single-row UPSERT. Shouldn't that be a multi-row command (with many input rows)? Can you consolidate the whole operation into a single command? (Then I would suggest a very different, cheaper approach.) Please always declare your version of Postgres. – Erwin Brandstetter Mar 30 '23 at 23:27
  • @ErwinBrandstetter I have edited the OP to show the version, 12.14. And possibly, but often I have to add to different tables; the data is all mixed together in the bigger file I'm parsing. If it's faster to read through the file |Tables| times and do a single insert for each, I'd be interested in the other approach. – Zaya Mar 31 '23 at 00:18

2 Answers


It's necessary to list every column, but it's not necessary to enter the list manually. The following query is an example of using column information found in information_schema to generate the SET clause:

WITH query_fragments AS (
    SELECT
        string_agg(quote_ident(c.column_name), ', ' ORDER BY c.ordinal_position) AS column_list,
        string_agg(format('COALESCE(tab.%I, excluded.%I)', c.column_name, c.column_name), ', ' ORDER BY c.ordinal_position) AS column_values
    FROM
        information_schema.columns c
    WHERE
        c.table_schema = 'public'
        AND c.table_name = 'tab'
        AND c.column_name <> 'id'
)
SELECT
    format('SET (%s) = (%s)', column_list, column_values) AS set_clause
FROM
    query_fragments;

This could easily be incorporated into a function and expanded to generate the entire insert query for each table.
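For completeness, the same generation step can also be done client-side. Below is a rough Python sketch (the function name `build_upsert` and the hardcoded column list are illustrative; in a real script the columns would come from querying `information_schema.columns` through a driver such as psycopg2, and untrusted identifiers should be quoted, e.g. with `psycopg2.sql.Identifier`):

```python
def build_upsert(table: str, key: str, columns: list[str]) -> str:
    """Build an INSERT ... ON CONFLICT statement whose SET clause
    keeps existing non-null values via COALESCE.

    Identifiers are assumed trusted here; quote them for untrusted input.
    """
    all_cols = [key] + columns
    col_list = ", ".join(all_cols)
    # psycopg2-style named parameter placeholders
    placeholders = ", ".join(f"%({c})s" for c in all_cols)
    set_cols = ", ".join(columns)
    set_vals = ",\n         ".join(
        f"COALESCE({table}.{c}, EXCLUDED.{c})" for c in columns
    )
    return (
        f"INSERT INTO {table} ({col_list}) VALUES ({placeholders})\n"
        f"ON CONFLICT ({key}) DO UPDATE\n"
        f"    SET ({set_cols}) = ({set_vals})"
    )

print(build_upsert("tab", "id", ["col1", "col2"]))
```

Generating the statement once per table at startup and reusing it with parameter bindings avoids rebuilding the string for every row.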

JohnH

Your query looks mostly good. As does John's meta-query to build it.

One major issue remains: don't update rows that don't actually change. That accrues the full update cost for no gain.

INSERT INTO tab AS t
       (id, col1, col2, col3)
VALUES (i , v1  , v2  , v3  )
ON     CONFLICT (id) DO UPDATE 
SET   (col1, col2, col3) =
      (COALESCE(t.col1, EXCLUDED.col1),
       COALESCE(t.col2, EXCLUDED.col2),
       COALESCE(t.col3, EXCLUDED.col3))
WHERE  EXCLUDED IS DISTINCT FROM t;

Better (but more verbose):

...
WHERE (t.col1, t.col2, t.col3) IS DISTINCT FROM 
      (COALESCE(t.col1, EXCLUDED.col1),
       COALESCE(t.col2, EXCLUDED.col2),
       COALESCE(t.col3, EXCLUDED.col3));

The first suggestion only suppresses updates where the input row is exactly the same as the existing row.
The second (better) suggestion suppresses all empty updates.
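To see the difference between the two WHERE variants, here is the merge logic modeled in plain Python (a sketch for intuition only, with `None` standing in for NULL; the helper `merged` is hypothetical):

```python
# Model of COALESCE(t.col, EXCLUDED.col): keep the existing value
# unless it is NULL, then fall back to the incoming one.
def merged(old, new):
    return tuple(o if o is not None else n for o, n in zip(old, new))

old = ("a", None)          # existing row (id omitted)
new = ("b", "filled")      # incoming (EXCLUDED) row

# Variant 1 skips the update only when the rows are identical:
skip_v1 = new == old                  # False: rows differ, so it updates
# Variant 2 skips whenever the merge result leaves the row unchanged:
skip_v2 = merged(old, new) == old     # False: col2 would be filled in

# Case where the incoming row has nothing new to contribute:
old2, new2 = ("a", "b"), ("x", "y")
assert merged(old2, new2) == old2     # update would be a no-op
assert new2 != old2                   # variant 1 would still update;
                                      # only variant 2 skips the write
```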


Superior approach

The suggestions above minimize the number of expensive updates.
Better yet, if you can manage it without adding too much overhead, don't update the same row repeatedly at all: consolidate multiple input rows into a single row before the upsert.

Like:

INSERT INTO tab AS t
      (id, col1     , col2     , col3)
SELECT id, min(col1), min(col2), min(col3)
FROM   my_input_rows
GROUP  BY 1
ON     CONFLICT (id) DO UPDATE 
SET   (col1, col2, col3) =
      (COALESCE(t.col1, EXCLUDED.col1),
       COALESCE(t.col2, EXCLUDED.col2),
       COALESCE(t.col3, EXCLUDED.col3))
WHERE (t.col1, t.col2, t.col3) IS DISTINCT FROM 
      (COALESCE(t.col1, EXCLUDED.col1),
       COALESCE(t.col2, EXCLUDED.col2),
       COALESCE(t.col3, EXCLUDED.col3));

You might first write to a TEMPORARY staging table my_input_rows and take it from there. Or use a VALUES expression like in your initial code, just not directly attached to the INSERT; in that case you may need explicit type casts.

I chose min() because it ignores null input. If each group holds either nulls or a single distinct value per column, that does the job.
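A rough Python model of that consolidation step (the helper `consolidate` is hypothetical; it assumes input tuples of the form (id, v1, v2, …) and, like SQL min(), ignores nulls within each group):

```python
from collections import defaultdict

def consolidate(rows):
    """Collapse multiple input rows per id into one row,
    taking the minimum non-null value per column."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[0]].append(row[1:])
    out = []
    for key, vals in grouped.items():
        merged = []
        for col in zip(*vals):                       # column-wise over the group
            non_null = [v for v in col if v is not None]
            merged.append(min(non_null) if non_null else None)
        out.append((key, *merged))
    return out

rows = [(1, None, "b"), (1, "a", None), (2, None, None)]
print(consolidate(rows))  # -> [(1, 'a', 'b'), (2, None, None)]
```

Doing this consolidation in the database (as in the query above) is normally cheaper, but the same logic applies if you pre-aggregate while parsing the file.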

This can even be done for multiple target tables in a single SQL statement, employing multiple data-modifying CTEs.

Erwin Brandstetter